The security side of getting data AI-ready

Louis De Gouveia, data competency manager at iOCO.

In my previous article, I covered the principles crucial to getting data AI-ready; namely, that data must be diverse, timely, accurate, secure, discoverable and easily consumable by machines. Here I expand on the remaining principles and the all-important issue of security.

Artificial intelligence (AI) systems often use sensitive data − including personally identifiable information, financial records, or proprietary business information − and using this data carries a responsibility to protect it.

Criminals can steal sensitive information, manipulate training data to bias outcomes, or even disrupt entire generative AI (GenAI) systems. Securing data is therefore crucial to protecting privacy, maintaining model integrity and ensuring the responsible development of powerful AI applications.

Because securing data manually is virtually impossible at scale, three tactics can help companies to automate it. Data classification detects, categorises and labels data, and those labels feed the next stage. Data protection defines policies, such as masking, tokenisation and encryption, that conceal the data. Finally, data security defines access control policies that describe who can access the data.

The three concepts work together as follows: first, privacy tiers should be defined and data tagged with a security designation of sensitive, confidential, or restricted. Next, a protection policy needs to be applied to mask restricted data. Finally, an access control policy must be implemented to limit access rights.
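The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the column-name rules, the role names and the clearance table are all hypothetical, and a real deployment would rely on a data governance platform rather than hand-written rules.

```python
import re

# Privacy tiers from the text, ordered from least to most restrictive.
TIERS = ["sensitive", "confidential", "restricted"]

# Hypothetical classification rules: column-name patterns mapped to a tier.
RULES = [
    (re.compile(r"(id_number|passport|card)", re.I), "restricted"),
    (re.compile(r"(salary|email|phone)", re.I), "confidential"),
]

# Hypothetical access policy: the highest tier each role may read.
CLEARANCE = {"analyst": "sensitive", "hr": "confidential", "compliance": "restricted"}


def classify(column: str) -> str:
    """Step 1, classification: tag a column with a privacy tier."""
    for pattern, tier in RULES:
        if pattern.search(column):
            return tier
    return "sensitive"  # default tier when no rule matches


def mask(value: str, visible: int = 4) -> str:
    """Step 2, protection: hide all but the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def read_record(record: dict, role: str) -> dict:
    """Step 3, access control: return only the columns the role may read,
    with restricted values masked."""
    allowed = TIERS.index(CLEARANCE[role])
    result = {}
    for column, value in record.items():
        tier = classify(column)
        if TIERS.index(tier) > allowed:
            continue  # the role is not cleared for this tier
        result[column] = mask(str(value)) if tier == "restricted" else value
    return result


record = {"name": "Thandi", "email": "t@example.com", "card_number": "4111111111111111"}
print(read_record(record, "analyst"))
print(read_record(record, "compliance"))
```

In this sketch the classification labels drive both later stages, which mirrors the way the text describes each tactic feeding the next.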


Next, AI-ready data must be discoverable and readily accessible within the system. Discoverable data unlocks the true potential of machine learning (ML) and GenAI, allowing these workloads to find the information they need to learn, adapt and produce groundbreaking results.

Good metadata practices drive discoverability. Beyond technical metadata, defining business metadata and semantic typing enhances both automated and human understanding. All metadata is then indexed and searchable via a data catalogue.
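As a rough illustration of how technical metadata, business metadata and semantic types might be indexed together, here is a minimal in-memory catalogue. The class names, fields and sample datasets are assumptions made for this sketch; commercial and open-source data catalogues offer far richer models.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str                      # technical metadata: physical table name
    columns: dict                  # technical metadata: column -> data type
    description: str = ""          # business metadata
    owner: str = ""                # business metadata
    semantic_types: dict = field(default_factory=dict)  # column -> semantic type


class Catalogue:
    """Toy catalogue that indexes every metadata field as searchable keywords."""

    def __init__(self):
        self._entries = {}
        self._index = {}  # keyword -> set of dataset names

    @staticmethod
    def _tokens(text: str) -> list:
        return re.findall(r"[a-z0-9]+", text.lower())

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry
        text = " ".join(
            [entry.name, entry.description, entry.owner,
             *entry.columns, *entry.semantic_types.values()]
        )
        for token in self._tokens(text):
            self._index.setdefault(token, set()).add(entry.name)

    def search(self, query: str) -> list:
        """Return the names of datasets matching every keyword in the query."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        hits = set.intersection(*[self._index.get(t, set()) for t in tokens])
        return sorted(hits)


catalogue = Catalogue()
catalogue.register(DatasetEntry(
    name="crm_customers",
    columns={"cust_id": "INTEGER", "cust_email": "TEXT"},
    description="Active retail customers",
    owner="Sales Ops",
    semantic_types={"cust_email": "email_address"},
))
catalogue.register(DatasetEntry(
    name="fin_ledger",
    columns={"acct": "TEXT", "amount": "REAL"},
    description="General ledger postings",
    owner="Finance",
    semantic_types={"amount": "currency_zar"},
))
print(catalogue.search("email address"))
```

The semantic type is what lets a search for "email address" find a column named `cust_email`, which neither a person nor a model could reliably infer from the technical metadata alone.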

Data must be easily consumable by ML or large language models (LLMs). AI initiatives won’t be successful if the data is not in the right format for ML experiments or LLM applications.

The true potential of ML and GenAI applications rests with the ability to readily consume data. Unlike humans who can decipher handwritten notes or navigate messy spreadsheets, these technologies require information to be represented in specific formats.

Making data easily consumable helps unlock the potential of these AI systems, allowing them to ingest information smoothly and translate it into intelligent actions or creative outputs.

Data transformation is regarded as the unsung hero of consumable data for ML. While algorithms like linear regression grab the spotlight, the quality and shape of the data they’re trained on are just as critical.

Moreover, the effort invested in cleaning, organising and making data consumable by ML models reaps significant rewards. Prepared data empowers models to learn effectively, leading to accurate predictions, reliable outputs and, ultimately, the success of the entire ML project.

However, training data formats depend heavily on the underlying ML infrastructure. Traditional ML systems are disk-based, and much of the data scientist workflow focuses on establishing best practices and manual coding procedures for handling large volumes of files.

More recently, lakehouse-based ML systems have used a database-like feature store, and the data scientist workflow has transitioned to SQL as a first-class language. As a result, well-formed, high-quality, tabular data structures are the most consumable and convenient data format for ML systems.
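To make the SQL-first workflow concrete, the sketch below derives a small tabular feature set from raw transactions. It uses Python's built-in sqlite3 module purely as a stand-in for a lakehouse feature store, and the table, columns and features are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the feature store here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id INTEGER, amount REAL, channel TEXT);
INSERT INTO transactions VALUES
  (1, 120.0, 'web'), (1, 80.0, 'store'),
  (2, 40.0, 'web'), (2, 60.0, 'web'), (2, 20.0, 'web');
""")

# One row per customer, one column per feature: the well-formed tabular
# shape that ML training code can consume directly.
FEATURE_SQL = """
SELECT customer_id,
       COUNT(*)    AS txn_count,
       AVG(amount) AS avg_amount,
       AVG(CASE WHEN channel = 'web' THEN 1.0 ELSE 0.0 END) AS web_share
FROM transactions
GROUP BY customer_id
ORDER BY customer_id
"""

features = conn.execute(FEATURE_SQL).fetchall()
for row in features:
    print(row)
```

The feature logic lives in a declarative query rather than in file-handling code, which is the shift the lakehouse approach brings to the data scientist workflow.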

Making data consumable for GenAI

Large language models (LLMs) − like OpenAI’s GPT-4, Anthropic’s Claude and Google AI’s LaMDA and Gemini − have been pre-trained on masses of text data and lie at the heart of GenAI.

OpenAI’s GPT-3 model was estimated to be trained with approximately 45TB of data, exceeding 300 billion tokens. Despite this wealth of inputs, LLMs can’t answer specific questions about your business, because they don’t have access to the company’s data.

The solution is to augment these models with your company’s own information, resulting in more correct, relevant and trustworthy AI applications.

The method for integrating corporate data into an LLM-based application, in a safe and secure way, is called retrieval-augmented generation (RAG).

The technique generally uses text information derived from unstructured, file-based sources, such as presentations, mail archives, text documents, PDFs and transcripts. The text is then split into manageable chunks, and each chunk is converted into a numerical representation that captures its meaning, in a process known as embedding.
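The chunk-and-embed step can be sketched as follows. The chunk sizes are arbitrary, and `toy_embed` is only a placeholder, assumed here so the example runs without external services: it hashes words into a fixed-length vector, whereas a real application would call a trained embedding model.

```python
import hashlib
import math
import re


def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list:
    """Split text into word-based chunks that overlap, so that context
    spanning a chunk boundary is not lost."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final words are already covered
    return chunks


def toy_embed(text: str, dim: int = 64) -> list:
    """Placeholder embedding (NOT a real model): a hashed bag of words,
    normalised to unit length."""
    vector = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


document = " ".join(f"word{i}" for i in range(120))  # stand-in for extracted text
chunks = chunk_text(document)
embeddings = [toy_embed(chunk) for chunk in chunks]
print(len(chunks), len(embeddings[0]))
```

Overlapping chunks are a common design choice rather than a requirement; the right chunk size depends on the embedding model and the documents involved.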

These embeddings are then stored in a vector database such as Chroma, Pinecone or Weaviate. Interestingly, many traditional databases − such as PostgreSQL, Redis and SingleStoreDB − also support vectors. Moreover, cloud platforms like Databricks, Snowflake and Google BigQuery have recently added support for vectors, too.
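Conceptually, a vector database stores each embedding alongside its source text and returns the entries closest to a query vector. The class below is a simplified in-memory stand-in written for illustration; it does not reflect the API of any of the products named above, and the three-dimensional vectors are made up for the example.

```python
import math


class InMemoryVectorStore:
    """Toy vector store: brute-force cosine similarity over all rows."""

    def __init__(self):
        self._rows = []  # (doc_id, vector, text)

    def add(self, doc_id: str, vector: list, text: str) -> None:
        self._rows.append((doc_id, vector, text))

    @staticmethod
    def _cosine(a: list, b: list) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def query(self, vector: list, k: int = 2) -> list:
        """Return the k most similar rows as (score, doc_id, text)."""
        scored = [(self._cosine(vector, v), doc_id, text)
                  for doc_id, v, text in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
store.add("leave-policy", [0.9, 0.1, 0.0], "Employees accrue annual leave monthly.")
store.add("expense-policy", [0.1, 0.9, 0.0], "Expenses need manager approval.")
store.add("it-security", [0.0, 0.2, 0.9], "Passwords rotate every 90 days.")

question_vector = [0.8, 0.2, 0.0]  # assumed embedding of a question about leave
results = store.query(question_vector, k=2)
for score, doc_id, text in results:
    print(doc_id, text)
```

In a retrieval-augmented application, the text of the top results would be added to the prompt sent to the LLM, so the model answers from company data it was never trained on. Production vector databases replace the brute-force scan with approximate nearest-neighbour indexes to stay fast at scale.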

In conclusion, despite the transformative power of ML, plus GenAI’s explosive growth potential, data readiness remains the cornerstone of any successful AI implementation.

The key principles I have discussed for establishing a robust and trusted data foundation combine to help your organisation to unlock the true potential of AI.
