AI Data Pipeline: How to Prepare Your Company Data

One of the most common sentences we hear at the initial client workshop: "We have thousands of documents, databases, a history of service records, customer emails — we definitely have data." Technically, that is true. But when we ask what a single representative record looks like and exactly where it comes from, things start to surface: files stored across three different versions of SharePoint, records duplicated with mutually contradictory values, documents scanned as images with no OCR, Excel spreadsheets with column names meaningful only to whoever created them back in 2018.

Data readiness is, according to multiple studies, the key differentiating factor between AI projects that delivered measurable value and those that ended as abandoned pilots. This article is not about which models to deploy. It is about what needs to happen before you get anywhere near the models.

Why "we have data" ≠ "the data is ready"

Data readiness for AI means more than "data is stored somewhere". A model, whether it is fine-tuned or sits behind a RAG pipeline, needs data in a specific state:

Machine-readable: text extracted from PDFs and images, tables available in structured form
Consistent: the same entities named the same way (the customer "ACME s.r.o." vs. "Acme SR" vs. "ACME Slovakia" are three distinct entities from the model's perspective)
Relevant and clean: free of duplicates, outdated versions and internal "noise" (test records, sandbox data, draft documents)
Correctly annotated or structured: the model must know what the data represents, not merely see raw text
Accessible and manageable: clear ownership, version, validity date, access permissions

The Cisco AI Readiness Index reports that only ~34% of companies rate their data readiness as sufficient for AI deployment. In our experience in the field, that number sounds optimistic: most projects spend the first third of the entire engagement on data preparation, not on the models.

Phase 1: Inventory — what do we actually have

Before any technical decision, existing data sources must be mapped. In practice that means answering these questions:

Where is the data physically?

Databases (PostgreSQL, MySQL, MSSQL, Oracle)
Shared drives and document systems (SharePoint, Google Drive, NAS)
ERP / MES / SCADA systems (SAP, Infor, Siemens WinCC, Ignition)
Emails and communication tools
Paper documentation scanned as images
Machine logs and sensor data (MQTT, OPC-UA, industrial databases such as InfluxDB)

Each source has a different format, a different update frequency, a different degree of structure and different access permissions. A list of all sources is the first artefact of any data pipeline project.

What is the quality?

For each source, estimate:

Percentage completeness of key fields (how many records have all required fields filled in?)
Format consistency (are dates always in the same format? Are categories from a fixed code list or free text?)
Duplication (do records exist that represent the same entity more than once?)
Freshness (when was a record last verified? Do outdated versions exist?)

This audit does not need to be perfect — the goal is to identify critical issues that would block the project, not to achieve one-hundred-percent cleanliness before launch.
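The four estimates above can be computed on an exported sample with a few lines of Python. The following is a minimal sketch, not a complete profiling tool: the function name audit_records, the field names and the sample rows are all hypothetical, and it assumes records arrive as plain dictionaries with one date field in ISO format.

```python
from collections import Counter
from datetime import date, datetime


def is_filled(value):
    """A field counts as filled if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def audit_records(records, required_fields, key_field, date_field,
                  today, max_age_days=365, date_format="%Y-%m-%d"):
    """Return simple quality ratios (0.0 to 1.0) for a list of dict records."""
    total = len(records)
    if total == 0:
        return {"total": 0}

    # Completeness: records in which every required field is filled in.
    complete = sum(all(is_filled(r.get(f)) for f in required_fields) for r in records)

    # Format consistency and freshness, both based on the date field.
    well_formed = stale = 0
    for r in records:
        try:
            parsed = datetime.strptime(str(r.get(date_field)).strip(), date_format).date()
        except ValueError:
            continue  # missing date or a different format
        well_formed += 1
        if (today - parsed).days > max_age_days:
            stale += 1

    # Duplication: records whose normalised key occurs more than once.
    keys = Counter(str(r[key_field]).strip().lower()
                   for r in records if is_filled(r.get(key_field)))
    duplicated = sum(n for n in keys.values() if n > 1)

    return {
        "total": total,
        "complete_ratio": complete / total,
        "date_format_ratio": well_formed / total,
        "duplicate_ratio": duplicated / total,
        "stale_ratio": stale / total,
    }


# Hypothetical sample rows, for illustration only.
sample = [
    {"id": "1", "customer": "ACME s.r.o.", "updated": "2024-03-01"},
    {"id": "2", "customer": "acme s.r.o. ", "updated": "01.03.2024"},
    {"id": "3", "customer": "", "updated": "2019-06-15"},
    {"id": "4", "customer": "Beta a.s.", "updated": "2024-05-20"},
]
report = audit_records(sample, ["id", "customer", "updated"], "customer",
                       "updated", today=date(2024, 6, 1))
print(report)
```

Ratios like these are only a first signal. They show which source deserves a closer manual look, not whether the data is correct.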

Who owns the data?

For each source: who is responsible for the content? Who has authority to change access? Who can explain what a record means? Without clear data ownership, an AI project can get stuck in cross-departmental disputes about who will approve their database being used for training or retrieval.

Phase 2: Extraction and normalisation

Once the sources are known, the technical work follows: getting the data into a unified form that the pipeline can work with.

Extraction from structured sources

From relational databases and ERP systems, extraction is relatively straightforward — SQL queries, API calls, CSV export. The critical points to define are:

1. Which tables / fields are relevant for the given use case (do not extract the entire ERP)
2. What is the update frequency (batch once a day, or streaming changes via CDC, i.e. Change Data Capture)
3. How technical field names map to human-readable labels (column ord_stat_cd → "order status")
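Points 1 and 3 can be handled by one small mapping that doubles as a whitelist of relevant columns. A minimal sketch follows: only ord_stat_cd comes from the example above, while the other column names and the relabel helper are hypothetical.

```python
# Mapping of technical column names to human-readable labels.
# Only "ord_stat_cd" comes from the text; the rest are hypothetical examples.
FIELD_LABELS = {
    "ord_stat_cd": "order status",
    "cust_no": "customer number",
    "dlv_dt": "delivery date",
}


def relabel(row, labels=FIELD_LABELS, keep_unmapped=False):
    """Rename known columns and drop the rest unless keep_unmapped is set."""
    out = {}
    for column, value in row.items():
        if column in labels:
            out[labels[column]] = value
        elif keep_unmapped:
            out[column] = value
    return out


row = {"ord_stat_cd": "SHP", "cust_no": 1042, "int_flag_x": 1}
print(relabel(row))
```

Dropping unmapped columns by default keeps the extract limited to fields someone has consciously decided are relevant.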

Extraction from unstructured sources

Documents, emails, technical drawings and scanned forms are more demanding. The required steps:

OCR for scanned documents: tools such as Tesseract, Azure AI Document Intelligence or Google Document AI can extract text from images and PDFs. The quality of the OCR output directly affects the quality of the model's answers.
Chunking: long documents must be split into smaller segments before vectorisation. Naïve fixed-size chunking (for example every 512 tokens) that cuts mid-sentence is a common cause of poor RAG answer quality. Chunking by logical units (paragraphs, sections, tables) works better.
Metadata extraction: every chunk must carry metadata — where it came from, creation date, freshness, author, and optionally a category. This metadata is used for filtering during retrieval and for displaying the source in the response.
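Chunking by logical units with metadata attached to every chunk can be sketched as follows. This is an illustration under stated assumptions: paragraphs are separated by blank lines, size is measured in characters rather than tokens, and the names Chunk and chunk_by_paragraph as well as the metadata keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_by_paragraph(text, source, created, max_chars=1000, extra=None):
    """Group whole paragraphs into chunks of at most max_chars characters.

    A paragraph is never cut. A single paragraph longer than max_chars
    becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buffer = [], []

    def flush():
        if buffer:
            meta = {"source": source, "created": created, "chunk_index": len(chunks)}
            meta.update(extra or {})
            chunks.append(Chunk("\n\n".join(buffer), meta))
            buffer.clear()

    for paragraph in paragraphs:
        # Length of the buffered paragraphs plus the separators between them.
        current = sum(len(p) for p in buffer) + 2 * len(buffer)
        if buffer and current + len(paragraph) > max_chars:
            flush()
        buffer.append(paragraph)
    flush()
    return chunks


document = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 30
chunks = chunk_by_paragraph(document, source="manual.pdf", created="2024-01-15",
                            max_chars=70, extra={"category": "service"})
for chunk in chunks:
    print(chunk.metadata, len(chunk.text))
```

In a real pipeline the splitter would also respect section headings and tables, and the size limit would be expressed in tokens of the embedding model in use.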

Entity normalisation

The same company in three different spellings is, to a vector model, three distinct entities with similar embeddings. Semantic search may tolerate that, but exact-match search and metadata filtering treat them as unrelated records. Entity resolution, that is, unifying different representations of the same entity, is work that is sometimes done manually (customer code lists), sometimes semi-automatically (fuzzy matching on company names) and sometimes with an LLM (when records are in natural language).
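The semi-automatic variant, fuzzy matching against a customer code list, can be sketched with Python's standard difflib module. The list of legal-form tokens, the 0.85 threshold and the function names are assumptions made for illustration. In practice, matches near the threshold should go to a person for review.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form tokens to ignore when comparing names.
LEGAL_FORMS = {"sro", "as", "sr", "spol", "ltd", "gmbh"}


def normalise_name(name):
    """Lowercase, strip punctuation and drop legal-form tokens."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_FORMS]
    return " ".join(tokens)


def resolve_entity(name, canonical_names, threshold=0.85):
    """Return (canonical name or None, best similarity score)."""
    target = normalise_name(name)
    best_name, best_score = None, 0.0
    for candidate in canonical_names:
        score = SequenceMatcher(None, target, normalise_name(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


code_list = ["ACME s.r.o.", "Beta a.s."]
for raw in ["Acme SR", "Acmee s.r.o.", "ACME Slovakia", "Gamma Trading"]:
    print(raw, "->", resolve_entity(raw, code_list))
```

Note that "ACME Slovakia" stays unresolved in this sketch. Whether it is the same legal entity as "ACME s.r.o." is a business decision, not something string similarity can settle.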

Phase 3: Cleaning and quality validation

Data cleaning is an iterative process. It cannot be done once perfectly — new data arrives, formats change, new error patterns appear.

Concrete steps we see in practice

Removing duplicates: identifying records that represent the same information under different identifiers
Correcting format errors: dates in varying formats, numeric values with the unit written into the field, phone numbers with varying prefix formats
Handling missing values: fill in where possible, explicitly mark where data is absent, and exclude records that are unusable because a required field is missing
Filtering irrelevant data: test records, sandbox environments, internal notes, documents classified as "draft" or "archive"
Freshness verification: for knowledge bases and RAG pipelines, old information is sometimes worse than no information. Price lists from two years ago, technical data sheets for discontinued products — these must either carry an explicit date in the metadata or be excluded.
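Several of these steps can be combined in one pass over the records. The sketch below normalises dates, separates a unit written into a numeric field, marks absent optional values explicitly and rejects duplicates and records without required fields. The accepted date formats, the field names (id, updated, weight) and the helper names are hypothetical.

```python
import re
from datetime import datetime

# Assumed date formats observed in the sources; extend as needed.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def parse_date(value):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_value_and_unit(value):
    """Split strings such as '12,5 kg' into (12.5, 'kg')."""
    match = re.fullmatch(r"\s*(-?\d+(?:[.,]\d+)?)\s*([^\d\s]*)\s*", value)
    if not match:
        return None, None
    number = float(match.group(1).replace(",", "."))
    return number, match.group(2) or None


def clean_records(records, required=("id", "updated")):
    """Return (kept, rejected); duplicates by id are rejected as well."""
    kept, rejected, seen_ids = [], [], set()
    for record in records:
        row = dict(record)
        row["updated"] = parse_date(str(row.get("updated") or ""))
        # Optional field: an absent value is marked explicitly as None.
        row["weight"], row["weight_unit"] = split_value_and_unit(str(row.get("weight") or ""))
        missing = any(row.get(f) in (None, "") for f in required)
        if missing or row.get("id") in seen_ids:
            rejected.append(record)
            continue
        seen_ids.add(row["id"])
        kept.append(row)
    return kept, rejected


raw = [
    {"id": "A1", "updated": "01.03.2024", "weight": "12,5 kg"},
    {"id": "A1", "updated": "2024-03-01", "weight": "12.5kg"},   # duplicate id
    {"id": "A2", "updated": "sometime in spring", "weight": "40"},  # unusable date
    {"id": "A3", "updated": "2023-11-30", "weight": ""},
]
kept, rejected = clean_records(raw)
print(kept)
print(rejected)
```

Keeping the rejected records, rather than silently dropping them, gives the data owner a concrete list to correct at the source.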

Cleaning for fine-tuning dataset preparation adds an additional layer: every training example must have the correct input-output pair format, the output must be correct (not "our best-effort guess"), and the dataset must cover the main domain topics evenly. Gaps in the dataset translate directly into blind spots in the model.
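The format and coverage requirements can be checked mechanically before any training run. The sketch below assumes each example is a dictionary with input, output and topic keys; those key names, the validate_examples function and the sample rows are hypothetical. Whether an output is factually correct still has to be verified by a domain expert, which code like this cannot do.

```python
from collections import Counter


def validate_examples(examples, topics, min_share=0.1):
    """Return (format errors, topics below the minimum share of examples)."""
    errors = []
    for index, example in enumerate(examples):
        for key in ("input", "output"):
            value = example.get(key)
            if not isinstance(value, str) or not value.strip():
                errors.append((index, f"missing or empty '{key}'"))

    counts = Counter(example.get("topic") for example in examples)
    total = len(examples) or 1  # avoid division by zero on an empty dataset
    underrepresented = [t for t in topics if counts[t] / total < min_share]
    return errors, underrepresented


# Hypothetical examples, for illustration only.
examples = [
    {"input": "How do I reset the pump?", "output": "Hold the reset button for 5 s.", "topic": "service"},
    {"input": "What is the warranty period?", "output": "", "topic": "warranty"},
    {"input": "Which oil grade is required?", "output": "See the lubrication table.", "topic": "service"},
]
errors, gaps = validate_examples(examples, ["service", "warranty", "pricing"], min_share=0.2)
print(errors)
print(gaps)
```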

Phase 4: Structure and access permissions

One of the most frequently underestimated aspects of data preparation is who is allowed to see what. In an enterprise environment there are layers of access permissions:

Commercial contracts and pricing visible only to the sales team
HR records visible only to the HR department
Technical documentation that is freely available internally but must not reach clients
Regulatory documentation that must be auditable and cannot be changed without a record

If you deploy a RAG system that has access to the entire knowledge base regardless of these permissions, and an employee from the sales department asks about production costs (which only the production director may see), the system will answer. That is not a feature; it is a security incident.

The correct solution: access permissions must be carried over into the vector database as well. Every document or chunk must have metadata specifying who is allowed to access it. During retrieval, filtering applies not only by relevance but also by whether the calling user has permission. This is a non-trivial architecture, but in deployments that handle regulated or confidential data (GDPR, trade secrets) it is mandatory.
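The principle can be shown with a small in-memory stand-in for a vector database: filter by permission first, rank by similarity second. The two-dimensional vectors, the allowed_groups metadata key and the group names are hypothetical. A production system would typically express the same condition as a metadata filter in the vector database query.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, user_groups, index, top_k=3):
    """Return the most similar chunks the user is allowed to see."""
    groups = set(user_groups)
    # Permission filter comes first: forbidden chunks are never ranked.
    allowed = [c for c in index if groups & set(c["allowed_groups"])]
    ranked = sorted(allowed, key=lambda c: cosine(query_vector, c["vector"]), reverse=True)
    return ranked[:top_k]


# Toy index with hypothetical embeddings and access groups.
index = [
    {"id": "price-list", "vector": [0.9, 0.1], "allowed_groups": ["sales"]},
    {"id": "production-costs", "vector": [0.1, 0.9], "allowed_groups": ["production-management"]},
    {"id": "service-manual", "vector": [0.5, 0.5],
     "allowed_groups": ["sales", "service", "production-management"]},
]

question_about_costs = [0.0, 1.0]
print([c["id"] for c in retrieve(question_about_costs, ["sales"], index)])
print([c["id"] for c in retrieve(question_about_costs, ["production-management"], index)])
```

Filtering before ranking matters: if the filter were applied to the generated answer instead, the restricted text would already have reached the model.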

Phase 5: Pipeline and freshness

An AI data pipeline is not a one-time action — it is a continuous process. Company data changes every day: new orders, new documents, changed prices, updated technical data sheets.

Types of pipeline architecture

Batch pipeline (once daily / weekly): extraction, cleaning and re-indexing of the knowledge base at regular intervals. Suitable for most document-based RAG systems, because the majority of use cases can tolerate data that is up to 24 hours old.
Streaming / near-real-time: changes in the source system propagate to the vector database within minutes. Required for use cases where data becomes stale quickly (ticketing systems, live inventory, trading signals). Requires CDC (Change Data Capture) architecture.
Human-in-the-loop verification: for sensitive data (medical protocols, legal documents, safety manuals) every change to the knowledge base passes through human approval before appearing in the retrieval index.

The choice of architecture depends on the rate of change in source data and the use case's tolerance for staleness. For most B2B enterprise use cases, a batch pipeline is sufficient — simpler to operate, cheaper and adequately fresh.

Data quality monitoring

The pipeline must have alerting for data anomalies: a sudden drop in document count (possible deletion of the source repository), an increase in the proportion of empty fields (a format change in the source system), duplicates above the threshold (deduplication failure). Without monitoring, data problems surface only when the model starts giving bad answers — and it is not always straightforward to trace the cause back to the source.
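The three anomalies named above can be detected by comparing the statistics of the current pipeline run with those of the previous one. In the sketch below the thresholds, the statistic names and the check_run function are assumptions for illustration. Suitable threshold values depend on how volatile each source normally is.

```python
def check_run(previous, current, max_drop=0.2, max_empty_increase=0.05,
              max_duplicate_ratio=0.02):
    """Compare two run summaries and return a list of (code, detail) alerts."""
    alerts = []

    # Sudden drop in document count, e.g. a deleted source repository.
    if previous["documents"] > 0:
        drop = (previous["documents"] - current["documents"]) / previous["documents"]
        if drop > max_drop:
            alerts.append(("document_count_drop", f"{drop:.0%} fewer documents"))

    # Growing share of empty fields, e.g. a format change in the source system.
    increase = current["empty_field_ratio"] - previous["empty_field_ratio"]
    if increase > max_empty_increase:
        alerts.append(("empty_fields_increase", f"+{increase:.0%} empty fields"))

    # Duplicates above the threshold, e.g. a deduplication failure.
    if current["duplicate_ratio"] > max_duplicate_ratio:
        alerts.append(("duplicates_above_threshold",
                       f"{current['duplicate_ratio']:.0%} duplicates"))
    return alerts


previous_run = {"documents": 2000, "empty_field_ratio": 0.03, "duplicate_ratio": 0.01}
current_run = {"documents": 1200, "empty_field_ratio": 0.12, "duplicate_ratio": 0.01}
for code, detail in check_run(previous_run, current_run):
    print(code, "-", detail)
```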

Phase 6: Governance and documentation

This is the part technical teams dislike doing, but without which an AI project will fall apart when scaling.

What must be documented

Data dictionary: for every data source and every relevant field — what it is, in what format, who owns it, when it was last verified
Lineage: where the data came from, what transformations it passed through, where it is now — the ability to trace a specific model output back to a specific source document
Knowledge base change log: every change to the retrieval index must be recorded — what was added, what was changed, what was removed and when
Access permissions and their history: for GDPR compliance and audit purposes, a history of who had access to what and when is required
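A knowledge base change log can be derived by comparing content hashes of the incoming documents with the hashes already stored in the index. This is one possible approach, sketched under the assumption that every document has a stable identifier. The diff_index function and the shape of the log entries are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(text):
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_index(indexed, incoming, now=None):
    """Compare {doc_id: hash} in the index with {doc_id: text} from the sources.

    Returns change log entries for added, changed and removed documents.
    """
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    log = []
    for doc_id, text in incoming.items():
        digest = fingerprint(text)
        if doc_id not in indexed:
            action = "added"
        elif indexed[doc_id] != digest:
            action = "changed"
        else:
            continue  # unchanged, nothing to record or re-index
        log.append({"doc_id": doc_id, "action": action, "hash": digest, "at": timestamp})
    for doc_id in sorted(indexed.keys() - incoming.keys()):
        log.append({"doc_id": doc_id, "action": "removed", "hash": None, "at": timestamp})
    return log


indexed = {"a": fingerprint("v1"), "b": fingerprint("same"), "c": fingerprint("old")}
incoming = {"a": "v2", "b": "same", "d": "new"}
change_log = diff_index(indexed, incoming)
for entry in change_log:
    print(entry["doc_id"], entry["action"])
```

The same diff tells a batch pipeline which documents actually need re-embedding, so the change log and incremental re-indexing can share one mechanism.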

These requirements are not specific to AI — they are data governance standards that mature companies maintain for every critical system. AI merely highlights them because model outputs are directly visible and influence decisions.

Common pitfalls we see in practice

Before concluding, a few recurring patterns:

"Give it the entire SharePoint": volume is not the problem, relevance is. A model indexing 50,000 documents including presentations from 2015, test files and internal meeting minutes will have lower retrieval accuracy than a model indexing 2,000 carefully selected, clean documents. Less and better is almost always superior to more and low-quality.
"We'll fix it later": data quality in AI projects is not fixed "later". If you put a RAG system with poor-quality data into production, users will lose trust in it quickly — and regaining trust is far harder than building it in the first place.
"Our data is sensitive, but we're fine with it": GDPR and trade secret legislation apply to AI systems as well. If your knowledge base contains personal data, contracts with NDAs or confidential business information, you must have a valid legal basis for processing it in an AI pipeline.
"We'll fine-tune on what we have": as described in the article on fine-tuning dataset preparation, insufficient volume or poor domain coverage in training data produces a model that answers incorrectly with high confidence — which is worse than a model that says "I don't know". Data sufficiency is a precondition for fine-tuning, not an optimistic assumption.

Practical steps to get started

If you want to start without getting lost in theory:

1. Choose one use case, not "the whole company". One concrete problem (answering technical customer queries, searching service documentation, generating quotes from a catalogue) has a bounded data scope.
2. Map the data sources for this use case: where they are, what format, who owns them, how fresh they are.
3. Conduct a data audit on a sample: take 100 random records and manually assess quality. Most problems will be visible.
4. Define the minimum quality for production deployment: not "perfect", but "good enough for this use case".
5. Design the pipeline: batch or streaming, access permissions, monitoring.
6. Iterate: the first index will be imperfect. Evaluate RAG answer quality, identify the most frequent causes of bad answers, fix the data or the pipeline, repeat.
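For step 3, drawing the sample with a fixed random seed makes the audit repeatable, so two reviewers look at the same 100 records. A minimal sketch follows; the function name and the seed value are arbitrary choices.

```python
import random


def draw_audit_sample(records, size=100, seed=42):
    """Return a reproducible random sample; small sources are returned whole."""
    if len(records) <= size:
        return list(records)
    return random.Random(seed).sample(records, size)


record_ids = list(range(1000))  # stand-in for exported records
audit_sample = draw_audit_sample(record_ids)
print(len(audit_sample))
```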

Frequently asked questions

How much data do we need for a RAG system?

For a functional RAG system there is no lower limit on document count — even 50 well-prepared documents can deliver excellent results if they cover the relevant domain scope. More critical than quantity is coverage: if users ask questions for which the knowledge base contains no answer, the model will either admit it (a well-configured guardrail) or make one up (a bad state). Inventory the questions the system must be able to answer and verify that they are answerable from the available data.

Can we use internal data for fine-tuning without legal risk?

It depends on the nature of the data. Internal technical manuals, SOP documents and historical records without personal data are typically fine — they belong to the company and contain no PII. Records with customer data, contracts or employee personal data require a legal basis under GDPR (consent, legitimate interest, performance of a contract) even for processing in an AI pipeline. We recommend consulting your DPO before you begin building a training dataset from regulated data.

How long does data preparation take?

In practice we see a range from two weeks (one well-defined source, good internal documentation, cooperative data owners) to six months (data fragmented across three different systems, no existing data governance, internal political obstacles to access). The average project spends 30–40% of its total time on data preparation. If a client claims they can do it in a week, it is either a very small scope or an underestimation of the problem's scale.

Do we need a data engineer or can an AI developer handle it?

For simpler use cases (PDF documents, straightforward SharePoint export) an AI developer can manage basic extraction and chunking. For more complex scenarios — streaming pipelines from ERP systems, CDC architecture, entity normalisation across multiple sources, GDPR-compliant lineage — an experienced data engineer is required. Underestimating this role is one of the most common causes of AI project delays.

Does the data need to be in Slovak, or does English work too?

For a RAG system on company documents, the documents can stay in their original language: multilingual embedding models are used (models such as BGE-M3 cover 100+ languages including Slovak), and retrieval works well in both languages. For fine-tuning, the language of the training data is critical. If you want a model that responds in professional Slovak with company-specific terminology, the training examples must be in Slovak with that terminology.

*If you are working on preparing company data for an AI project and want to avoid unnecessary mistakes, we are happy to walk through a data audit with you and propose a pipeline architecture suited to your use case. Contact us for a no-obligation consultation.*