GDPR and LLM on Company Data: How to Deploy AI Without a Data Leak

A company decides to deploy an LLM over its internal documents. IT sets up an API key, a developer writes a wrapper, and people start feeding the model quotes, contracts, and meeting notes. The first week looks great. Then the DPO walks in and asks one question: "Where does this data end up?" And nobody has an answer.

This scenario is not the exception — it is the rule in many EU companies. GDPR in LLM deployments is not an academic problem. It is a concrete checklist that you either have or you don't. This article breaks it down practically: exactly where data flows, what you must have signed, when a DPIA is required, and how a decision-maker can rely on a structured framework rather than gut feeling.

Where data flows — a map of the pipeline

Before you address anything legal, you need to understand what is happening technically. An LLM in production is not a single system — it is a chain of components, each of which is a potential point of leakage.

A typical flow with a cloud LLM looks like this:

Input document (contract, email, report) → chunking → embedding → vector database
User query → retrieval from vector DB → context + query sent to LLM API
LLM API (e.g. OpenAI, Anthropic, Google) → processes the prompt → returns a response
Logs and traces → observability (Langfuse, LangSmith, custom logging) → possible additional storage

Every one of these steps can contain personally identifiable information (PII): a customer's name in a contract, the email address of a contact person, health information in a report, an IBAN in a payment document. If that data leaves your perimeter, and with a cloud API it does, you are a controller handing personal data to an external processor, and that comes with specific obligations.
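One way to make this map concrete is to write the pipeline down as data and flag every stage that sends content outside your perimeter. The sketch below is illustrative only: the stage names follow the flow above, while the `Stage` class, the `leaves_perimeter` flags, and the example values are assumptions for a hypothetical cloud setup, not a description of any particular provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    may_contain_pii: bool
    leaves_perimeter: bool  # assumption: set per deployment, not a fixed fact


# Hypothetical cloud deployment; adjust the flags to your own architecture.
PIPELINE = [
    Stage("chunking", may_contain_pii=True, leaves_perimeter=False),
    Stage("embedding (hosted API)", may_contain_pii=True, leaves_perimeter=True),
    Stage("vector database", may_contain_pii=True, leaves_perimeter=False),
    Stage("LLM API call", may_contain_pii=True, leaves_perimeter=True),
    Stage("logs and traces (hosted)", may_contain_pii=True, leaves_perimeter=True),
]


def external_pii_stages(pipeline):
    """Return the names of stages where PII may reach a third party."""
    return [s.name for s in pipeline if s.may_contain_pii and s.leaves_perimeter]


if __name__ == "__main__":
    for name in external_pii_stages(PIPELINE):
        print("needs DPA / transfer check:", name)
```

Each stage the function returns is a place where the DPO's question "Where does this data end up?" needs a documented answer.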

Another flow that is often overlooked: training and fine-tuning. Some cloud providers explicitly state that data sent via their API is not used for training (OpenAI at the enterprise tier, Anthropic via the API). But default settings vary — and what is true today may change with a terms update. Always verify the current Terms of Service and lock it down contractually.

Cloud API vs. on-prem — a decision with GDPR implications

This is the most important architectural choice from a compliance perspective.

Cloud API (OpenAI, Anthropic, Gemini, Azure OpenAI...)

The advantages are obvious: the strongest models, zero infrastructure, fast start. The GDPR issues:

Data leaves your perimeter — you are required to have a Data Processing Agreement (DPA) with the provider as a processor
When transferring data outside the EEA (e.g. data on US servers) you need a transfer mechanism — Standard Contractual Clauses (SCC) or an equivalent legal basis
You must know where the servers physically are; the answer "in the cloud" does not satisfy a DPO or a supervisory authority

Most major providers do offer a DPA, but you must actively conclude it, not just tick a checkbox. Azure OpenAI has European regions, which can keep processing inside the EEA and so simplify the transfer question. Check the current terms directly with each provider.

On-prem / self-hosted LLM

Data does not leave your perimeter. If the model runs on your own servers (or on servers in the EEA under a clear contract), the GDPR exposure is considerably smaller. This choice makes the most sense for regulated industries — healthcare, legal, finance — where sensitive personal data is processed. For a deeper look at this topic, see on-prem LLM for regulated industries.

The trade-off is quality and cost: self-hosted open-weight models (Llama, Qwen, Mistral, DeepSeek in their open-weight variants) now reach production quality for most enterprise use cases, but they require GPU infrastructure and engineering capacity to operate. Running a 7B–14B model with reasonable latency requires roughly one server with a GPU carrying sufficient VRAM; larger models require more.
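As a rough sanity check on "sufficient VRAM", the memory needed for the weights alone is the parameter count times the bytes per parameter. The sketch below assumes 2 bytes per parameter for fp16 and about 0.5 bytes for 4-bit quantisation (ignoring quantisation overhead); it deliberately leaves out the KV cache and activations, which add to the total. The function name is hypothetical.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    # 1 billion parameters at 1 byte each is 1 GB, so the units cancel.
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for size in (7, 14):
        for label, width in (("fp16", 2.0), ("4-bit", 0.5)):
            gb = weight_memory_gb(size, width)
            print(f"{size}B at {label}: about {gb:.1f} GB for weights")
```

Treat the result as a lower bound when sizing hardware; real headroom depends on context length and concurrency.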

DPA — what it must contain and why a click is not enough

A Data Processing Agreement is a statutory requirement under Article 28 GDPR whenever you hand personal data to a third party for processing. An LLM provider to whom you send documents containing personal data is a processor. You are the controller. Without a DPA you are in breach of the law.

What a DPA must contain (per Art. 28(3) GDPR):

Subject matter and duration of the processing — specific, not vague
Nature and purpose of the processing — "inference for an internal chatbot" is more specific than "AI services"
Type of personal data and categories of data subjects — customers? employees? patients?
Obligations and rights of the controller — including the right to audit
Technical and organisational measures (TOMs) — encryption, access control, incident response
Prohibition on engaging further sub-processors without consent, or a general authorisation with a lawful mechanism

One detail companies regularly underestimate: the DPA must also cover the provider's sub-processors — cloud infrastructure, logging services, monitoring tools. Larger providers publish lists of sub-processors; review them.

Legal basis — on what ground are you processing

GDPR requires every processing of personal data to have a legal basis (Art. 6). In enterprise LLM deployments the most common ones are:

Legitimate interest (Art. 6(1)(f)) — the most common basis for internal business tools; requires a Legitimate Interest Assessment (LIA) documenting that the company's interest outweighs the rights of the data subjects
Performance of a contract (Art. 6(1)(b)) — relevant where the LLM directly serves the fulfilment of a contract with a customer
Consent (Art. 6(1)(a)) — impractical for most internal tools; consent must be freely given, informed, and revocable

If you process special categories of data under Art. 9 GDPR (for example health, genetic, or biometric data, or data revealing political opinions, religious beliefs, trade-union membership, or sexual orientation), the bar is higher: on top of an Art. 6 legal basis you need one of the conditions from the exhaustive list in Art. 9(2), of which explicit consent is the best known.

Practically speaking: with an LLM running over HR documentation, medical records, or legal filings, you cannot move forward without a lawyer. Otherwise you risk not only a fine but having to halt and dismantle the entire project.

PII scrubbing — when and how

PII scrubbing (cleaning personal data before it is sent to the LLM) is a technical measure that reduces exposure. It is not a substitute for a proper legal basis or DPA, but it significantly narrows the risk surface.

Where it makes sense:

Documents where the LLM does not need to know a specific name — it needs to understand context, structure, and content
Log records — PII should never appear in logs; that is both a legal requirement and sound technical hygiene
Training datasets — if you fine-tune a model on internal data, PII in the training data can be "memorised" by the model (the memorisation problem) and later reproduced

How it works technically:

Regex-based detectors — emails, phone numbers, IBANs, national ID numbers, IP addresses; fast, deterministic, low false-negative rate for structured formats
NER-based detectors (Named Entity Recognition) — person names, companies, addresses; captures unstructured text; requires a model and has a higher error rate
Combination — regex for structured formats + NER for free text; the production standard
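The regex layer described above can be sketched in a few lines. This is a minimal illustration, not a production detector: the three patterns are simplified assumptions (no IBAN checksum validation, no locale-specific phone or national ID formats), and the NER layer for names and addresses is not shown.

```python
import re

# Illustrative patterns only; real deployments need locale-specific rules
# and validation (e.g. IBAN checksums), which this sketch omits.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub(text: str) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Contact jan.novak@example.com, IBAN SK31 1200 0000 1987 4263 7541"
    print(scrub(sample))
```

Running the scrubber before chunking means the vector database, the prompt, and the logs all see the placeholder rather than the original value.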

An important caveat: pseudonymisation is not anonymisation under GDPR. If you replace a data point with a token but a key exists to reconstruct the original, it is still personal data. The EDPB (European Data Protection Board) has taken the position that a model trained on personal data cannot simply be assumed to be anonymous; that has to be demonstrated case by case. Scrubbing reduces risk; it does not eliminate it.
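The difference is easy to see in code. In the sketch below (an illustration restricted to email addresses, with hypothetical function names and a hypothetical `<EMAIL_n>` token format), the text sent to the LLM contains only tokens, yet anyone holding the key table can restore the originals. That reversibility is what keeps the data personal.

```python
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def pseudonymise(text: str):
    """Replace each email with a token; return the new text and the key table."""
    key = {}  # token -> original value; whoever holds this can reverse the mapping

    def _swap(match):
        value = match.group(0)
        for token, original in key.items():
            if original == value:
                return token  # same person, same token
        token = f"<EMAIL_{len(key) + 1}>"
        key[token] = value
        return token

    return EMAIL.sub(_swap, text), key


def reidentify(text: str, key: dict) -> str:
    """Restore the original values, which shows the data was never anonymous."""
    for token, original in key.items():
        text = text.replace(token, original)
    return text
```

If the key table is stored anywhere, it needs the same access control and retention rules as the source documents.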

For implementation inspiration, you can see how this cleaning pipeline layer is addressed in guardrails for AI agents, where input validation and PII detection form the first line of defence before the LLM.

Data minimisation — nothing extra

Data minimisation is one of the core principles of GDPR (Art. 5(1)(c)): process only the personal data that is strictly necessary for the given purpose.

For LLMs, this means in practice:

Only relevant chunks go into the prompt, not entire documents — a RAG pipeline correctly tuned for precise retrieval is also a compliance instrument
Retention policy for logs and traces — how long do you keep conversation history? 30 days? 90 days? Without a defined policy it lasts "forever"
Embedding vectors — even a vector representation of a document can constitute personal data if the original can be reconstructed or a person identified from it; treat embeddings the same as source data
Inference system logs — if you log the full prompt including the user's query, you are logging potentially personal data; introduce selective logging or log scrubbing
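Log scrubbing can be enforced at the logging layer so that individual call sites cannot forget it. The sketch below uses a standard-library `logging.Filter` and redacts only email addresses; the class name, logger name, and pattern are assumptions for illustration, and a real deployment would plug in its full PII detector.

```python
import logging
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


class RedactEmails(logging.Filter):
    """Rewrite each record so email addresses never reach the output."""

    def filter(self, record: logging.LogRecord) -> bool:
        # getMessage() merges msg and args, so PII passed as arguments is covered.
        record.msg = EMAIL.sub("[EMAIL]", record.getMessage())
        record.args = ()
        return True  # keep the record, in redacted form


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RedactEmails())  # on the handler, so every logger using it is covered
    logger = logging.getLogger("llm.inference")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("prompt received from %s", "eva@example.com")
```

Attaching the filter to the handler rather than to a single logger is a deliberate choice here: logger-level filters do not apply to records that propagate up from child loggers.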

DPIA — when it is mandatory and what it must contain

A Data Protection Impact Assessment (DPIA) is mandatory for processing that is "likely to result in a high risk to the rights and freedoms of natural persons" (Art. 35 GDPR). For LLM deployments this typically means:

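Large-scale processing of special categories of data (e.g. health data); financial and HR data are not special categories under Art. 9, but they are often treated as sensitive in risk assessments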
Automated decision-making with a legal or similarly significant effect on individuals
Large-scale monitoring of publicly accessible spaces

Practically: if your LLM helps make decisions about employees, customers, or patients, a DPIA is appropriate. If it is an internal chatbot over technical documentation with no special categories of data, a DPIA is probably not mandatory — but we recommend at minimum an internal risk assessment.
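The three triggers above can be turned into a simple screening aid for intake forms. This is a sketch under stated assumptions: the `UseCase` fields and the function name are hypothetical, the logic only mirrors the list in this section, and the output is a prompt for the DPO, not a legal determination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    # Hypothetical field names; they mirror the three triggers listed above.
    special_categories_at_scale: bool
    automated_decisions_with_significant_effect: bool
    large_scale_public_monitoring: bool


def dpia_triggers(use_case: UseCase) -> list[str]:
    """Return the triggers that apply; an empty list means none of the three matched."""
    checks = [
        (use_case.special_categories_at_scale, "large-scale special-category data"),
        (use_case.automated_decisions_with_significant_effect,
         "automated decisions with significant effect"),
        (use_case.large_scale_public_monitoring, "large-scale public monitoring"),
    ]
    return [label for hit, label in checks if hit]


if __name__ == "__main__":
    hr_screening = UseCase(False, True, False)
    print(dpia_triggers(hr_screening) or "no trigger matched; document a short risk assessment")
```

An empty result does not prove a DPIA is unnecessary; it only means none of these three common triggers was ticked.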

What a DPIA must contain:

Description of the processing operation — what, why, who, where
Assessment of necessity and proportionality — can the objective be achieved with less privacy impact?
Assessment of risks to the rights and freedoms of data subjects
Measures to address the risks — both technical and organisational
If residual risk remains high → consultation with the supervisory authority (e.g. the Slovak Data Protection Authority)

EU AI Act and GDPR — two compliance layers

Since August 2025, obligations for providers of GPAI (General Purpose AI) models under the EU AI Act have been in force. Companies that deploy AI systems ("deployers") should look at Article 26, which sets out the obligations of deployers of high-risk AI systems.

An important point: the EU AI Act and GDPR do not duplicate each other — they complement each other. GDPR governs the protection of personal data. The EU AI Act governs the risks of AI systems — including situations where AI makes or influences decisions about people. If your LLM system helps make decisions relating to HR, credit, access to services, or security classification, it may be classified as a high-risk AI system — and you then have obligations under both regulations simultaneously.

For a closer look at the specific obligations for companies, see the article on EU AI Act and company obligations.

Practical compliance checklist

A summary control list before deploying an LLM over company data:

DPA with every LLM provider — signed, not merely clicked; covers sub-processors as well
Legal basis documented — LIA for legitimate interest, consent where required; stored in writing, not just in someone's head
Data transfers outside the EEA addressed — SCC or equivalent if the provider runs on US servers
PII scrubbing — deployed at minimum for logs; for documents wherever technically feasible
Data minimisation policy — retention periods for logs, conversations, and embeddings defined and enforced
DPIA — completed for high-risk use cases; results documented
Technical and organisational measures — encryption in transit and at rest, access control, incident response plan
Records of processing activities (Art. 30 GDPR) — the LLM system must be included in the Records of Processing Activities
EU AI Act assessment — is your system high-risk? If so, obligations under Art. 26 apply
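To keep the checklist from living only in a slide deck, it can be tracked as data next to the deployment. The sketch below is one possible shape: the keys paraphrase the items above, and the True/False values are made-up example data for a hypothetical deployment.

```python
# Keys paraphrase the checklist above; the values are invented example data.
CHECKLIST = {
    "DPA signed with every LLM provider": True,
    "Legal basis documented": True,
    "Transfers outside the EEA addressed": False,
    "PII scrubbing deployed for logs": True,
    "Retention periods defined and enforced": False,
    "DPIA completed where required": False,
    "Technical and organisational measures in place": True,
    "Included in records of processing activities": False,
    "EU AI Act risk classification done": False,
}


def open_items(checklist: dict[str, bool]) -> list[str]:
    """Return the checklist items that are not yet done, in their original order."""
    return [item for item, done in checklist.items() if not done]


if __name__ == "__main__":
    for item in open_items(CHECKLIST):
        print("OPEN:", item)
```

Keeping this file under version control gives you a dated record of when each item was closed.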

Frequently asked questions

Do we need a lawyer, or can we handle this internally?

It depends on the scope. For a simple internal chatbot over public technical documentation with no personal data, an internal assessment together with the DPO is sufficient. When you process special categories of data, handle transfers outside the EEA, or your system influences decisions about people, an external lawyer specialising in data privacy and AI pays for itself: GDPR fines can reach €20 million or 4% of global annual turnover, whichever is higher.

Can I use anonymised data and avoid GDPR?

If you have achieved true anonymisation, meaning the data cannot be attributed to a natural person even indirectly, GDPR does not apply to it. In practice this is hard to achieve, and LLM pipelines rarely get there; pseudonymisation (replacement with a token that can be reversed) is not sufficient. Consult your DPO on whether your method genuinely meets the anonymisation standard.

Do we have to carry out a DPIA for every LLM project?

Not for every project. A DPIA is mandatory where there is a high risk to the rights of individuals — typically involving special categories of data, automated decision-making with a significant effect on people, or large-scale monitoring. For an internal helpdesk over technical manuals with no personal data, a DPIA is not mandatory. We recommend at minimum a short internal risk assessment and documenting the decision.

What happens if the LLM provider changes its terms and starts training on our data?

That is a real risk. That is exactly why the DPA must contain a prohibition on using your data for training and a right to audit. Monitor changes to the Terms of Service and set up a review process at least once a year. If the provider changes its terms in a way that conflicts with the DPA, the DPA should give you the right to terminate the contract and request deletion of your data.

Is an on-prem LLM automatically GDPR-compliant?

Not automatically. On-prem removes the risk of transferring data to a third-party processor, but you still need a defined legal basis, a data minimisation policy, retention periods, and technical measures. The difference is that you address these issues internally rather than through a DPA with an external provider.

*GDPR and LLM on company data is not a weekend project, but it also cannot be the reason a rollout stalls. Most of the companies we work with primarily need a structured picture of what they have covered and what they don't — not dozens of hours of legal consultations. If you want to walk through this checklist against a specific deployment at your company, we are available for an initial consultation.*
