In regulated industries, most AI deployment conversations end at the same question: "And where will our data live?" When a chief physician, a bank's compliance officer, or a law firm partner asks that question, it is not rhetorical — a wrong answer carries fines, licence revocations, or criminal liability.
This article is not about *whether* to go on-prem — we covered that in the comparison of local LLMs vs. cloud. This is about *how* to build an on-prem LLM infrastructure that holds up under regulatory scrutiny, an IT audit, and daily operations.
Why cloud falls short — even when the provider promises GDPR compliance
Cloud providers have excellent security certifications. The problem is not their technology — it is the legal framework and the data-flow architecture.
When you send a prompt to an external API, data physically leaves your infrastructure. Even if the provider does not persist your requests (and most enterprise tiers now claim they do not), from the perspective of GDPR Article 28 you have entered into a data-processor relationship. That requires a signed Data Processing Agreement (DPA), third-party due diligence, and records of processing activities.
Healthcare organisations must additionally contend with HIPAA in the US, or the local transposition of EDPB guidance in the EU. Banks face the EBA ICT Risk framework and DORA. For law firms the question is even simpler: attorney-client privilege makes no distinction between paper and an API request.
On-prem eliminates data egress risk by design. Not a single token belonging to your patients, clients, or transactions leaves your network. That is not a marketing framing — it is an auditable technical fact.
At the same time, let us be honest: on-prem alone is not enough for compliance. A regulator wants to see audit logs, access controls, encryption at rest and in transit, a documented incident response process, and regular risk assessments. "It runs on our server" is a starting point, not the destination.
What an on-prem LLM architecture must contain
Before choosing a GPU and a model, you need to define what you are actually building. A functional on-prem LLM architecture for regulated industries has five layers:
1. Serving engine and inference layer
For production multi-user deployments, two main frameworks are relevant:
- `vLLM` — the industry standard for high-throughput serving. PagedAttention dramatically reduces KV cache fragmentation; continuous batching eliminates waiting for the slowest request in a batch. The broadest hardware support (NVIDIA, AMD, Gaudi).
- `SGLang` — advantageous for RAG workloads and multi-turn dialogues thanks to RadixAttention, which caches the KV cache of shared prefixes. On prefix-heavy workloads it achieves higher throughput and lower time-to-first-token (TTFT) than vLLM.
For single-developer experiments and pilots, Ollama is fine. For a production system with dozens of concurrent users it is under-powered — underneath it runs llama.cpp, which is not designed for concurrent requests, and the throughput gap becomes significant with multiple parallel requests.
2. Model — what you can afford and what it can do
Model selection for on-prem is primarily a hardware question. Available VRAM determines what you can run.
Indicative VRAM requirements for inference (for formats commonly used on-prem):
- 7–9B model: ~5–7 GB VRAM at Q4_K_M quantisation, ~8–13 GB at Q8_0
- 13B model: ~8 GB at Q4_K_M, ~14 GB at Q8_0
- 34B model: ~17–20 GB at Q4_K_M, ~30–34 GB at Q8_0
- 70B model: ~35–40 GB at Q4_K_M, ~70–75 GB at Q8_0
Add to that the KV cache — with long contexts it can be as large as the model weights themselves. For a production deployment with multiple concurrent requests and medium-length contexts, budget significant headroom above the VRAM needed for the weights alone.
Open-weight models that make sense for regulated industries in 2026:
- Llama 4 Maverick and Scout (Meta, custom licence) — MoE architecture, strong performance, 1M+ context. The Meta custom licence is sufficient for most enterprise internal deployments.
- Qwen 3 family (Alibaba, Apache 2.0) — excellent performance on document-heavy tasks, multilingual support including European languages, permissive licence.
- Mistral Small (Apache 2.0) — European provider (a plus for GDPR argumentation), permissive licence. The larger Mistral Large carries its own Mistral licence — verify it before any commercial on-prem deployment.
- Phi-4 (Microsoft, MIT) — for use cases where 7–14B parameter capacity is sufficient; low hardware requirements, good instruction following.
For regulated industries we recommend models with a permissive licence (Apache 2.0, MIT) — commercial use is unambiguous and the licence audit is straightforward.
3. Quantisation — a trade-off that is usually acceptable
Quantisation reduces the VRAM footprint and increases throughput at the cost of slightly lower accuracy. For regulated industries the key question is: what trade-off is acceptable for the specific task?
A practical overview of formats:
- Q8_0 (GGUF): retains ~98–99 % of quality compared with FP16, minimal loss. For critical tasks (legal analysis, medical documentation) this is the safe choice.
- Q4_K_M (GGUF): ~92–95 % quality, significantly lower VRAM requirements. The sweet spot for most documentation and RAG use cases. The difference versus Q8 is hard to notice in ordinary conversation.
- AWQ 4-bit: suitable for NVIDIA GPUs, better output coherence on long generations than GPTQ.
- Q2 and below: significant quality degradation — not recommended for regulated industries.
An important note: perplexity differences between Q4_K_M and BF16 are below 6 % on most benchmarks. That does not mean every task is equally robust — complex multi-step reasoning and precise structured-information extraction may be more sensitive. Always validate the model on a sample of real data from your domain before production deployment.
4. Data layer and RAG
For most regulated use cases the model alone is not enough — you need to connect it to internal documentation, regulations, and case history. This is where RAG (Retrieval-Augmented Generation) comes in.
Key components:
- Locally deployed vector database:
Qdrant(open-source, Rust backend, GDPR-friendly European company),pgvector(a PostgreSQL extension, straightforward if you already run PG), orMilvus. - Local embedding model:
BGE-M3(BAAI) covers multiple European languages and retrieval types in a single model. Runs locally — no cloud. - Chunking and metadata: for medical records or legal documents, structured chunking by logical units (article, paragraph, case) is significantly better than naive splitting by N tokens.
The context window of modern models (1M+ tokens) is tempting, but it is not a replacement for RAG in a production system. KV cache for a 1M context consumes tens of additional GB of VRAM, and TTFT latency grows dramatically. For most documentation use cases, a hybrid approach (retrieval + shorter context) is better both economically and in terms of performance.
5. Audit, access controls, and monitoring
This is the layer technical teams most often defer — and the one regulators scrutinise most closely.
Minimum requirements for a regulated on-prem LLM:
- Audit log for every request: who asked, when, what the prompt was (or its hash), what the output was (or its hash), which model version responded. Logs must be tamper-evident (write-once storage or signing).
- Role-based access: a physician sees their own patients' records, not the entire hospital. The LLM endpoint must respect the same ACL rules as the rest of the system.
- Encryption at rest and in transit: model weights, vector database, logs — everything encrypted. TLS for all internal communication.
- Network isolation: the LLM inference server should not have direct internet access. Air-gap or a minimal egress firewall for the serving node.
- Model version pinning: in regulated industries you must be able to state which model version made a decision — even a year later. Weight versioning and deterministic reproducibility are audit requirements.
Hardware — what you actually need
On-prem LLM is not a cheap solution. It is an investment that makes sense where the alternative — cloud compliance overhead, risk-of-breach insurance, regulatory fines — is more expensive.
For reference in 2026:
- Entry level (7–13B model, 1–5 concurrent users): a single NVIDIA RTX 4090 (24 GB VRAM) or A4000 (16 GB VRAM). Sufficient for a 13B model at Q4_K_M; for 13B at Q8_0 you need either dual-GPU or a 4090.
- Mid tier (34B model or 70B at Q4_K_M, 5–20 concurrent users): two A5000s (24 GB × 2 = 48 GB), an A6000 (48 GB), or the consumer route — two RTX 4090s in tensor parallelism over NVLink/PCIe.
- Production tier (70B at Q8_0 or larger, 20+ concurrent users): A100 80 GB or H100 80 GB. A single H100 comfortably serves a 70B Q8_0 model with reasonable latency.
- Apple Silicon alternative: M4 Ultra / M5 Ultra with 128–192 GB unified memory is a viable on-prem option for 70B FP16 where a quiet server room and low power consumption are priorities. Throughput is lower than an H100, but for an internal deployment with low concurrency it can be sufficient.
Do not forget CPU memory — with CPU offloading (when GPU VRAM is insufficient) part of the model runs in RAM. For a production deployment with offloading you need at least 128 GB of RAM.
What on-prem LLM is not
Probably the most common misconception in the decision-making process: on-prem LLM does not automatically mean compliance. We have seen organisations install Ollama on a workstation and confidently declare themselves GDPR compliant because "the AI runs locally".
Compliance is a process, not an installation state. To on-prem infrastructure you must add:
- A formal risk assessment and DPIA (Data Protection Impact Assessment) if you process sensitive personal data
- Records of processing activities that include the LLM system
- Retention rules — how long audit logs are kept, who has access
- An incident response plan — if a security breach occurs, what happens to the logs, who notifies the regulator
- Regular penetration testing of inference endpoints
Technical teams that tackle this alone, without legal input, typically build a system that works technically but fails a compliance audit on process documentation.
Comparison: on-prem vs. sovereign cloud vs. conventional cloud
For regulated industries there are actually three options, not two:
- Conventional cloud API (OpenAI, Anthropic, Google): lowest operational overhead, highest model capability, but data egress is real. Suitable for use cases that do not involve sensitive PII or data covered by sector-specific regulations.
- Sovereign cloud / EU region (Azure OpenAI EU, Anthropic Sovereign EU, OVH AI): data stays in the EU, the provider is bound by EU contracts, pricing is higher. For many organisations this is a better compromise than full on-prem — lower hardware investment, higher model capability, while preserving the GDPR framework.
- Full on-prem / air-gap: zero data egress, full control, auditability in the strictest sense. Requires a hardware investment, in-house operations, an in-house security stack. The only option for the most stringent regulations (for example, processors of classified information, certain categories of healthcare data).
For most SK/EU companies in regulated industries, sovereign cloud combined with selective on-prem for the most sensitive workloads is the pragmatic path. Not every LLM task needs to run on-prem — only those where the data demands it.
Guardrails and model safety
On-prem deployment addresses external data egress, but not internal risks. The model can hallucinate, produce misleading medical or legal content, or be exploited via prompt injection.
For regulated industries the following are essential:
- Output validation: LLM output should pass through a validation layer before being displayed or processed further. For structured outputs (data extraction from documents, classification) use constrained decoding (
XGrammarbackend invLLMorSGLang). - Human-in-the-loop for critical decisions: no on-prem model should automatically sign off on medical recommendations, approve loans, or generate legally binding documents without human review. More on this in human-in-the-loop for agents.
- Output monitoring: tracking refusals, unusual patterns in prompts, attempts to extract the system prompt or context.
Frequently asked questions
Is on-prem LLM always more expensive than cloud?
At low request volumes (up to a few thousand requests per day), a cloud API is cheaper — you do not need to invest in GPU hardware. At high volumes the curves cross: a dedicated GPU server typically amortises within 1–2 years at moderate load. For regulated industries, however, cost is not the primary driver — the question is what your organisation can afford from a compliance standpoint.
How much GPU VRAM do I need for typical enterprise use?
For most enterprise use cases (document analysis, internal copilot, classification) a 7–13B model at Q4_K_M quantisation is sufficient. An NVIDIA RTX 4090 (24 GB) or A5000 (24 GB) covers that. If you need a larger model (34B or 70B) for demanding legal or medical analysis, budget for dual-GPU or a professional card with 48–80 GB VRAM.
Do I need ISO 27001 or another certification for on-prem LLM to be legally compliant?
Not directly — neither GDPR nor sector-specific regulations mandate specific certifications, but they do require "appropriate technical and organisational measures". ISO 27001 is a framework that demonstrates systematic risk management — it significantly simplifies a compliance audit and is increasingly required by business partners.
Can I use an open-weight model commercially without legal risk?
It depends on the licence. Apache 2.0 and MIT are fully commercial with no restrictions. The Meta Llama licence permits commercial use, but requires a special agreement when active users exceed 700 million — not relevant for enterprise internal deployments. Always check the current licence text when selecting a model.
How do I ensure the model does not retain or transmit company data?
In a local deployment the model itself does not persist data — an LLM is a static set of weights, not a database. The risk lies in the peripheral layers: logs from the serving framework, KV cache written to disk (if enabled), or the context window shared across sessions due to misconfiguration. Ensure the serving engine is configured without cross-session context sharing, logs are either disabled or encrypted, and KV cache offload to disk is either disabled or stored on an encrypted volume.
*If you are considering on-prem LLM for healthcare, finance, or legal services and are not sure where to start — we are happy to look at your situation in concrete terms. We help with hardware selection, serving-stack architecture, and what you will need to show your compliance team. Contact us and we will begin with an assessment of your actual requirements.*
