A model will confidently answer a question it cannot correctly answer. It invents a citation, fills in a missing number, confirms an assumption embedded in the question — and the whole response sounds convincing. We call this a hallucination: the generation of plausible but factually incorrect output. In experimental deployments it is merely annoying. In production — when working with documentation, contracts, technical standards, or customer requirements — it causes real damage.
In practice, the baseline hallucination rate of frontier models runs in the single to low double-digit percentage range for specialist questions where the model lacks sufficient grounding. The goal of this article is not to promise a hallucination-free system — no such thing exists. The goal is to show five techniques that can significantly reduce the problem in production, and to explain why each one works differently and why none of them is sufficient on its own.
Why hallucinations never disappear entirely
Before we get to the techniques, it is worth understanding the root cause. LLMs do not look up facts — they generate tokens based on what is statistically likely in a given context. A model does not distinguish between "I know" and "I don't know": unless it is explicitly trained or instructed to express uncertainty, it defaults to generating a fluent and plausible response.
Model size alone does not solve the problem — a larger model can hallucinate more confidently, because its language capabilities are better and its outputs sound more persuasive. Without a grounding strategy, scaling the model does not reduce risk; in some cases it actually increases it.
The consequence for production deployments: hallucinations must be managed architecturally, not simply by choosing a better model.
Technique 1: Grounding via RAG with citable sources
RAG (*Retrieval-Augmented Generation*) is today the most widely used mechanism for reducing hallucinations on factual questions. The principle is straightforward: instead of the model answering from parametric memory (what it learned during training), it receives relevant passages from verified documents in its context. The answer must then be grounded in that context.
Simply pasting documents into the prompt is not enough. A quality grounding system has three layers:
- Retrieval: vector search (e.g. via Qdrant or pgvector) supplemented by hybrid search (BM25 + vectors), so that not only semantically similar but also keyword-relevant passages are retrieved
- Reranking: a cross-encoder model re-scores the ranking of results; without reranking, retrieval returns both relevant and less relevant chunks; a reranker significantly improves selection precision
- Citability: the model is instructed that every claim must be backed by a specific source with a reference to the document and page; if no source exists, it does not invent one
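One common way to merge a keyword ranking with a vector ranking in the retrieval layer is reciprocal rank fusion (RRF). The sketch below is a minimal illustration under stated assumptions, not the retrieval code of any particular system: the document IDs, the two input rankings, and the constant k=60 are all made up for the example. In a real pipeline the rankings would come from BM25 and from a vector store, and a cross-encoder reranker would then re-score the fused top results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two retrievers (best match first).
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear high in both lists (here doc_a and doc_c) end up ahead of documents that only one retriever found, which is the behavior hybrid search is meant to provide.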
Citability is critical not only for accuracy but also for verifiability: the user sees where the information comes from and can check it. In practice this significantly increases trust and also surfaces cases where the model ignored or fabricated a citation. You can read more about evaluating a RAG pipeline in the article How to evaluate RAG (RAGAS), where we cover the faithfulness and answer relevance metrics.
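Part of the citability requirement can be checked mechanically. The following sketch assumes a citation format of bracketed numbers such as `[1]` placed before the sentence-ending punctuation, and a naive sentence splitter; both are assumptions for illustration. It only verifies that citations are present and point at a retrieved source, not that the source actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer, num_sources):
    """Return (uncited_sentences, invalid_citation_numbers)."""
    # Naive split: sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited = {int(n) for n in CITATION.findall(answer)}
    # A citation is invalid if it points outside the retrieved sources 1..num_sources.
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    return uncited, invalid


# Hypothetical model answer; only two sources were retrieved.
answer = (
    "The maximum torque is 40 Nm [1]. "
    "The part must be replaced every year. "
    "Calibration is covered in section 7 [3]."
)
uncited, invalid = check_citations(answer, num_sources=2)
print("Uncited sentences:", uncited)
print("Citations to non-existent sources:", invalid)
```

A response that fails either check can be retried or routed to review before the user sees it.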
An important caveat: RAG does not solve everything. If retrieval returns the wrong chunks (embedding mismatch, poor chunking, stale database), the model can hallucinate even with context — or, worse, hallucinate with a false citation. RAG is a necessary, not a sufficient, condition.
Technique 2: Structured output and programmatic validation
Free text is hard to verify. If the model returns a response in a precisely defined structure — a JSON schema, a Pydantic model, an enumeration of allowed values — we can validate the output programmatically before it reaches the user or a downstream system.
Modern models and frameworks support structured outputs: the model is constrained to generate tokens that conform to the supplied schema. The model therefore cannot invent a field the schema does not contain. If the allowed field is risk_level with values ["low", "medium", "high"], the model cannot return "critical" or nonsense. The schema constrains the form of the output, not its truth: an allowed value can still be the wrong one, which is why validation is only one layer.
In practice we combine three steps:
1. Defining the output schema (e.g. a JSON schema or a Pydantic model)
2. Using structured outputs / JSON mode in the API call
3. Programmatic validation of the output: if the output does not conform to the schema, the call is retried or escalated for human review
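The validate-then-retry step can be sketched with the standard library alone. This is an illustrative sketch, not a definitive implementation: the `summary` field, the `call_model` callable, and the retry limit of three are assumptions, and in practice a Pydantic model or a JSON Schema validator would usually replace the hand-written checks.

```python
import json

ALLOWED_RISK_LEVELS = {"low", "medium", "high"}
REQUIRED_FIELDS = {"risk_level": str, "summary": str}  # "summary" is hypothetical


def validate_output(raw):
    """Parse model output and return the dict, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    unexpected = set(data) - set(REQUIRED_FIELDS)
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"missing or mistyped field: {name}")
    if data["risk_level"] not in ALLOWED_RISK_LEVELS:
        raise ValueError(f"risk_level not allowed: {data['risk_level']!r}")
    return data


def generate_validated(call_model, prompt, max_attempts=3):
    """Retry until the output validates; return None to signal escalation."""
    for _ in range(max_attempts):
        try:
            return validate_output(call_model(prompt))
        except ValueError:
            continue
    return None  # the caller routes this case to human review


# Stub standing in for a real API call: invalid once, then valid.
_responses = iter([
    '{"risk_level": "critical", "summary": "x"}',
    '{"risk_level": "high", "summary": "Seal is worn."}',
])
result = generate_validated(lambda prompt: next(_responses), "Assess the report.")
print(result)
```

Returning `None` instead of raising keeps the escalation decision with the caller, which is where a human-review queue would typically be wired in.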
This approach works best on bounded tasks: entity extraction, classification, form filling, converting a document into a structured format. For free text (summaries, interpretation) its applicability is more limited, but even there we can validate at least the presence of required sections (e.g. a "Sources" or "Warnings" section).
Technique 3: Allowing "I don't know" — uncertainty calibration
One of the simplest and most effective steps is to explicitly allow the model to respond with "I don't know" or "I cannot determine that from the available information." This sounds trivial, but most production prompts neglect it — and the model then defaults to generating an answer even when it lacks the information.
Specifically, in the system prompt we recommend phrasing along these lines:
- If the answer does not follow from the supplied documents, explicitly state that the information is not available.
- Do not infer or derive facts that are not present in the context.
- If uncertain, express the degree of uncertainty in words ("probably", "according to the available information").
It is important that uncertainty calibration be an active instruction, not merely the absence of a prohibition against hallucinating. "Do not hallucinate" is not enough — the model understands the instruction but has no mechanism to act on it without an explicit framework for expressing uncertainty.
In regulated domains (legal documents, technical documentation, safety standards) we also recommend defining an escalation rule: if the model cannot answer with sufficient confidence, the response should include a prompt to verify with a qualified person. This directly reduces the risk of silent failure — the situation where the system answers with confidence and the error goes undetected.
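The instructions and the escalation rule can be wired together as in the sketch below. The sentinel `INFORMATION_NOT_AVAILABLE`, the wording of the escalation note, and the function name are hypothetical choices made for illustration; the instruction text follows the phrasing recommended above.

```python
ABSTAIN_MARKER = "INFORMATION_NOT_AVAILABLE"  # hypothetical sentinel string

SYSTEM_PROMPT = f"""\
Answer only from the supplied documents.
- If the answer does not follow from the supplied documents, start your reply
  with {ABSTAIN_MARKER} and state that the information is not available.
- Do not infer or derive facts that are not present in the context.
- If uncertain, express the degree of uncertainty in words
  ("probably", "according to the available information").
"""

ESCALATION_NOTE = "Please verify this with a qualified person."


def apply_escalation_rule(response):
    """Append the verification prompt when the model abstained."""
    if ABSTAIN_MARKER in response:
        cleaned = response.replace(ABSTAIN_MARKER, "").strip()
        return f"{cleaned} {ESCALATION_NOTE}".strip()
    return response


# Hypothetical model reply that used the sentinel.
escalated = apply_escalation_rule(
    "INFORMATION_NOT_AVAILABLE The documents do not state the inspection interval."
)
print(escalated)
```

A machine-readable sentinel makes abstentions countable, so the share of "I don't know" responses can be tracked alongside other quality metrics.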
Technique 4: LLM-as-Judge — automated output verification
LLM-as-Judge is a technique in which a second language model evaluates the output of the first. In production it is used for automated hallucination detection, consistency checking against the context, and response quality scoring — without requiring a human reviewer on every response.
A typical flow looks like this:
1. The primary model generates a response
2. The verification model receives a triple: question + source context + generated response
3. The verification model evaluates: does the response contain claims that are not supported by the context? Is the response factually consistent with the sources?
4. Based on the score, the response is either delivered, blocked, or sent for human review
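The routing decision in the last step can be sketched as follows. The `Verdict` type, the 0-to-1 `support_score`, and both thresholds are assumptions for illustration; the `judge` argument stands in for a call to the verification model and is stubbed here so the sketch runs without an API.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    support_score: float  # assumed scale: 0.0 = unsupported, 1.0 = fully supported


def route_response(question, context, response, judge,
                   deliver_at=0.9, block_below=0.5):
    """Return 'deliver', 'block', or 'human_review' based on the judge's score."""
    verdict = judge(question, context, response)
    if verdict.support_score >= deliver_at:
        return "deliver"
    if verdict.support_score < block_below:
        return "block"
    return "human_review"  # the uncertain middle band goes to a person


def stub_judge(question, context, response):
    """Placeholder for the verification model; returns a fixed score."""
    return Verdict(support_score=0.7)


decision = route_response(
    "How often must the pump be inspected?",
    "[1] The pump must be inspected every 6 months.",
    "Probably every 6 to 12 months.",
    judge=stub_judge,
)
print(decision)
```

Keeping a middle band between the two thresholds avoids forcing the verifier into a binary decision on borderline cases.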
For the verification model we recommend using a model of equal or greater capability than the primary — a weaker verifier cannot reliably detect errors made by a stronger one. In practice we see good results from combining a locally running primary model with a frontier verifier (e.g. Claude Sonnet or GPT-4o class) for critical responses.
Frameworks such as Langfuse or Arize Phoenix allow this flow to be embedded in a production pipeline with logging, alerting, and retrospective analysis. You can read more about the overall approach to measuring quality in the article How to measure LLM application quality (evals).
Limitation: LLM-as-Judge is not infallible — the verification model can also "agree" with a hallucinated response if both are mutually consistent but factually wrong. We therefore combine it with grounding (Technique 1) and regular human auditing of a sample of production outputs.
Technique 5: Temperature, sampling, and prompt engineering
The last technique is the cheapest to implement but has the smallest effect without the other layers. It is still important.
Temperature controls the randomness of generation. A low temperature (e.g. 0.0–0.2) produces more deterministic, consistent outputs: sampling concentrates on the most probable tokens instead of spreading across less likely ones. For factual tasks (extraction, classification, question answering from documents) we recommend low temperature. For creative or generative tasks a higher temperature is desirable, but it also increases the risk of deviations.
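The effect of temperature can be seen directly on a toy distribution. The sketch below applies the standard temperature-scaled softmax to three made-up logits; the logit values are assumptions, and real APIs apply the same idea over the full vocabulary, typically treating a temperature of exactly zero as greedy selection.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.2)

print("T=1.0:", [round(p, 3) for p in probs_default])
print("T=0.2:", [round(p, 3) for p in probs_low])
```

With these logits the top token's share rises from roughly 0.63 at a temperature of 1.0 to above 0.99 at 0.2, which is why low temperature makes outputs more repeatable but does nothing to make the most probable token correct.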
Prompt engineering for reducing hallucinations has several well-established principles:
- Few-shot examples — show the model examples of correct behavior, including examples where the right answer is "I don't know"; the model learns this behavior in context
- Explicit format specification — "answer only on the basis of the following documents; conclude every claim with a numbered citation"
- Negative instructions — "do not infer information that is not explicitly stated in the context"
- Chain-of-thought — for complex questions, prompt the model to reason through its answer step by step before producing the final output; for sensitive tasks this significantly reduces "shortcutting" to a hallucinated answer
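A few-shot prompt that includes an "I don't know" example might be assembled as in this sketch. The example documents, questions, and the `build_prompt` helper are hypothetical; the instruction text reuses the phrasing from the list above.

```python
FEW_SHOT_EXAMPLES = [
    {
        "context": "[1] The pump must be inspected every 6 months.",
        "question": "How often must the pump be inspected?",
        "answer": "Every 6 months [1].",
    },
    {
        # Deliberately unanswerable from the context: teaches abstention.
        "context": "[1] The pump must be inspected every 6 months.",
        "question": "Who manufactured the pump?",
        "answer": "I don't know. The supplied documents do not state the manufacturer.",
    },
]

INSTRUCTIONS = (
    "Answer only on the basis of the following documents; "
    "conclude every claim with a numbered citation. "
    "Do not infer information that is not explicitly stated in the context."
)


def build_prompt(context, question):
    """Assemble instructions, worked examples, and the actual query."""
    parts = [INSTRUCTIONS, ""]
    for ex in FEW_SHOT_EXAMPLES:
        parts += [
            f"Documents:\n{ex['context']}",
            f"Question: {ex['question']}",
            f"Answer: {ex['answer']}",
            "",
        ]
    parts += [f"Documents:\n{context}", f"Question: {question}", "Answer:"]
    return "\n".join(parts)


prompt = build_prompt("[1] Valve V2 is rated for 16 bar.", "What is the rating of valve V2?")
print(prompt)
```

Keeping the examples in a data structure rather than in one long string makes it easier to re-evaluate and swap them when the model or its version changes.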
A warning: prompts are brittle — they work for a specific model and version. When the model or version changes, prompts must be re-evaluated. Prompt engineering is therefore an ongoing activity, not a one-time deliverable. For more on this topic: Prompt engineering for production.
All five techniques together
None of the techniques is sufficient on its own. In practice we combine them in layers:
- Foundation layer — RAG with citations + "I don't know" permission in the system prompt
- Output layer — structured output + programmatic validation
- Verification layer — LLM-as-Judge for critical responses
- Calibration layer — low temperature + few-shot prompting for consistency
The effect of combining them is substantially greater than the sum of the individual techniques. Production systems where we have implemented all four layers achieve a significantly lower rate of factual errors than systems with a single layer — on the order of tenths of a percent versus single-digit percent for specialist questions. Exact figures depend on the domain, the quality of the source documents, and the evaluation methodology.
What will not help
For completeness, here is what does not reliably reduce hallucinations:
- A larger model — it can hallucinate more confidently; without a grounding strategy it does not help
- A longer context window — more data in context does not mean fewer hallucinations; the model can ignore the relevant context or misinterpret it
- A simple "do not hallucinate" instruction in the prompt — the model does not understand the concept of "hallucinating" in a way that would allow it to actively prevent it; it needs positive instructions for expressing uncertainty
- Model benchmark scores — MMLU and similar benchmarks are saturated and do not measure production behavior in specialist domains; a leaderboard result does not reliably correlate with hallucination rate in your specific application
Frequently asked questions
Is it possible to build a system that never hallucinates?
No. LLMs are stochastic generative models — there is always a non-zero probability of hallucinated output. The goal is not zero error rate but reduction to an acceptable level for the given use case, combined with mechanisms for detecting and catching errors before they cause harm. For critical decisions (legal, medical, safety-related) human review remains essential.
Will fine-tuning on company data reduce hallucinations?
Only partially, and not directly. Fine-tuning (*adapting the model* on your own data) improves response style, terminology, and format — but it does not improve factual accuracy if the model does not receive the relevant context at inference time. For reducing hallucinations, RAG is typically a more effective and less expensive approach. Fine-tuning and RAG are not mutually exclusive — combining them makes sense in more advanced deployments. We cover the decision between them in the article RAG vs fine-tuning — decision guide.
How do we find out how much our LLM application hallucinates?
Systematic evaluation requires an eval dataset (a set of questions with correct answers or source documents) and an evaluation framework such as RAGAS (for RAG pipelines) or Langfuse (for general LLM applications). Automated evaluation using LLM-as-Judge can cover large volumes, but it must be calibrated against human judgment on at least a sample. We recommend auditing production outputs by hand at regular intervals throughout operation, not only at initial deployment.
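Calibrating an automated judge against human labels comes down to simple counting. In the sketch below the ten labels are invented for illustration (True means "hallucinated"), and precision and recall are one reasonable choice of agreement measure among several.

```python
def hallucination_rate(labels):
    """Share of responses labelled as hallucinated (True)."""
    return sum(labels) / len(labels)


def judge_agreement(judge_labels, human_labels):
    """Precision and recall of the judge, treating human labels as ground truth."""
    pairs = list(zip(judge_labels, human_labels))
    tp = sum(1 for j, h in pairs if j and h)      # judge and human both flag it
    fp = sum(1 for j, h in pairs if j and not h)  # judge flags, human does not
    fn = sum(1 for j, h in pairs if h and not j)  # human flags, judge misses it
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for ten audited responses (True = hallucinated).
human = [False, False, True, False, True, False, False, False, True, False]
judge = [False, False, True, False, False, False, True, False, True, False]

rate = hallucination_rate(human)
precision, recall = judge_agreement(judge, human)
print(f"human-labelled hallucination rate: {rate:.2f}")
print(f"judge precision: {precision:.2f}, recall: {recall:.2f}")
```

Low recall means the judge is missing hallucinations that people catch, which is the more dangerous failure for a production gate; low precision mostly costs reviewer time.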
Does low temperature increase the risk of stereotyped or incomplete responses?
Yes, this is a real trade-off. Low temperature reduces variability — the model consistently picks the most probable tokens, which in extreme cases can lead to repeated phrases or the omission of less probable but relevant information. In practice we recommend a temperature of 0.1–0.3 for factual tasks, not absolute zero, and combining it with explicit instructions for completeness of response.
What does it cost to introduce these techniques?
Costs are primarily implementation and operational. A RAG pipeline requires a vector database, an embedding model, and retrieval logic — with a self-hosted solution (e.g. Qdrant + an open-weight embedding model) infrastructure runs to a few euros per month. LLM-as-Judge doubles the number of LLM calls for critical outputs, which increases API costs. Structured outputs add virtually no overhead. Total investment depends on volume, but for typical enterprise applications (tens to hundreds of queries per day) the cost increase is negligible compared with the risk of silent failure in production.
*If you are planning to deploy an LLM in an environment where factual accuracy directly affects decisions — from technical documentation to customer support or compliance — we are happy to help you assess which combination of techniques makes sense for your specific case. MP Industrial Solutions has experience with production deployments in both industrial and regulated environments.*
