At a client in the energy sector we deployed RAG over technical standards and operational guidelines. After the first two weeks in production, operators reported that the system "sometimes answers correctly, sometimes not". When we looked more closely, it turned out the problem had two completely different root causes: in some cases retrieval was loading the wrong documents — the correct answer existed in the knowledge base but the system never found it. In other cases retrieval worked fine, but the generative model ignored the retrieved context and hallucinated its own answer. From the user's perspective both failures looked identical. Without metrics we had no way to tell them apart.
This is the core problem with evaluating RAG systems: retrieval and generation are two distinct components, each can fail for different reasons, and if you measure them together you lose the ability to diagnose. This article explains how to approach it — separately, systematically, with the tools that exist for exactly this purpose.
Why classic metrics are not enough
The first instinct when assessing a RAG system is to ask: "Is the answer correct?" and measure that manually or through simple comparison with an expected output. This approach has a fundamental flaw: it doesn't tell you *where* the pipeline failed.
Consider three scenarios:
- Retrieval loaded the right document, the model answered correctly from it — good result.
- Retrieval loaded the wrong document, the model answered faithfully from it: bad result, but faithfulness is high (the model didn't hallucinate, it just had bad context).
- Retrieval loaded the right document, the model ignored it and hallucinated — bad result, faithfulness is low.
Three different scenarios, two different failure points, one shared symptom: a wrong answer. If you measure only the final output, you might fix retrieval and discover that generation still hallucinates — or vice versa.
This is why modern RAG evaluation splits measurement into two layers: retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevancy). That is precisely the philosophy behind the RAGAS framework.
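The three scenarios above can be sketched as a small triage function. This is a minimal illustration, not part of RAGAS: the function name `locate_failure` and the 0.7 threshold are assumptions made for the example, and a sensible threshold depends on your use case.

```python
def locate_failure(context_recall: float, faithfulness: float,
                   threshold: float = 0.7) -> str:
    """Map per-case retrieval and generation scores to a likely failure point.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    retrieval_ok = context_recall >= threshold
    generation_ok = faithfulness >= threshold
    if retrieval_ok and generation_ok:
        return "ok"
    if not retrieval_ok and generation_ok:
        return "retrieval"   # model was faithful, but to the wrong context
    if retrieval_ok and not generation_ok:
        return "generation"  # right context was retrieved, model ignored it
    return "both"


print(locate_failure(0.9, 0.95))  # scenario 1
print(locate_failure(0.2, 0.95))  # scenario 2
print(locate_failure(0.9, 0.30))  # scenario 3
```

Measuring only the final answer collapses the last three return values into a single "wrong" bucket, which is the diagnostic loss described above.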
RAGAS — what it is and what it isn't
RAGAS (Retrieval Augmented Generation Assessment) is an open-source evaluation framework that measures RAG pipeline quality without requiring manual annotations for every case. Instead it uses an LLM-as-judge approach: each statement or claim is assessed by a separate language model acting as the judge.
What RAGAS is: a framework for offline evaluation of a RAG pipeline using a gold-standard dataset (called a golden set). You give it questions, reference answers, the contexts retrieved by your retrieval component, and the generated answers — and it computes four key metrics.
What RAGAS is not: real-time production monitoring, a replacement for evals of the full LLM application, or a tool for measuring the factual correctness of the knowledge base itself. If your documents contain incorrect information, RAGAS won't catch it — it measures consistency and relevance, not the truthfulness of your sources.
An important distinction from other types of evaluation: RAGAS focuses exclusively on the RAG component. If you want to measure LLM response quality independently of retrieval, you need different metrics, which belong to a separate area of evals. The same goes for fine-tuning evaluation (for example, measuring whether fine-tuning actually helped): it is a different discipline from RAG eval.
The four key RAGAS metrics
Context Precision
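Context precision measures whether the documents returned by retrieval are genuinely relevant to the question. Specifically: what fraction of the retrieved context is useful for generating the correct answer? The RAGAS implementation is also rank-aware: relevant chunks that appear near the top of the result list score higher than the same chunks buried further down.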
High context precision means retrieval isn't returning noise — every retrieved chunk contributes to the answer. Low context precision signals that the system is pulling in too many unrelated documents, which degrades generation quality (the LLM has to navigate both relevant and irrelevant material simultaneously).
In practice: if you have low context precision, the problem typically lies in the embedding model, faulty document segmentation (chunking), or the top-K setting (you're retrieving too many documents). More on choosing and configuring retrieval components is covered in the article on hybrid search.
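To make the metric concrete, here is a minimal sketch of a rank-weighted context precision, assuming the definition given in the RAGAS documentation (precision@k averaged over the ranks that hold a relevant chunk). The per-chunk relevance verdicts are passed in as booleans; in RAGAS they come from the judge model. The function name is ours, not a RAGAS API.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, ordered best-first.

    relevance[i] is True when the chunk at rank i + 1 was judged relevant.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


print(context_precision([True, True, False, False]))   # relevant chunks ranked first
print(context_precision([False, False, True, True]))   # same chunks ranked last
```

Both calls retrieve the same two relevant chunks among four, but the second scores much lower because the useful material sits below the noise. Note that under this definition irrelevant chunks that trail the relevant ones are not penalized, so it is worth tracking a plain fraction-of-relevant-chunks figure alongside it when tuning top-K.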
Context Recall
Context recall is the complementary metric: what fraction of the information needed for a correct answer is present in the retrieved context? It is measured against the reference answer from the golden set.
If context recall is low, retrieval is missing relevant documents — the information exists in the knowledge base but the system isn't finding it. Common causes: insufficient embedding model quality for the target language or domain, poor chunking (information is split across multiple fragments and none of them is sufficient on its own), or a K that is too small (you're retrieving only the top 3, but the relevant document ranked seventh).
Faithfulness
Faithfulness measures the degree to which the generated answer is grounded in the retrieved context. RAGAS does this by decomposing the answer into individual claims and checking whether each one can be inferred from the context, whether it is stated there explicitly or only follows from it implicitly.
This is the most important metric from a RAG system's trustworthiness perspective. Low faithfulness means the model is hallucinating — generating information that is absent from the retrieved context but sounds plausible. It is critically important to understand: faithfulness ≠ factual correctness. A model can be perfectly faithful (100% of the answer is grounded in context) while the context itself is wrong. If you want to measure factual accuracy, you must address knowledge base quality separately.
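The claim-level computation can be sketched as follows. This is an illustration under stated assumptions, not the RAGAS implementation: the claims are supplied already extracted, `naive_judge` is a deliberately crude stand-in for the LLM judge (exact substring match), and the example context and claims are invented.

```python
from typing import Callable


def faithfulness_score(claims: list[str], context: str,
                       is_supported: Callable[[str, str], bool]) -> float:
    """Share of answer claims that the judge finds supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: str) -> bool:
    """Stand-in for an LLM judge: case-insensitive substring match only."""
    return claim.lower() in context.lower()


# Hypothetical example data
context = ("The valve must be inspected every 12 months. "
           "Inspections are logged in the CMMS.")
claims = [
    "the valve must be inspected every 12 months",
    "inspections are logged in the CMMS",
    "the valve was installed in 2019",  # absent from the context
]
print(round(faithfulness_score(claims, context, naive_judge), 2))
```

Nothing in this calculation checks whether the context itself is true, which is why a perfectly faithful answer can still be factually wrong.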
Low faithfulness is typically fixed at the system prompt level (instructions for the model to answer strictly from context), through model selection (some models follow instructions more consistently), or by adding a guardrails layer. Guardrails, including faithful-answer checks, are covered in the article on guardrails for AI agents.
Answer Relevancy
Answer relevancy assesses whether the answer actually addresses the question that was asked. RAGAS measures this by generating several hypothetical questions from the generated answer and checking how closely they resemble the original query, measured as embedding similarity.
Low answer relevancy can occur even with high faithfulness: the model may be faithful to the context but answer a different question than the one posed — typically when the context is vague or the system prompt is insufficiently directive. In practice this metric exposes problems with prompting and question formulation.
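The scoring step can be sketched as a mean cosine similarity between the original question and the questions regenerated from the answer. The two-dimensional vectors below are toy values chosen for illustration; in practice they would come from an embedding model, and the regenerated questions from the judge LLM.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy(question_vec: list[float],
                     regenerated_vecs: list[list[float]]) -> float:
    """Mean similarity between the asked question and questions derived from the answer."""
    if not regenerated_vecs:
        return 0.0
    return sum(cosine(question_vec, v) for v in regenerated_vecs) / len(regenerated_vecs)


# Toy embeddings: one regenerated question matches the original, one is unrelated
original = [1.0, 0.0]
regenerated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy(original, regenerated))
```

An answer that drifts to a neighbouring topic produces regenerated questions that point away from the original query, which pulls the mean down even if every claim in the answer is faithful.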
Golden set — the foundation of every evaluation
RAGAS metrics are only as good as the dataset you compute them on. The golden set is a collection of test cases in the format:
- Question — a real or representative query
- Reference answer — a verified correct answer (ground truth)
- Retrieved context — the documents returned by your retrieval component for that query
- Generated answer — the output of your RAG system
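As a sketch, one test case in this format could be represented as a small record type. The class and field names are our own choice for illustration and do not correspond to any RAGAS schema.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One golden-set test case (hypothetical field names)."""
    question: str
    reference_answer: str
    retrieved_contexts: list[str] = field(default_factory=list)
    generated_answer: str = ""

    def is_complete(self) -> bool:
        """True once the pipeline outputs have been filled in for this case."""
        return bool(self.question and self.reference_answer
                    and self.retrieved_contexts and self.generated_answer)


case = GoldenCase(
    question="How often must the valve be inspected?",  # invented example
    reference_answer="Every 12 months.",
)
print(case.is_complete())  # contexts and generated answer are still missing
```

The question and reference answer are authored once; the retrieved contexts and generated answer are refilled on every evaluation run, because they are the outputs of the pipeline under test.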
The challenge is that assembling a golden set is costly: typically 100–300 cases for a basic evaluation, 500–1,000 for something more robust. Manual creation of every case by a domain expert is slow and expensive.
Two strategies address this:
Synthetic golden set generation: RAGAS and other tools include functions for automatically generating test questions directly from your documents. An LLM reads a chunk, generates a question and a reference answer. Advantage: speed and scalability. Disadvantage: synthetic questions can be too simple or fail to reflect the queries real users actually ask. In practice we recommend mixing: 60–70% synthetic cases as a baseline, 30–40% real queries from production logs annotated by an expert.
Mining production logs: Once a RAG system is live, you log (question, answer) pairs. Selecting representative cases and annotating them with user feedback (thumbs up/down) or domain expert review gives you a realistic golden set grounded in actual usage.
An important principle: the golden set must be kept alive. When the knowledge base is updated, some test cases become stale. After major documentation updates always verify that the golden set still reflects the current state.
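One way to make staleness visible is to store a fingerprint of the source document with every test case and compare it against the current knowledge base. This is a minimal sketch under our own assumptions: the keys `id`, `source_doc_id`, and `source_hash` are hypothetical, and a flagged case still needs a human to decide whether the reference answer changed.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash of a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_cases(cases: list[dict], current_docs: dict[str, str]) -> list[str]:
    """Return ids of cases whose source document changed or was removed."""
    stale = []
    for case in cases:
        doc = current_docs.get(case["source_doc_id"])
        if doc is None or fingerprint(doc) != case["source_hash"]:
            stale.append(case["id"])
    return stale


# Invented example: the standard was revised after the case was written
old_text = "Inspect the valve every 12 months."
cases = [{"id": "case-1", "source_doc_id": "std-7", "source_hash": fingerprint(old_text)}]
print(stale_cases(cases, {"std-7": old_text}))
print(stale_cases(cases, {"std-7": "Inspect the valve every 6 months."}))
```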
RAGAS in practice — integrating into the pipeline
RAGAS is a Python library that integrates straightforwardly into an existing RAG workflow. The basic use case looks like this: for each test case from the golden set, you call the evaluator with the question, the reference answer, the retrieved context, and the generated answer. The output is a score for each metric, both aggregated and per-case.
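A minimal sketch of that call, assuming the ragas 0.1.x interface (`evaluate`, the metric objects in `ragas.metrics`, and the column names `question`, `answer`, `contexts`, `ground_truth`). Later releases changed the dataset schema, so treat these names as assumptions and check them against the version you have installed. The keys of the input dictionaries are hypothetical, and `run_ragas` needs the `ragas` and `datasets` packages plus credentials for the judge LLM.

```python
def to_ragas_rows(cases: list[dict]) -> dict[str, list]:
    """Reshape golden-set cases into column lists (ragas 0.1.x column names assumed)."""
    return {
        "question": [c["question"] for c in cases],
        "answer": [c["generated_answer"] for c in cases],
        "contexts": [c["retrieved_contexts"] for c in cases],  # list of strings per case
        "ground_truth": [c["reference_answer"] for c in cases],
    }


def run_ragas(cases: list[dict]):
    """Score the cases with the four metrics; returns one row per case."""
    # Imported lazily so the reshaping helper works without these packages.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (answer_relevancy, context_precision,
                               context_recall, faithfulness)

    dataset = Dataset.from_dict(to_ragas_rows(cases))
    result = evaluate(dataset, metrics=[context_precision, context_recall,
                                        faithfulness, answer_relevancy])
    return result.to_pandas()


# Invented example case
example = [{
    "question": "How often must the valve be inspected?",
    "generated_answer": "Every 12 months.",
    "retrieved_contexts": ["The valve must be inspected every 12 months."],
    "reference_answer": "Every 12 months.",
}]
print(to_ragas_rows(example)["contexts"])
```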
What you should know before deploying it:
Cost: RAGAS calls an LLM internally to compute metrics (LLM-as-judge). For a golden set of 200 cases you can expect hundreds to thousands of LLM calls, which represents non-trivial cost with frontier models. The standard recommendation is to use a smaller, cheaper model for the judge function (for example a Haiku/Flash-tier model), or where possible a local open-weight model. The faithfulness computation is especially token-intensive because it decomposes the answer into individual claims.
Cadenced evaluation, not just ad hoc: Evaluation makes sense as a regular process — at every change to a retrieval component, knowledge base update, or prompt modification. We recommend including a RAGAS evaluation in your CI/CD pipeline at minimum as a sanity check on a subset of the golden set before every deployment.
Interpreting absolute scores: A faithfulness score of 0.80 is not objectively good or bad — it depends on the use case. For medical documentation 0.80 is insufficient. For an internal helpdesk it may be perfectly adequate. Trends matter more than absolute values: if a prompt change dropped faithfulness from 0.85 to 0.72, you have a clear signal.
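Because trends matter more than absolute values, a CI check can compare the current run against a stored baseline instead of a fixed pass mark. The sketch below is one possible shape for such a gate; the function name and the 0.05 tolerance are assumptions, and the numbers reuse the 0.85 to 0.72 example above.

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                max_drop: float = 0.05) -> dict[str, float]:
    """Return metrics whose score fell by more than max_drop versus the baseline."""
    return {
        name: round(baseline[name] - current[name], 4)
        for name in baseline
        if name in current and baseline[name] - current[name] > max_drop
    }


baseline = {"faithfulness": 0.85, "context_recall": 0.78}
current = {"faithfulness": 0.72, "context_recall": 0.77}
failed = regressions(baseline, current)
print(failed)  # only faithfulness dropped by more than the tolerance
```

In a pipeline, a non-empty result would typically fail the deployment step, for example by exiting with a non-zero status.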
Separating retrieval eval from generation eval
A key practice we see underused: evaluate the retrieval component and the generation component independently as well, not only end-to-end via RAGAS.
Retrieval eval in isolation: You can evaluate retrieval without any LLM generation at all. For each test question you know which documents retrieval should ideally find (from the golden set). You measure precision@K (what share of the top-K results is relevant) and recall@K (what share of the relevant documents landed in the top-K results). This gives you a clean signal about the quality of the embedding model, indexing strategy, and search configuration, without the generative model masking problems or compensating for retrieval errors.
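Both measures need only document ids, so they are cheap to compute. A minimal sketch, with invented ids and our own function name:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str],
                          k: int) -> tuple[float, float]:
    """Precision@K and recall@K for one query.

    retrieved is ordered best-first; relevant holds the ids the golden set expects.
    Precision is taken over the results actually returned, which equals K
    whenever at least K results come back.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Two relevant documents, one of them ranked seventh
retrieved = ["d4", "d9", "d1", "d6", "d2", "d8", "d7"]
relevant = {"d1", "d7"}
print(precision_recall_at_k(retrieved, relevant, k=3))
print(precision_recall_at_k(retrieved, relevant, k=7))
```

With K=3 the seventh-ranked document is missed and recall is halved; raising K to 7 recovers it at the price of lower precision. That trade-off is exactly what the top-K setting controls.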
Generation eval in isolation: You can fix the context — instead of dynamically retrieved documents, you always give the model the same manually verified context — and measure only faithfulness and answer relevancy. This isolates and tests the model's ability to follow instructions and extract information from provided text.
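A fixed-context harness might look like the sketch below. Everything here is illustrative: `generate` stands for whatever function calls your model, the prompt wording is an example rather than a recommended template, and the `verified_context` key is hypothetical.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "Answer strictly from the context below. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def run_fixed_context(cases: list[dict],
                      generate: Callable[[str], str]) -> list[dict]:
    """Generate answers from manually verified contexts, bypassing retrieval."""
    results = []
    for case in cases:
        prompt = PROMPT_TEMPLATE.format(context=case["verified_context"],
                                        question=case["question"])
        results.append({**case, "generated_answer": generate(prompt)})
    return results


# Stub model for demonstration; replace with a real model call
outputs = run_fixed_context(
    [{"question": "How often must the valve be inspected?",
      "verified_context": "The valve must be inspected every 12 months."}],
    generate=lambda prompt: "Every 12 months.",
)
print(outputs[0]["generated_answer"])
```

The resulting answers can then be scored for faithfulness and answer relevancy only, since retrieval played no part in producing them.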
Combining both perspectives with end-to-end RAGAS metrics gives a comprehensive picture of exactly where the pipeline is losing quality.
What RAGAS won't measure — and what to watch out for
RAGAS is a powerful tool, but it has limits that need to be explicitly understood.
Factual correctness of the knowledge base: As noted above, RAGAS measures the consistency of the answer with the context. If documents in the knowledge base contain outdated or incorrect information, RAGAS will not detect it. Knowledge base quality must be handled separately — through domain review, document dating, and update mechanisms.
Latency and cost in production: RAGAS is an offline evaluation tool. It won't tell you how the pipeline behaves under high query volume, what the actual latency is for the user, or what the production token costs are. For those metrics you need production monitoring: tools such as LangSmith, Langfuse, or a custom logging layer. These are covered in more detail in the article on AI agent observability.
Security properties: RAGAS does not measure resilience against prompt injection, the system's ability to refuse inappropriate queries, or adherence to safety guardrails. This belongs to security eval, which is a separate discipline.
Language quality and register: If your users communicate in Slovak and the knowledge base is in English, your RAGAS scores tell you nothing about translation quality or the naturalness of Slovak responses. If you're running a multilingual setup, you need human evaluation of language quality.
Frequently asked questions
How many cases do I need in my golden set for reliable results?
For a first baseline, 50–100 cases is enough to get a directional read. For decisions such as "change the embedding model" or "revise the chunking strategy" we recommend 200–400 cases covering different question types and knowledge base sections. Below 50 cases the results are too sensitive to individual outliers.
Can I use RAGAS for real-time production monitoring?
Not directly — every RAGAS measurement costs LLM calls, which is too expensive to run on every production query. The typical approach is sampling: instead of measuring every query you take a random 1–5% sample, run the RAGAS evaluation asynchronously, and track trends over time. For real-time production monitoring, simpler signals work better: thumbs up/down feedback from users, latency, and fallback response counts.
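The sampling step can be sketched as follows. We use a hash of the query id rather than a random number generator, which is our substitution: the selection stays roughly uniform but becomes reproducible, so the same query is always in or out of the sample. The function name, the 2% default rate, and the id format are assumptions.

```python
import hashlib


def in_eval_sample(query_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of queries by hashing their id."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


# Roughly 2% of 10,000 hypothetical query ids end up in the evaluation queue
sampled = [qid for qid in (f"q-{i}" for i in range(10_000)) if in_eval_sample(qid)]
print(len(sampled))
```

Queries that pass the check would be written to a queue and scored by an asynchronous worker, keeping the judge calls off the request path.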
How do I tell whether the problem is in retrieval or generation?
The simplest test: manually copy a relevant document directly into the context (bypassing retrieval) and send it to the model. If the answer is good, the problem is in retrieval. If the model hallucinated or answered irrelevantly even with a perfect context, the problem is in the generation layer — the prompt, the model, or the way you format context. This manual test is faster than a full eval and reveals most problems in minutes.
Is LLM-as-judge reliable? Can't it assess incorrectly?
LLM-as-judge has its own blind spots — for example it tends to rate longer answers higher even when they are less precise, or it favours styles close to the judge model's training data. RAGAS partially compensates for this by decomposing faithfulness into specific claims and verifying each one separately. For critical use cases (regulated industries, security-sensitive systems) we recommend combining automated eval with human review of at least a subset of cases.
Which model should I use as the judge in RAGAS?
For most cases a small or mid-tier model (Sonnet/Flash/Haiku tier) is sufficient: accurate enough and significantly cheaper than the maximum tier. If you have on-premises requirements or sensitive data, RAGAS also supports local models via an OpenAI-compatible API; a strong open-weight model from the Qwen3 or Llama family served via vLLM works well for judgment tasks.
*If you're unsure where to start with evaluating your RAG system, or want to find out exactly where your pipeline is losing accuracy — MP Industrial Solutions conducts diagnostic assessments of RAG deployments and helps establish an evaluation process that produces actionable results, not just numbers.*
