When you compare two models, the first instinct is to open a leaderboard and look at the scores. MMLU, HumanEval, GSM8K — the numbers look objective and comparable. The problem is that frontier models now score 88–94% on MMLU, and within that range the differences between them are more likely measurement noise than real performance gaps. A benchmark that can no longer reliably distinguish between models has earned the label "saturated" — and a saturated benchmark tells you almost nothing about your specific use case.
This article explains why benchmarks lie (or at least mislead), what concrete mechanisms are at play, and what to do instead of blindly reading leaderboards. At the end you'll find the practical framework we use when selecting models for clients.
Why MMLU is saturated and what that means
MMLU (Massive Multitask Language Understanding) was genuinely useful when it was created. It covered 57 domains from mathematics to law and medicine, and early large models scored roughly in the 40–60% range. That left room to measure progress: models were clearly above the 25% random-guess baseline of a four-option test, but nowhere near the ceiling.
Today the picture is different. When all the models being compared cluster in the 88–94% band, several problems arise at once:
- The differences are within the margin of statistical uncertainty. A 1–2 percentage point gap on a typical test set may be an artefact of prompting, the order of answer choices, or simply generation variability.
- Prompt wording changes the result. Research consistently shows that the same model can achieve a result that differs by roughly ±5–10 percentage points simply by changing how a question is phrased. This is not a flaw in any particular model — it is a property of all language models.
- The benchmark does not measure what you need. MMLU tests knowledge in a multiple-choice format. Your application likely generates longer text, calls tools, works with context, or operates in a specific language. These are completely different tasks.
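To make the first point concrete, here is a minimal sketch of how wide the sampling uncertainty around a benchmark score can be. It uses the Wilson score interval; the test-set sizes and accuracies are hypothetical, and the overlap check is a rough heuristic rather than a formal significance test. It also covers sampling noise only, so prompt sensitivity comes on top of it.

```python
import math


def accuracy_interval(accuracy: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (about 95% for z=1.96) for an accuracy on n_questions items."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    z2 = z * z
    denom = 1 + z2 / n_questions
    centre = (accuracy + z2 / (2 * n_questions)) / denom
    half_width = (
        z
        * math.sqrt(accuracy * (1 - accuracy) / n_questions + z2 / (4 * n_questions**2))
        / denom
    )
    return centre - half_width, centre + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical example: two models scoring 90% and 91% on a 1,000-question test set.
model_a = accuracy_interval(0.90, 1000)
model_b = accuracy_interval(0.91, 1000)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

With 1,000 questions each interval is several percentage points wide, so a one-point gap cannot be separated from noise by this check alone.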
Training data contamination — the quiet source of inflated scores
One of the least openly discussed problems in benchmarking is data contamination. When a benchmark's test data appears in a model's training data, which is a real risk for models trained on a large portion of the internet, the model effectively "remembers" the correct answers rather than understanding the underlying material.
Detecting contamination is hard. Most model providers do not publish audits of their training data. Some release internal results of decontamination tests; others do not. The upshot is that when comparing two models on MMLU, you cannot tell whether you are looking at a real difference in capability or a difference in how many test questions appeared in the training data.
Practical consequence: when selecting a model, prefer benchmarks with dynamically generated or private test sets. Some newer benchmarks try to address this problem: LiveBench through regular updates of its questions, MMLU-Pro through a stronger emphasis on reasoning rather than memorisation.
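You usually cannot inspect a provider's training data, but you can check corpora you control, for example to make sure your own eval set has not leaked into fine-tuning data. The sketch below shows one simple heuristic, word-level n-gram overlap. The function names, the default n-gram length, and the example texts are assumptions for illustration, not a standard decontamination procedure.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All lower-cased word n-grams in text (empty set if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_document: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item_ngrams = word_ngrams(benchmark_item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & word_ngrams(corpus_document, n)) / len(item_ngrams)


# Hypothetical test question and two hypothetical corpus documents.
question = "Which of the following best describes the role of mitochondria in a cell"
leaked_doc = "Quiz dump. " + question + " Answer: B."
clean_doc = "Maintenance manual for the hydraulic press, revision three, chapter two, section one."

print(overlap_ratio(question, leaked_doc))  # high ratio: the item appears verbatim
print(overlap_ratio(question, clean_doc))   # no shared 8-grams
```

A high ratio flags an item for manual review. A low ratio proves little, because paraphrased or translated copies share few exact n-grams.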
Overfitting to the leaderboard as the core problem
There is also a less innocent version of the same issue: targeted optimisation toward specific benchmarks. When model providers know that the market compares on MMLU, HumanEval, and GSM8K, strong economic pressure arises to train (or fine-tune) models with particular emphasis on exactly those sets.
This is not necessarily fraud — it can be a consequence of training data selection, the manner of instruction tuning, or RLHF reward models. The result is the same, however: a model that looks great on a leaderboard can be substantially worse on real-world tasks the benchmark does not cover.
We have seen this in practice on projects involving industrial documentation: a model that won a coding benchmark could not reliably extract structured data from technical PDFs. A different model, with a lower overall score, turned out to be considerably better on that same task — because its training data likely contained more technical text in the relevant format.
Why leaderboard rank does not correlate with production behaviour
An aggregate leaderboard score is like the average temperature of a city: informative at a very coarse level, but useless when choosing what to wear on a particular day. There are several reasons why a model's leaderboard rank may not correlate with production performance:
Domain specificity. General benchmarks test an average across dozens of domains. Your use case is one specific domain — manufacturing documentation, legal contracts, customer support in a particular language. A model that is strong on average may be weak in precisely your domain.
Language degradation. Most benchmarks are in English. Slovak is a low-resource language: models degrade on it considerably more than on English, but this degradation does not appear at all on an English leaderboard. Always test separately in your target language. In our experience, a model that wins a comparison in English can finish third or fourth on the same test in Slovak.
Input and output format. Benchmarks typically test short questions with short answers. Production applications work with long context, tool calls, structured JSON generation, or multi-turn conversations. These are different tasks with different model requirements.
Latency and cost. Leaderboards measure quality, not speed or price. The highest-scoring model may be 5× more expensive and 3× slower than a model scoring 2 percentage points lower, which in a production deployment can be decisive. For a deeper look at model selection with these factors in mind, see How to Choose an LLM Model in 2026.
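One way to fold latency and cost into the decision is to treat them as hard budgets and only then compare quality. The sketch below assumes you have already measured each candidate on your own data. The `Candidate` fields, model names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float               # pass rate on your own eval, 0..1
    cost_per_1k_requests: float  # in your billing currency
    p95_latency_s: float         # 95th-percentile response time in seconds


def pick_model(
    candidates: list[Candidate], max_cost: float, max_latency_s: float
) -> Optional[Candidate]:
    """Best-quality candidate that fits both budgets, or None if none fits."""
    eligible = [
        c
        for c in candidates
        if c.cost_per_1k_requests <= max_cost and c.p95_latency_s <= max_latency_s
    ]
    return max(eligible, key=lambda c: c.quality, default=None)


# Hypothetical measurements: model-a scores higher but is 5x dearer and 3x slower.
candidates = [
    Candidate("model-a", quality=0.92, cost_per_1k_requests=50.0, p95_latency_s=6.0),
    Candidate("model-b", quality=0.90, cost_per_1k_requests=10.0, p95_latency_s=2.0),
]
choice = pick_model(candidates, max_cost=20.0, max_latency_s=3.0)
print(choice.name if choice else "no candidate fits the budget")
```

Treating budgets as filters rather than weights keeps the trade-off explicit: a model that breaks a hard constraint is out, however good its score.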
Your own eval is always more important than a leaderboard
The conclusion that follows from the above is clear: there is no external benchmark that will tell you which model is right for your use case. The only reliable source of truth is your own eval built on your data, your criteria, and your language.
How to do it in practice:
1. Collect real inputs from day one. Log every query and every response (with PII anonymisation). This is the gold reserve of your test cases.
2. Define quality criteria for your specific task. What does "a good answer" mean in your context? Factual accuracy? Format compliance? Absence of hallucinated numbers? Every application has different criteria.
3. Assemble an eval set from real cases. At least 50–100 examples, ideally 200–500. Cover common cases as well as edge cases. Annotate expected answers or at least criteria.
4. Automate scoring via `LLM-as-a-judge`. A stronger model (or a specialised eval prompt) scores the outputs of the model under test according to your criteria. This is standard production practice today. The topic is covered in more depth in How to Measure the Quality of an LLM Application.
5. Run the eval before every significant change. Swapping a model, adjusting a prompt, changing a retrieval strategy: any of these can unexpectedly degrade quality. Without regression tests, you will only find out from a customer.
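The steps above can be sketched as a small harness. This is a minimal illustration, not a reference implementation: the model and the judge are plain callables so that any client can be plugged in, `keyword_judge` is a trivial stand-in for a real LLM judge call, and the cases, answers, and tolerance are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    query: str     # a real (anonymised) input from your logs
    criteria: str  # what a good answer must satisfy


Model = Callable[[str], str]             # query -> answer
Judge = Callable[[str, str, str], bool]  # (query, answer, criteria) -> passed


def run_eval(model: Model, judge: Judge, cases: list[EvalCase]) -> float:
    """Pass rate of the model over the eval set, as decided by the judge."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(judge(c.query, model(c.query), c.criteria) for c in cases)
    return passed / len(cases)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True if the candidate pass rate fell more than tolerance below the baseline."""
    return candidate < baseline - tolerance


def keyword_judge(query: str, answer: str, criteria: str) -> bool:
    # Stand-in for an LLM judge: here the criteria is just a required keyword.
    return criteria.lower() in answer.lower()


# Hypothetical cases and stub models standing in for real API calls.
cases = [
    EvalCase(query="What torque applies to the cover bolts?", criteria="Nm"),
    EvalCase(query="Which unit is the pressure limit given in?", criteria="bar"),
]


def baseline_model(query: str) -> str:
    return "Use 25 Nm; the pressure limit is 6 bar."


def candidate_model(query: str) -> str:
    return "Use 25 Nm."


baseline_score = run_eval(baseline_model, keyword_judge, cases)
candidate_score = run_eval(candidate_model, keyword_judge, cases)
print(baseline_score, candidate_score, is_regression(baseline_score, candidate_score))
```

In a CI pipeline the `is_regression` result would decide whether a model, prompt, or retrieval change may ship.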
Tools like Langfuse (open-source, self-hostable) or Promptfoo (open-source, CI/CD integration) significantly lower the barrier to introducing an eval process. These are not exclusively enterprise tools — small teams can deploy them too.
How to read benchmarks constructively
Despite all the caveats, ignoring benchmarks entirely is not the right approach either. They make sense in specific contexts:
Coarse screening of candidates. If you are comparing dozens of models and need to narrow the selection down to 3–5 finalists, a leaderboard is a legitimate first filter. Do not use it for the final decision, but using it to eliminate clearly weak models is fine.
Choosing the benchmark to match the task. Not all benchmarks are equally misleading. Look for those closest to your use case:
- For code generation: HumanEval, MBPP, SWE-bench
- For mathematical reasoning: MATH, GSM8K
- For long context: RULER, HELMET
- For instruction following: IFEval
- For general reasoning: MMLU-Pro (harder variant, less saturated)
Dynamic leaderboards. Platforms like ArtificialAnalysis aggregate multiple dimensions at once — quality, latency, cost, context window. That gives a far more realistic picture than MMLU scores alone.
Compare under identical conditions. If you are evaluating two models yourself, use identical prompts, identical temperature, and ideally the same eval framework. Any difference in conditions will confound the result.
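Once both models have been run on the same cases under the same conditions, a paired bootstrap gives a rough sense of whether the observed gap is stable. The sketch below is one possible approach under assumptions: per-case scores are 0/1 pass flags, and the resample count, seed, and score lists are hypothetical.

```python
import random


def paired_bootstrap(
    scores_a: list[float], scores_b: list[float], n_resamples: int = 2000, seed: int = 0
) -> float:
    """Fraction of resampled eval sets on which model A's total score beats model B's.

    Both lists must hold per-case scores for the same cases in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the comparison is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each A/B pair together.
        resample = rng.choices(diffs, k=len(diffs))
        if sum(resample) > 0:
            wins += 1
    return wins / n_resamples


# Hypothetical per-case pass flags for two models on the same 10 cases.
model_a_scores = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
model_b_scores = [1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
print(paired_bootstrap(model_a_scores, model_b_scores))
```

A value near 1.0 suggests A is reliably ahead on this eval set, while a value near 0.5 suggests the gap is mostly noise. Pairing matters because it removes the variation caused by some cases simply being harder than others.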
Special case: Slovak and regulated domains
For applications in Slovak, one simple rule applies: never trust an English benchmark without verifying in your target language. Slovak is a low-resource language and models degrade on it considerably more than on English. This degradation does not appear in standard leaderboards at all, because most benchmarks are in English.
Practical approach: take the final 2–3 candidates you shortlisted from a leaderboard and run your own eval on them in Slovak, on your real data. The ranking may change.
For regulated domains — law, medicine, pharmacy, financial advisory — an even stronger warning applies: benchmark scores from general domains tell you nothing about how well a model handles legal clauses in Slovak, medical abbreviations, or regulatory text. This gap is why, in regulated domains, it is worth considering fine-tuning on domain data — which we cover in more depth in RAG vs Fine-Tuning — How to Decide.
Red flags when reading benchmark claims
When you encounter benchmark results in a provider's presentation or a PR article, watch for these warning signs:
- Benchmark without a measurement date. Models are updated, benchmarks change — a number without a date may be months out of date.
- Selective choice of benchmarks. If a provider cites 5 benchmarks where it wins and does not mention the others, that is selection bias.
- Missing information about prompting. "Few-shot" vs. "zero-shot", chain-of-thought vs. direct answer — each of these can shift the result by several percentage points. Without this information, the numbers are not comparable.
- Comparison with models of a different vintage. Comparing a new model against two-year-old competitors is legitimate marketing, but not an objective comparison.
- A benchmark the model likely saw during training. The older and more popular a benchmark (such as MMLU), the higher the probability of contamination.
Frequently asked questions
Does a higher MMLU score mean a better model for my business?
Not necessarily. MMLU is a saturated benchmark: differences between frontier models in the 88–94% band are often within the margin of statistical uncertainty. For your specific use case, only scores on tasks similar to your own, in your deployment language, are relevant. Your own eval on real data is a more reliable indicator than any general leaderboard.
Can I trust a benchmark if the model provider published it themselves?
With caution. Providers have a legitimate interest in presenting their model in the best possible light, which can lead to selective choice of benchmarks or more favourable testing conditions. That does not mean the numbers are fabricated — but always look for independent reproductions, for example on platforms like ArtificialAnalysis, or run your own comparison.
What is training data contamination and why is it a problem?
Contamination occurs when a benchmark's test questions appear in a model's training data. The model then "remembers" the correct answers rather than deriving them from understanding. The result is an inflated benchmark score that does not reflect actual capabilities. Detection is difficult because most providers do not publish the precise composition of their training data.
How quickly can I build my own basic eval?
For a first eval, 50–100 real examples from your application with annotated expected outputs or quality criteria are sufficient. Tools like Promptfoo or Langfuse allow you to run the first automated scoring within days, not weeks. The key is to start small and iterate — not to wait for the "perfect" set.
Does the ranking of models change between versions?
Yes, significantly and unpredictably. A model update (even while keeping the same name) can change performance on different tasks in different directions. This is why having an eval set up as a regression test — not a one-time comparison — is critical for production deployments. Every significant change (new model version, new prompt, new retrieval system) must be verified against the same eval set.
*We help companies set up an eval process that works in production — from selecting test cases through automated scoring to CI/CD pipeline integration. If you are not sure where to start, we are happy to look at your specific use case.*
