One of the first things you notice after deploying an LLM in production is the spread of query complexity. Some requests are trivial — extracting a number from text, reformatting an address, validating a format. Others are genuinely complex — multi-step reasoning, synthesis across multiple documents, legal analysis. The problem is that most systems send all of these to the same model. You pay for a frontier model even when it's answering a question that a model at one-tenth the price would handle without issue.
LLM routing addresses exactly this. The idea is simple: before every LLM call, estimate how demanding the task is and send it to the cheapest model that can handle it reliably. Call the powerful model only when the situation genuinely requires it. In practice this means 75–90% of calls go to the cheaper model and overall costs drop without any visible quality degradation for the majority of use cases.
What routing is and what cascading is
Routing is a decision made before the first call: based on the input, choose a model. A classifier scores the query and sends it either to a small model or directly to a powerful one.
Cascading adds an extra step: the query is sent to the cheap model first and its response is evaluated. If the response passes the check (confidence score, consistency, format), it is returned to the user. If not, the system automatically escalates to a stronger model, so you pay for that model only in this case. Both approaches can be combined: the router decides the entry-point model, the cascade decides whether to escalate.
The key difference: routing is proactive (classification before the call), cascading is reactive (escalation after the call when the output falls short).
Why it works — query complexity distribution in practice
In real production systems we have analyzed, the distribution breaks down roughly as follows:
- 30–40% of queries are routine — extraction, classification, reformatting, simple lookups. A small model handles these, for example from the Phi or Gemma family, or the cheap tier of frontier providers.
- 40–50% of queries are medium complexity — summarizing longer text, simple reasoning, FAQ responses with context. A mid-tier model suffices here.
- 10–20% of queries are genuinely hard — complex analyses, multi-hop reasoning, code generation with non-standard requirements, sensitive decisions. These deserve a frontier model.
If you send everything to a frontier model, you pay the strong-model price for every query, including the 30–40% that are routine. Frontier model pricing (e.g. Claude Opus class, GPT-5.x class) runs roughly $15–25 per million output tokens, while the cheap tier (Flash/Haiku equivalents) sits at roughly $1–2 or below. Against a $1 model the difference is 15–25×. Even with modest routing where the small model handles only half of queries, costs can drop by 40–60%.
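The arithmetic behind that estimate can be checked with a short sketch. The prices below are illustrative values picked from the ranges above, not quotes from any provider, and `blended_cost` and `savings` are hypothetical helpers. The sketch also assumes both models produce similar output lengths, which real traffic may not.

```python
def blended_cost(strong_price: float, cheap_price: float, cheap_share: float) -> float:
    """Average price per million output tokens when `cheap_share` of traffic
    (0.0 to 1.0) goes to the cheap model and the rest to the strong model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * strong_price


def savings(strong_price: float, cheap_price: float, cheap_share: float) -> float:
    """Fractional saving relative to sending everything to the strong model."""
    return 1.0 - blended_cost(strong_price, cheap_price, cheap_share) / strong_price


# Illustrative prices in USD per million output tokens (assumptions).
STRONG, CHEAP = 20.0, 1.0
for share in (0.3, 0.5, 0.8):
    print(f"cheap share {share:.0%}: saving {savings(STRONG, CHEAP, share):.1%}")
```

With these assumed prices, routing half of the traffic to the cheap model saves 47.5%, which sits inside the 40–60% range quoted above.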
The academic project RouteLLM (LMSYS / UC Berkeley) showed that a well-trained router can send most simple queries to the cheap model while preserving most of the strong model's quality — in their measurements this translated to roughly 85% cost savings while retaining a substantial share of the strong model's performance. Exact figures depend on query mix and router calibration.
How the complexity classifier works
The heart of the router is a classifier that decides where to send each input. Several approaches exist with different trade-offs:
1. Simple heuristic router
Rules based on input length, keywords, and task type (extraction vs. generation vs. analysis). Works surprisingly well for systems with a narrow use case and clearly separated query types. Zero overhead, deterministic behavior. Downside: brittle, requires manual rules, does not adapt to edge cases.
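A heuristic router of this kind might look like the sketch below. The keyword list, the task-type names, and the length limit are all hypothetical and would need to be tuned to your own traffic. Anything the rules do not recognize goes to the strong model as a conservative default.

```python
import re

# Hypothetical rules; tune keywords, task types, and the length limit to your data.
HARD_KEYWORDS = re.compile(
    r"\b(analy[sz]e|compare|explain why|prove|legal|contract)\b", re.IGNORECASE
)
EASY_TASK_TYPES = {"extraction", "classification", "reformatting"}


def route_heuristic(query: str, task_type: str, max_easy_chars: int = 500) -> str:
    """Return "small" or "strong" based on hand-written rules."""
    if HARD_KEYWORDS.search(query):
        return "strong"  # a hard keyword overrides the declared task type
    if task_type in EASY_TASK_TYPES and len(query) <= max_easy_chars:
        return "small"
    return "strong"  # unknown or long inputs default to the strong model


print(route_heuristic("Extract the invoice number from: INV-2291", "extraction"))
print(route_heuristic("Compare these two supplier offers", "extraction"))
```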
2. Embedding-based router
The query is embedded (e.g. with a small BGE model) and compared against representative examples from each tier. If the query is close to the sample of easy tasks, it goes to the cheap model. The embedding model runs locally; overhead is minimal (~5–15 ms). Advantage: requires no classifier training, easy to extend with new examples.
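The nearest-example logic can be sketched as follows. To keep the example self-contained, `toy_embed` is a bag-of-words stand-in for a real embedding model such as a BGE model; it is an assumption for illustration only, as are the example queries and tier names. In a real system you would swap in the embedding model's vectors and keep the comparison logic.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical representative examples per tier.
TIER_EXAMPLES = {
    "small": [
        "extract the date from this email",
        "reformat this address",
        "is this a valid email address",
    ],
    "strong": [
        "analyze the liability clauses in these two contracts",
        "explain the root cause across these incident reports",
    ],
}


def route_by_embedding(query, examples=TIER_EXAMPLES, embed=toy_embed, default="strong"):
    """Pick the tier whose closest example is most similar to the query."""
    q = embed(query)
    best_tier, best_score = default, 0.0
    for tier, texts in examples.items():
        score = max(cosine(q, embed(t)) for t in texts)
        if score > best_score:
            best_tier, best_score = tier, score
    return best_tier  # falls back to `default` when nothing is similar


print(route_by_embedding("extract the date from this invoice"))
print(route_by_embedding("analyze the liability in these contracts"))
```

Extending the router means appending strings to `TIER_EXAMPLES`, which mirrors the "easy to extend with new examples" advantage described above.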
3. Trained binary classifier
A small model (logistic regression or small transformer) trained to predict whether a query requires the strong model. RouteLLM provides exactly this approach with pre-trained classifiers. Advantage: high accuracy after calibration on your own data. Downside: requires an annotated training set and ongoing updates.
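To make the idea concrete, here is a minimal logistic-regression sketch in plain Python. It is not RouteLLM's implementation: the two features, the keyword set, and the eight training queries are hypothetical, and a real classifier would be trained on thousands of annotated production queries with richer features.

```python
import math

# Hypothetical features: relative length and presence of a "reasoning" word.
REASONING_WORDS = {"why", "compare", "analyze", "explain", "prove"}


def featurize(query: str) -> list[float]:
    words = query.lower().split()
    return [min(len(words) / 50.0, 1.0), 1.0 if REASONING_WORDS & set(words) else 0.0]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(queries, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss. Returns (weights, bias)."""
    X = [featurize(q) for q in queries]
    w, b, n = [0.0, 0.0], 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(X, labels):
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def needs_strong(query: str, w, b, cutoff: float = 0.5) -> bool:
    x = featurize(query)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= cutoff


# Toy annotated set: label 1 means "requires the strong model".
EASY = ["extract the date", "reformat this address", "is this email valid", "classify this ticket"]
HARD = [
    "compare these two contracts and explain the liability differences",
    "analyze why revenue dropped across all three regions",
    "explain why the second proof fails",
    "compare the incident reports and analyze the root cause",
]
w, b = train(EASY + HARD, [0] * len(EASY) + [1] * len(HARD))
print(needs_strong("is this email valid", w, b), needs_strong(HARD[0], w, b))
```

The `cutoff` parameter is the calibration knob: lowering it sends more borderline queries to the strong model at higher cost.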
4. LLM-as-a-judge router
The cheap model itself evaluates whether a query is hard. Counterintuitive, but functional in practice — a cheap model can distinguish "this is a simple extraction" from "this is a complex analysis," even if it cannot perform that complex analysis well itself. Overhead: one short extra LLM call. Suitable for systems with a cascade.
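A sketch of this pattern, assuming a hypothetical `call_cheap_model` callable that takes a prompt and returns the model's text reply (wire it to whatever client you use). The prompt wording is illustrative. Any reply that is not clearly "EASY" is treated as hard, so a malformed judge answer fails toward the strong model.

```python
from typing import Callable

# Illustrative prompt; the exact wording is an assumption.
JUDGE_PROMPT = (
    "Classify the difficulty of the user request below.\n"
    "Reply with exactly one word: EASY or HARD.\n\n"
    "Request:\n{query}"
)


def route_with_judge(query: str, call_cheap_model: Callable[[str], str]) -> str:
    """Ask the cheap model to grade the query, then map its verdict to a tier."""
    reply = call_cheap_model(JUDGE_PROMPT.format(query=query))
    verdict = reply.strip().upper()
    return "small" if verdict.startswith("EASY") else "strong"


# Usage with a stub in place of a real model call.
def fake_model(prompt: str) -> str:
    return "EASY" if "address" in prompt else "HARD"


print(route_with_judge("Reformat this address: 12 Main St", fake_model))
print(route_with_judge("Assess the legal risk in this merger", fake_model))
```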
For most production systems we recommend starting with an embedding-based or simple heuristic router. Add a trained classifier once you have sufficient operational logs for training (on the order of thousands of annotated examples).
A practical cascade pattern
In practice, cascading looks like this:
1. A query arrives in the system.
2. The small model generates a response along with a confidence score.
3. If the score exceeds a threshold (e.g. 0.85), the response is returned to the user.
4. If not, the same query is sent to the medium or strong model and its result is returned.
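The four steps above can be sketched as a small function. The two model callables are hypothetical placeholders: `small_model` is assumed to return an answer together with a confidence in the range 0 to 1, and `strong_model` returns only an answer.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CascadeResult:
    answer: str
    model: str         # which tier produced the final answer
    confidence: float  # confidence reported for the small model's attempt


def cascade(
    query: str,
    small_model: Callable[[str], Tuple[str, float]],
    strong_model: Callable[[str], str],
    threshold: float = 0.85,
) -> CascadeResult:
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return CascadeResult(answer, "small", confidence)
    # Below threshold: discard the cheap answer and pay for the strong model.
    return CascadeResult(strong_model(query), "strong", confidence)


# Usage with stubs in place of real model calls.
result = cascade("What is 2 + 2?", lambda q: ("4", 0.97), lambda q: "4")
print(result.model, result.answer)
```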
Confidence scores can be derived in several ways:
- Token log-probability: the average log-probability of the generated tokens. A low value indicates the model was "fumbling." Exponentiating the average gives a 0–1 score that can be compared against a threshold such as 0.85. Available in serving frameworks such as vLLM or SGLang via the logprobs parameter.
- Sampling consistency: generate 3–5 different responses (higher temperature) and measure agreement. If the model contradicts itself, escalate.
- Verification prompt: a secondary call to the cheap model asking "Is this answer correct and complete? Reply only yes/no."
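The first two signals reduce to a few lines of arithmetic. The sketch below assumes you already have the per-token log-probabilities (or the sampled answers) in hand; how you obtain them depends on your serving stack. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def mean_logprob_confidence(token_logprobs: Sequence[float]) -> float:
    """exp(mean log-probability): the geometric-mean token probability, in (0, 1]."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def agreement_confidence(samples: Sequence[str]) -> float:
    """Share of sampled answers that match the most common one after
    trimming whitespace and lowercasing."""
    if not samples:
        return 0.0
    normalized = [s.strip().lower() for s in samples]
    return Counter(normalized).most_common(1)[0][1] / len(normalized)


print(mean_logprob_confidence([-0.05, -0.10, -0.02]))
print(agreement_confidence(["Paris", "paris ", "Lyon"]))
```

Exact-match agreement suits short outputs such as labels or extracted values. Free-text answers would need a softer comparison, which this sketch does not attempt.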
The escalation threshold is a key parameter that must be calibrated to your specific use case. Too high a threshold means many unnecessary escalations and lost savings. Too low a threshold means the small model answers things it cannot handle, degrading quality.
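One way to calibrate is to replay a labeled sample at several thresholds and look at the trade-off directly. The sketch assumes you have, for each query, the small model's confidence and a manual judgment of whether its answer was correct. The six records are made-up illustration data.

```python
def sweep_thresholds(records, thresholds):
    """records: list of (confidence, small_model_was_correct) pairs.
    Returns (threshold, escalation_rate, error_rate_among_accepted) rows."""
    rows = []
    n = len(records)
    for t in thresholds:
        accepted = [ok for conf, ok in records if conf >= t]
        escalation_rate = 1.0 - len(accepted) / n
        wrong = sum(1 for ok in accepted if not ok)
        error_rate = wrong / len(accepted) if accepted else 0.0
        rows.append((t, escalation_rate, error_rate))
    return rows


# Made-up calibration sample.
sample = [(0.95, True), (0.90, True), (0.88, False), (0.70, True), (0.60, False), (0.50, False)]
for t, esc, err in sweep_thresholds(sample, [0.6, 0.8, 0.9]):
    print(f"threshold {t:.2f}: escalated {esc:.0%}, errors among accepted {err:.0%}")
```

Raising the threshold trades a higher escalation rate for fewer wrong answers reaching the user, which is the tension described above.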
For systems with a clearly defined task (e.g. classifying documents into categories, extracting structured data), cascading is extremely effective — the small model handles 80–90% of cases, and the strong model only receives the genuinely ambiguous ones. For generative use cases (free text, creative writing, complex answers to open-ended questions), cascading is harder to configure because "correctness" of the answer is difficult to assess automatically.
The cost of routing — it is not free
Routing is not free, and it is important to account for its own costs:
- Added latency — every classification call adds time. Embedding-based router: ~5–15 ms. LLM-as-a-judge: 50–200 ms. For real-time interactive applications this can be noticeable.
- Cascade on misclassification — if the small model answers a hard query and the answer is wrong, you pay for two calls (small + strong) instead of one. A misclassification therefore does not reduce costs — it raises them.
- Operational complexity — the router is an additional component in the system that can fail, degrade, and must be monitored and updated.
- Cold-start problem — a new router without historical data classifies poorly. The first few weeks may be worse than a single-model strategy.
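The double-payment effect can be put into numbers. In a cascade, every query pays for the small model and escalated queries pay for the strong model as well. The sketch below uses illustrative per-query costs and hypothetical helper names.

```python
def cascade_cost(c_small: float, c_strong: float, escalation_rate: float,
                 c_router: float = 0.0) -> float:
    """Expected cost per query: router + small model always, strong model on escalation."""
    return c_router + c_small + escalation_rate * c_strong


def break_even_escalation_rate(c_small: float, c_strong: float,
                               c_router: float = 0.0) -> float:
    """Escalation rate above which the cascade costs more than strong-only."""
    return 1.0 - (c_small + c_router) / c_strong


# Illustrative relative costs per query (assumptions).
C_SMALL, C_STRONG = 1.0, 20.0
print(cascade_cost(C_SMALL, C_STRONG, escalation_rate=0.2))
print(break_even_escalation_rate(C_SMALL, C_STRONG))
```

With a 20× price gap the break-even escalation rate is high, so the cost risk is modest. The narrower the gap between the two models, the sooner escalations erase the savings.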
Routing therefore works best in systems with high volume, heterogeneous queries, and a clear distinction between easy and hard tasks. If you have 100 queries a day and they are all roughly equally demanding, the router overhead will not produce meaningful savings.
Before deploying, answer this: what is the actual complexity mix of your queries? If you do not have data, spend two weeks logging and manually annotating a sample of 200–300 queries. This exercise will also reveal other problems — for example, that 40% of calls contain the same long system prompt, where prompt caching would save more than routing.
When routing is not enough and what else to do
Routing addresses one cost dimension — model selection. There are situations where a different optimization is more effective:
Repetitive prefix prompts → Prompt caching is usually the stronger lever. If 80% of your calls share the same 3,000-token system prompt, prompt caching at Anthropic reduces the cost of those prefix tokens by about 90% on cache hits. Routing would not come close to the same result for that part of the bill.
Cost growth from an agent with a long history → Context truncation and summarization help here, not routing. See AI agent costs in production.
Slow response under high load → Routing is only a partial answer here. The choice of serving framework (vLLM vs. Ollama), continuous batching, and KV cache size have a greater impact on throughput.
Routing by content, not complexity → A different use case: some queries should go to a model trained on a specific domain (e.g. a fine-tuned model for manufacturing documentation), others to a general model. This is content-based routing and is a separate architectural layer, independent of cost optimization.
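For the repeated-prefix case, a quick calculation shows why caching can dominate. The sketch assumes cached prefix tokens are billed at 10% of the normal input price, matching the 90% reduction mentioned above, and ignores any surcharge for writing the cache. The token counts and the $3 per million input tokens price are illustrative.

```python
def input_cost_per_call(prefix_tokens: int, suffix_tokens: int, price_per_mtok: float,
                        cached: bool, read_multiplier: float = 0.10) -> float:
    """Input cost in USD for one call with a shared prefix and a unique suffix."""
    prefix_price = price_per_mtok * (read_multiplier if cached else 1.0)
    return (prefix_tokens * prefix_price + suffix_tokens * price_per_mtok) / 1_000_000


# Illustrative numbers: 3,000-token system prompt, 500-token user input.
uncached = input_cost_per_call(3000, 500, 3.0, cached=False)
cached = input_cost_per_call(3000, 500, 3.0, cached=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saving {1 - cached / uncached:.0%}")
```

Under these assumptions the input cost per call drops by about 77%, with no change of model and no routing risk.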
Implementation: where to start
If you want to try routing, here is the sequence we recommend:
1. Collect data before you implement anything. Log queries with metadata: length, task type, response time, quality score (if available). Without data you will calibrate every router blind.
2. Start with a heuristic router on the most obvious split. If you know that 30% of your queries are classifications into four categories and 70% are generative, a simple rule of "if type=classification → small model" delivers immediate savings with no ML overhead.
3. Test small-model quality on real data. Do not rely on benchmark numbers. Take 100 real queries from your production, run them through the small model, and evaluate manually. You will discover where the model fails and where it surprises you.
4. Implement a cascade with logprob thresholding. Both vLLM and SGLang expose token log-probabilities. Set the threshold experimentally: start at 0.80 on a 0–1 confidence score (for example the exponentiated mean token log-probability), then measure escalations and false positives.
5. Monitor the escalation distribution over time. If the escalation rate grows, either the query type is shifting or the small model is degrading on some category. The router needs ongoing calibration; it is not a "set and forget" component.
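The monitoring step can start as a rolling counter. The class below is a hypothetical sketch: the window size, baseline rate, and tolerance are placeholders to be replaced with values measured on your own traffic.

```python
from collections import deque


class EscalationMonitor:
    """Tracks the escalation rate over the last `window` queries."""

    def __init__(self, window: int = 1000, baseline: float = 0.15, tolerance: float = 0.05):
        self.events = deque(maxlen=window)  # True = query was escalated
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, escalated: bool) -> None:
        self.events.append(escalated)

    @property
    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def drifted(self) -> bool:
        """True once the window is full and the rate exceeds baseline + tolerance."""
        return len(self.events) == self.events.maxlen and self.rate > self.baseline + self.tolerance


monitor = EscalationMonitor(window=10, baseline=0.2, tolerance=0.1)
for escalated in [True, False] * 5:
    monitor.record(escalated)
print(monitor.rate, monitor.drifted())
```

Tracking one monitor per task category, rather than a single global one, makes it easier to see which category is degrading.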
For an open-source starting point, RouteLLM from LMSYS / UC Berkeley (Apache 2.0) is a solid foundation — it provides pre-trained classifiers and an evaluation harness. For enterprise deployments requiring on-premises operation, the embedding-based approach with BGE-M3 embeddings and a local Qdrant vector store is more practical from a data-sovereignty standpoint.
When not to do routing
Despite the savings, routing has real contraindications:
- Low volume (fewer than a few thousand queries per day) — the implementation and maintenance overhead outweighs the savings.
- Homogeneous use case — if all queries are equally demanding (e.g. purely complex legal analysis), routing only adds complexity with no benefit.
- High cost of error — in regulated environments where even a partially wrong answer from a cheap model causes a problem (healthcare, compliance, legal decisions), cascading is riskier. Here it is safer to define a narrow set of tasks for the small model with 100% manual verification.
- Model non-determinism on escalation — if your application requires reproducibility (same query always produces the same answer), cascading complicates this: sometimes the small model answers, sometimes the strong one, and outputs differ.
Frequently asked questions
How do I determine whether I have enough query diversity for routing to be worthwhile?
Export a sample of 200–300 real queries and manually classify them into three groups: easy, medium, hard. If more than 30% fall into the "easy" category, routing will deliver measurable savings. If the distribution is uniform or skewed toward "hard," the effect will be marginal.
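Once the sample is annotated, the check itself is a tally. The sketch below assumes labels are stored as the strings "easy", "medium", and "hard"; the 30% cutoff is the rule of thumb from the answer above, and the sample counts are made up.

```python
from collections import Counter


def complexity_mix(labels):
    """Return the share of each tier in an annotated sample."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels provided")
    return {tier: counts.get(tier, 0) / total for tier in ("easy", "medium", "hard")}


def routing_worthwhile(labels, min_easy_share: float = 0.30) -> bool:
    return complexity_mix(labels)["easy"] > min_easy_share


# Made-up annotation result for a 250-query sample.
annotated = ["easy"] * 90 + ["medium"] * 120 + ["hard"] * 40
print(complexity_mix(annotated), routing_worthwhile(annotated))
```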
How does routing differ from load balancing across multiple instances of the same model?
Load balancing addresses throughput and availability — it distributes requests across multiple instances of the same model. Routing addresses model selection — it decides *which* model (a different capability class, a different price) a given query goes to. They are complementary: you can have routing between models and load balancing within each model.
What should I do when the small model answers but the response looks plausible yet is factually wrong?
This is the biggest risk of cascading. Confidence scores derived from log-probabilities do not detect factual errors — a model can be "confident" and still be wrong. Three measures help: (1) restrict the small model to tasks with verifiable output (extraction, classification), not factual questions; (2) add post-processing output verification (regex, JSON schema, checklist); (3) in sensitive domains, always escalate to the strong model or add a human-in-the-loop step.
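Measure (2) can be as simple as a structural check that triggers escalation when it fails. The sketch below validates a hypothetical invoice-extraction output; the field names and the date format are assumptions. It confirms that the output has the right shape, not that the extracted values are true.

```python
import json
import re

# Hypothetical schema for an invoice-extraction task.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
REQUIRED_FIELDS = {"invoice_number": str, "total": (int, float), "due_date": str}


def verify_extraction(raw_output: str) -> bool:
    """Return True if the output parses as JSON and matches the expected shape."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return False
    return bool(DATE_RE.match(data["due_date"]))


good = '{"invoice_number": "INV-2291", "total": 1250.0, "due_date": "2025-03-31"}'
bad = '{"invoice_number": "INV-2291", "total": "a lot", "due_date": "next week"}'
print(verify_extraction(good), verify_extraction(bad))
```

A failed check is a cheap and reliable escalation signal for structured tasks, and it complements rather than replaces the confidence score.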
What is the difference between routing and a multi-agent system where the agent itself decides what to call?
In a multi-agent system the agent orchestrates other tools and models as part of solving a task — decision-making happens inside the LLM loop. A router sits in front of the LLM call and is an external classifier. A router is faster and cheaper to run, but does not have access to the task semantics to the same depth an agent does. For complex pipelines see AI agent architectures.
Can I apply routing to local models, not just cloud APIs?
Yes, and the benefit can be even more pronounced here. On the same server you can run a small 7B model and a larger 34B model. The router sends simple queries to the 7B (lower VRAM, shorter latency) and complex ones to the 34B. The savings are not financial but capacity-based — the 7B handles a much higher throughput on the same hardware, so the 34B stays available for hard cases without a queue building up.
*At MP Industrial Solutions we help companies design a routing strategy that makes sense for their specific query mix — from simple heuristics to a trained classifier built on their own data. If you are managing LLM costs in production or building serving infrastructure for multiple use cases, we are happy to look at your situation together.*
