The client asks: "Which model is best for our use case?" This isn't a useful question. The best model for your task is determined less by raw benchmark performance than by everything around it: where it runs, who has access to the logs, and how much it costs to operate.
Three questions before any model
1. May your prompts (and therefore your data) leave your own infrastructure?
This is both a technical and a legal question. Three options:
- **Yes, anywhere.** Anthropic Claude, OpenAI GPT, Google Gemini, and Mistral are all options here. Lowest operational overhead, and typically the strongest benchmark performance.
- **Yes, but only in the EU.** A region-pinned cloud fits here (Azure OpenAI EU region, Anthropic Sovereign EU, OVH AI Endpoints). Slightly higher latency, slightly slower feature releases, higher list prices.
- **No.** Local deployment: vLLM / SGLang / llama.cpp on your own hardware. A one-time investment in GPUs, with electricity as the main running cost.
The third option looks the least convenient. In regulated sectors (law, healthcare, finance), it's usually the only one that will get you through a compliance audit.
2. What is the expected daily consumption (tokens + requests)?
Cloud becomes expensive when requests run continuously. Cloud LLM pricing is roughly $5–25 per million tokens; if your system processes 200 million tokens per day (by no means impossible), that's $1,000–5,000 per day, or $30k–150k per month.
Local deployment (Llama 3.1 70B AWQ on 2× RTX A6000): one-time hardware ~$15k, power consumption ~$200 per month, maintenance ~$500 per month. Against that cloud bill, payback is measured in weeks, not years.
Conversely, if your use case is sporadic (50 queries per day, peak 500 per week), local hardware never pays for itself. The server runs at 1% utilization and depreciates for nothing.
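The break-even arithmetic above can be sketched in a few lines of Python. The function names are illustrative, and the 2,000 tokens per query used for the sporadic case is an assumption, not a figure from the text; the other numbers are the ones quoted above.

```python
def cloud_monthly_cost(tokens_per_day: float, usd_per_million: float, days: int = 30) -> float:
    """Monthly cloud spend for a steady daily token volume."""
    return tokens_per_day / 1_000_000 * usd_per_million * days


def payback_months(hardware_usd: float, local_monthly_usd: float, cloud_monthly_usd: float) -> float:
    """Months until the one-time hardware cost is recovered; inf if it never is."""
    monthly_saving = cloud_monthly_usd - local_monthly_usd
    if monthly_saving <= 0:
        return float("inf")
    return hardware_usd / monthly_saving


# Local running cost from the text: ~$200 power + ~$500 maintenance per month.
LOCAL_MONTHLY_USD = 700
HARDWARE_USD = 15_000

# Heavy load: 200M tokens/day at the low end of the price range ($5 per million).
heavy = cloud_monthly_cost(200_000_000, 5.0)
print(f"heavy load: ${heavy:,.0f}/month cloud, "
      f"payback {payback_months(HARDWARE_USD, LOCAL_MONTHLY_USD, heavy):.2f} months")

# Sporadic load: 50 queries/day, ASSUMING ~2,000 tokens per query, at $25 per million.
sporadic = cloud_monthly_cost(50 * 2_000, 25.0)
print(f"sporadic load: ${sporadic:,.0f}/month cloud, "
      f"payback {payback_months(HARDWARE_USD, LOCAL_MONTHLY_USD, sporadic)} months")
```

Even at the cheapest cloud price the heavy load pays the hardware back in about half a month, while the sporadic load never covers the local running cost at all.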
3. What is the maximum acceptable response latency?
- < 1 s to first token? **Local with a warm cache**, or a cloud endpoint close to your users. Even then, cloud does not match a local GPU with a prompt-cache hit.
- 1–3 s? Either.
- > 3 s? Cloud without question.
When local (unambiguously)
- Data is subject to compliance regulation (MiCA, GDPR Article 9, HIPAA, ISO 27001 with explicit data residency).
- Consumption > 50M tokens/day with a stable, predictable load.
- Existing data MUST NOT be sent to the model provider, even if they claim they won't use it for training. This is political risk versus operational convenience, and it depends on the clause in the contract, not on the PR announcement.
- A domain-specific fine-tune you'll need to redistribute: with a local model that means copying a file, with a cloud-hosted custom model it means vendor lock-in.
When cloud (unambiguously)
- Sporadic use, daily volumes < 10M tokens, no regulation.
- You need the very latest capabilities (Claude Opus 4.5, GPT-5, Gemini Ultra 2 can't be locally replicated — and by the time open models catch up, you're 6–12 months behind).
- The team has no capacity for MLOps / a dedicated AI engineer; with cloud, that operational work is part of what you pay for.
When hybrid
The most common real scenario. Local model for 80% of requests (routine, compliance-sensitive). Cloud for 20% (complex, where the local model isn't enough, and where data is less sensitive). A router in front of both decides, per request, where to send it.
This requires:
- A router that combines rule-based checks with an LLM-as-router for routing decisions
- A per-request audit log of where each request went and why
- Failover: if the cloud fails, the local model takes over; if the request is beyond what the local model can handle, send it to an alternative cloud route instead
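A minimal sketch of such a router, assuming the sensitivity and complexity flags have already been set upstream (by rules or by an LLM-as-router). The class names, the `Backend` callable type, and the stub backends are all hypothetical; a real deployment would call actual model endpoints and persist the audit log.

```python
import time
from dataclasses import dataclass
from typing import Callable

Backend = Callable[[str], str]  # hypothetical: takes a prompt, returns an answer


@dataclass
class Request:
    prompt: str
    sensitive: bool     # compliance-sensitive data must stay local
    complex_task: bool  # judged to be beyond the local model


@dataclass
class AuditEntry:
    timestamp: float
    route: str
    reason: str


class Router:
    def __init__(self, local: Backend, clouds: list[tuple[str, Backend]]):
        self.local = local
        self.clouds = clouds  # ordered: primary cloud route first, alternatives after
        self.audit_log: list[AuditEntry] = []

    def _record(self, route: str, reason: str) -> None:
        self.audit_log.append(AuditEntry(time.time(), route, reason))

    def handle(self, req: Request) -> str:
        # Rule 1: sensitive data never leaves local infrastructure.
        if req.sensitive:
            self._record("local", "sensitive data stays local")
            return self.local(req.prompt)
        # Rule 2: routine requests are served locally.
        if not req.complex_task:
            self._record("local", "routine request")
            return self.local(req.prompt)
        # Complex, non-sensitive: try each cloud route in order.
        for name, backend in self.clouds:
            try:
                answer = backend(req.prompt)
            except Exception as exc:
                self._record(name, f"failed: {exc!r}")
                continue
            self._record(name, "complex request, data not sensitive")
            return answer
        # Last resort: degraded answer from the local model.
        self._record("local", "all cloud routes failed")
        return self.local(req.prompt)


# Stub backends for illustration only.
def local_stub(prompt: str) -> str:
    return f"local:{prompt}"

def broken_cloud(prompt: str) -> str:
    raise TimeoutError("cloud A unavailable")

def backup_cloud(prompt: str) -> str:
    return f"cloud-b:{prompt}"

router = Router(local_stub, [("cloud-a", broken_cloud), ("cloud-b", backup_cloud)])
first = router.handle(Request("summarise contract", sensitive=True, complex_task=True))
second = router.handle(Request("multi-step analysis", sensitive=False, complex_task=True))
for entry in router.audit_log:
    print(entry.route, "-", entry.reason)
```

The sensitivity check comes first on purpose: a request that is both sensitive and complex stays local, even if the answer is weaker.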
The cost nobody puts in the deck
The cost of LLM operations isn't only the cost of tokens. It is also:
- The cost of `prompt-engineering` rounds. Somebody must tune prompts for the model, and when the model changes (a cloud upgrade), the prompts need re-tuning.
- The cost of a `fine-tune` when prompts alone aren't enough. Local: $200–2,000 per training run; cloud-hosted: ~$10k+ for a vendor-specific fine-tune.
- The cost of an `eval set + regression tests`. With every model upgrade, answers to 5–15% of questions can change. Somebody must maintain an eval set of 200+ questions that detects drift.
- The cost of `incident response` when the vendor reduces capacity (a lowered rate limit, increased latency) without notice. A local model eliminates this risk category entirely.
Real benchmark: after 18 months of operating an AI system with 5 engineers, the TCO of a hybrid deployment with a local base was ~40% lower than that of a cloud-only deployment with the same performance.
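The eval-set regression check can be sketched as follows. This is an illustration under simplifying assumptions: answers are compared by exact string match (real eval sets usually need fuzzier scoring), the 5% review threshold is an assumed value taken from the low end of the range mentioned above, and the four-question set stands in for the 200+ questions a real one would hold.

```python
from typing import Callable

EvalSet = list[tuple[str, str]]  # (question, baseline answer) pairs


def drift_rate(eval_set: EvalSet, model: Callable[[str], str]) -> float:
    """Fraction of eval questions whose answer no longer matches the baseline."""
    if not eval_set:
        raise ValueError("eval set is empty")
    changed = sum(
        1 for question, baseline in eval_set
        if model(question).strip() != baseline.strip()
    )
    return changed / len(eval_set)


def needs_review(rate: float, threshold: float = 0.05) -> bool:
    """True when drift exceeds the (assumed) acceptable threshold."""
    return rate > threshold


# Toy baseline recorded against the previous model version.
baseline = [
    ("capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("HTTP status for Not Found?", "404"),
    ("opposite of hot?", "cold"),
]

# Stand-in for the upgraded model: one answer has changed.
upgraded_answers = {
    "capital of France?": "Paris",
    "2 + 2?": "4",
    "HTTP status for Not Found?": "404",
    "opposite of hot?": "cool",
}

rate = drift_rate(baseline, lambda q: upgraded_answers[q])
print(f"drift: {rate:.0%}, review needed: {needs_review(rate)}")
```

Running a check like this on every model upgrade, cloud or local, turns silent answer changes into something the team can see and act on.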
What our default is
For small clients (< 5M tokens/day, low regulation) — cloud via OpenAI / Anthropic API directly. Cheap, fast, no MLOps.
For medium (5–100M tokens/day, simple compliance) — hybrid. vLLM locally for the base, cloud fallback for edge cases.
For large (> 100M tokens/day, regulated sector) — fully local. SGLang or vLLM + 2–4× GPU server, fine-tune via Unsloth, monitoring via Trackio.
This is not a universal formula. It is a starting point. The real choice depends on your data, your regulations, and the team you already have.
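As a first pass, the three defaults above can be written down as a lookup. The regulation labels (`"low"`, `"simple"`, `"regulated"`) and the function name are hypothetical, and combinations the defaults do not cover (for example small volume in a regulated sector) deliberately return `None` rather than a guess.

```python
from typing import Optional


def default_deployment(tokens_per_day: int, regulation: str) -> Optional[str]:
    """Map volume and regulation level to a starting-point deployment.

    regulation is one of the hypothetical labels "low", "simple", "regulated".
    Returns None when the case falls outside the three stated defaults.
    """
    if tokens_per_day < 5_000_000 and regulation == "low":
        return "cloud"
    if 5_000_000 <= tokens_per_day <= 100_000_000 and regulation == "simple":
        return "hybrid"
    if tokens_per_day > 100_000_000 and regulation == "regulated":
        return "local"
    return None  # not covered by the defaults: decide case by case


for volume, level in [(2_000_000, "low"), (40_000_000, "simple"),
                      (250_000_000, "regulated"), (2_000_000, "regulated")]:
    print(volume, level, "->", default_deployment(volume, level))
```

The `None` branch is the useful part: it marks exactly the cases where the data, the regulations, and the team have to be looked at individually.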
---
*We write this as a technical partner, not as a vendor of a specific stack. If you're interested in a concrete use case, we'll walk the numbers on a 30-minute call.*