Local LLM vs Cloud — When to Choose Which and Why

The client asks: "Which model is best for our use case?" This isn't a useful question. The best model for your task is determined by what it brings beyond raw performance: where it runs, who has access to the logs, and how much it costs to operate.

Three questions before any model

1. May your prompts (and therefore your data) leave your own infrastructure?

This is both a technical and a legal question. Three options:

Yes, anywhere. Here Anthropic Claude, OpenAI GPT, Google Gemini, and Mistral all work. Lowest operational overhead and, typically, the strongest benchmark results.
Yes, but only in the EU. Here a region-bound cloud works (Azure OpenAI EU region, Anthropic Sovereign EU, OVH AI Endpoints). Slightly higher latency, slightly slower feature releases, higher list prices.
No. Local deployment: vLLM / SGLang / llama.cpp on your own hardware. A one-time investment in GPUs; the running cost is mostly electricity.

The third option looks the least convenient. In regulated sectors (law, healthcare, finance), it's usually the only one that will pass your compliance audit.

2. What is the expected daily consumption (tokens + requests)?

Cloud becomes expensive when requests run continuously. Cloud LLM pricing runs roughly $5–25 per million tokens; if your system processes 200 million tokens per day (by no means impossible), that's $1,000–5,000 per day, or $30k–150k per month.

Local deployment (Llama 3.1 70B AWQ on 2× RTX A6000): one-time hardware ~$15k, electricity ~$200 per month, maintenance ~$500 per month. Payback is measured in weeks, not years.

Conversely, if your use case is sporadic (50 queries per day, peak 500 per week), local hardware never pays for itself. The server runs at 1% utilization and depreciates for nothing.
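
The arithmetic above can be sketched as a small break-even calculation. This is a minimal sketch, not a pricing model: the dollar figures are the ones quoted in the text, while the function names, the 30-day month, and the 2,000 tokens per query used in the sporadic example are assumptions made for illustration. It also assumes the local hardware can actually serve the stated volume, which has to be verified separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalCosts:
    hardware_usd: float = 15_000.0        # one-time, figure from the text
    power_usd_month: float = 200.0        # figure from the text
    maintenance_usd_month: float = 500.0  # figure from the text


def cloud_cost_per_day(tokens_per_day: float, usd_per_million: float) -> float:
    """Daily cloud bill for a given token volume and price per million tokens."""
    return tokens_per_day / 1_000_000 * usd_per_million


def payback_days(
    tokens_per_day: float,
    usd_per_million: float,
    local: LocalCosts = LocalCosts(),
    days_per_month: int = 30,  # assumption: flat 30-day month
) -> Optional[float]:
    """Days until the hardware is paid off by avoided cloud spend.

    Returns None when cloud is cheaper per day than running the local server,
    i.e. the hardware never pays for itself.
    """
    local_daily = (local.power_usd_month + local.maintenance_usd_month) / days_per_month
    daily_saving = cloud_cost_per_day(tokens_per_day, usd_per_million) - local_daily
    if daily_saving <= 0:
        return None
    return local.hardware_usd / daily_saving


if __name__ == "__main__":
    # Continuous load from the text: 200M tokens/day at the low end of pricing.
    print(payback_days(200_000_000, 5.0))
    # Sporadic load: 50 queries/day at an assumed 2,000 tokens per query.
    print(payback_days(50 * 2_000, 25.0))
```

With the continuous load, the payback lands at a bit over two weeks even at the cheapest cloud price; with the sporadic load, the function returns None because the daily cloud bill is smaller than the local running cost.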

3. What is the maximum acceptable response latency?

< 1 s to first token? Local with a warm cache, or a cloud endpoint close to you (cloud never matches a local GPU with a prompt-cache hit).
1–3 s? Either.
> 3 s? Cloud without question.
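
To compare a candidate deployment against these thresholds, you need to measure time to first token on a streaming response. The sketch below works on any Python iterator of tokens; `fake_stream` is a hypothetical stand-in for a real streaming client, and the function names are illustrative rather than part of any library.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrived, that token)."""
    start = time.perf_counter()
    first = next(iter(stream))  # raises StopIteration if the stream is empty
    return time.perf_counter() - start, first


def placement_for_budget(max_ttft_s: float) -> str:
    """Map a latency budget to the three options listed above."""
    if max_ttft_s < 1.0:
        return "local with warm cache, or a nearby cloud endpoint"
    if max_ttft_s <= 3.0:
        return "either"
    return "cloud"


def fake_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical streaming response: waits, then yields tokens."""
    time.sleep(delay_s)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttft, token = time_to_first_token(fake_stream(0.05))
    print(f"first token {token!r} after {ttft:.3f} s")
    print(placement_for_budget(0.8))
```

In practice you would run the measurement many times, with and without a warm cache, and compare a high percentile against the budget rather than a single sample.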

When local (unambiguously)

Data is subject to compliance regulation (MiCA, GDPR Article 9, HIPAA, ISO 27001 with explicit data residency).
Daily consumption > 50M tokens/day, stable predictable load.
Existing data MUST NOT be sent to the model provider, even if the provider claims it won't use the data for training. This is political risk weighed against operational convenience, and it depends on the clause in the contract, not on the PR announcement.
A domain-specific fine-tune you'll need to redistribute — with a local model it means copying a file, with a cloud-hosted custom model it means vendor lock-in.

When cloud (unambiguously)

Sporadic use, daily volumes < 10M tokens, no regulation.
You need the very latest capabilities (Claude Opus 4.5, GPT-5, Gemini Ultra 2 can't be locally replicated — and by the time open models catch up, you're 6–12 months behind).
The team has no capacity for MLOps or a dedicated AI engineer; the cloud price includes that work.

When hybrid

The most common real scenario. Local model for 80% of requests (routine, compliance-sensitive). Cloud for 20% (complex, where the local model isn't enough, and where data is less sensitive). A router in front of both decides, per request, where to send it.

This requires:
- A router that combines rule-based routing with an LLM-as-router for the routing decisions
- A per-request audit log of where each request went and why
- Failover (if cloud fails, the local model takes over; but if the request is qualitatively beyond the local model, route it to another cloud provider)
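
The three requirements can be sketched together in a few dozen lines. This is an illustrative sketch under stated assumptions, not a production router: the `Request` fields, the backend names (`local`, `cloud_a`, `cloud_b`), and the 0/1/2 complexity score are hypothetical, and the score is assumed to come from upstream rules or a small classifier model (the LLM-as-router part), which is not shown.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Backend = Callable[[str], str]  # takes a prompt, returns an answer, may raise


@dataclass
class Request:
    prompt: str
    sensitive: bool  # compliance-sensitive data must never leave local
    complexity: int  # hypothetical score: 0 routine, 1 complex, 2 beyond the local model


@dataclass
class Router:
    backends: Dict[str, Backend]
    audit_log: List[dict] = field(default_factory=list)

    def plan(self, req: Request) -> Tuple[List[str], str]:
        """Rule-based routing: ordered list of backends to try, plus the reason."""
        if req.sensitive:
            return ["local"], "sensitive data stays local"
        if req.complexity >= 2:
            return ["cloud_a", "cloud_b"], "beyond local model; fail over to second cloud"
        if req.complexity == 1:
            return ["cloud_a", "local"], "complex; fail over to local"
        return ["local"], "routine request"

    def handle(self, req: Request) -> str:
        chain, reason = self.plan(req)
        for name in chain:
            # One audit entry per attempt: where the request went and why.
            entry = {"ts": time.time(), "backend": name, "reason": reason}
            try:
                answer = self.backends[name](req.prompt)
            except Exception as exc:  # any backend failure triggers failover
                entry.update(ok=False, error=repr(exc))
                self.audit_log.append(entry)
                continue
            entry.update(ok=True)
            self.audit_log.append(entry)
            return answer
        raise RuntimeError("all backends failed: " + ", ".join(chain))


if __name__ == "__main__":
    def failing_cloud(prompt: str) -> str:
        raise TimeoutError("cloud_a unavailable")

    router = Router({
        "local": lambda p: "local answer",
        "cloud_a": failing_cloud,
        "cloud_b": lambda p: "cloud_b answer",
    })
    print(router.handle(Request("hard question", sensitive=False, complexity=2)))
    for entry in router.audit_log:
        print(entry)
```

Keeping the routing plan separate from the dispatch loop means the rules can be tested without calling any model, and the audit log records failed attempts as well as the one that finally answered.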

The cost nobody puts in the deck

The cost of LLM operations isn't only the cost of tokens. It is also:
- The cost of prompt-engineering rounds. Somebody must tune prompts for the model, and when the model changes (a cloud upgrade), the prompts need re-tuning.
- The cost of fine-tuning when prompts alone aren't enough. Local: $200–2,000 per training run; cloud-hosted: ~$10k+ for a vendor-specific fine-tune.
- The cost of an eval set and regression tests. With every model upgrade, answers to 5–15% of questions can change. Somebody must maintain an eval set of 200+ questions that detects drift.
- The cost of incident response when the vendor reduces capacity (a lowered rate limit, increased latency) without notice. A local model eliminates this risk category entirely.
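
The eval-set item is the easiest of these to automate. A minimal sketch of a drift check follows, assuming the baseline is a mapping from question to the answer the previous model version gave, and that whitespace- and case-insensitive exact match is an acceptable notion of "same answer". Both are simplifying assumptions; a real eval set would more likely use a task-specific grader. All names here are illustrative.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Case- and whitespace-insensitive form used for comparison (an assumption)."""
    return " ".join(text.lower().split())


def changed_questions(baseline: Dict[str, str], model: Callable[[str], str]) -> List[str]:
    """Questions whose answer from the new model differs from the stored baseline."""
    return [q for q, old in baseline.items() if normalize(model(q)) != normalize(old)]


def drift_report(
    baseline: Dict[str, str],
    model: Callable[[str], str],
    max_drift: float = 0.05,  # assumed alert threshold; the text cites 5-15% as typical change
) -> dict:
    if not baseline:
        raise ValueError("eval set is empty")
    changed = changed_questions(baseline, model)
    rate = len(changed) / len(baseline)
    return {"rate": rate, "changed": changed, "regressed": rate > max_drift}


if __name__ == "__main__":
    baseline = {
        "capital of France?": "Paris",
        "2 + 2?": "4",
        "boiling point of water at sea level in C?": "100",
        "largest planet?": "Jupiter",
    }
    # Hypothetical upgraded model: one answer changed, one differs only in case.
    new_answers = dict(baseline, **{"largest planet?": "Saturn", "capital of France?": "paris"})
    print(drift_report(baseline, new_answers.__getitem__))
```

A changed answer is not necessarily a worse one, so the `changed` list is a review queue for a human or a grader rather than a verdict.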

Real benchmark: after 18 months of operating an AI system with 5 engineers, the TCO of a local-first hybrid deployment is ~40% lower than that of a cloud-only deployment with the same performance.

What our default is

For small clients (< 5M tokens/day, low regulation) — cloud via OpenAI / Anthropic API directly. Cheap, fast, no MLOps.

For medium (5–100M tokens/day, simple compliance) — hybrid. vLLM locally for the base, cloud fallback for edge cases.

For large (> 100M tokens/day, regulated sector) — fully local. SGLang or vLLM + 2–4× GPU server, fine-tune via Unsloth, monitoring via Trackio.
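
These defaults can be written down as a small lookup. The volume thresholds are the ones stated above; the three regulation labels and the rule for mixed cases (take the stricter of the volume tier and the regulation tier, so a small but regulated client lands on local) are assumptions added for the sketch, since the text only describes the three matching combinations.

```python
DEPLOYMENTS = ["cloud", "hybrid", "local"]  # ordered from least to most local

# Hypothetical labels for the three regulation levels named in the text.
REGULATION_TIER = {"low": 0, "simple": 1, "regulated": 2}


def volume_tier(tokens_per_day: float) -> int:
    """0 for < 5M tokens/day, 1 for 5-100M, 2 for > 100M (thresholds from the text)."""
    if tokens_per_day < 5_000_000:
        return 0
    if tokens_per_day <= 100_000_000:
        return 1
    return 2


def default_deployment(tokens_per_day: float, regulation: str) -> str:
    """Starting-point recommendation; mixed cases use the stricter tier (assumption)."""
    if regulation not in REGULATION_TIER:
        raise ValueError(f"unknown regulation level: {regulation!r}")
    tier = max(volume_tier(tokens_per_day), REGULATION_TIER[regulation])
    return DEPLOYMENTS[tier]


if __name__ == "__main__":
    print(default_deployment(1_000_000, "low"))
    print(default_deployment(50_000_000, "simple"))
    print(default_deployment(200_000_000, "regulated"))
```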

This is not a universal formula. It is a starting point. The real choice depends on your data, your regulations, and the team you already have.

---

*We write this as a technical partner, not as a vendor of a specific stack. If you're interested in a concrete use case, we'll walk the numbers on a 30-minute call.*
