A client says: "We want to upload our corporate documentation to GPT-5 / Claude / Llama so it answers questions from our employees / clients / partners." Half imagine fine-tuning, the other half RAG, and a third half an uncertain mixture of both. This article is a decision framework for the first workshop: when RAG, when fine-tuning, when a combination, and when you should wait half a year and deploy nothing.
Two worlds, two goals
**RAG (Retrieval-Augmented Generation):**
- Data is external; the model doesn't see it during training
- At inference the model receives the question + relevant chunks of data as context
- "Give me the 5 most relevant paragraphs from documentation that answer question X" → we send them to the model
- The model answers based on precise documentation and can cite the source

**Fine-tuning:**
- Data is baked into the model's weights during training
- At inference the model "remembers" the data (or at least its statistical reflection)
- The model answers with the style / format / domain knowledge we taught it
- The original data source is NOT accessible at inference, only its parametric representation
These two worlds **don't solve the same problem.** The most common mistake clients make: choosing fine-tuning when their real problem requires RAG.
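To make the RAG flow concrete, here is a minimal sketch. It assumes a toy in-memory corpus and bag-of-words cosine similarity standing in for a real embedding model and vector database; `DOCS`, `retrieve`, and `build_prompt` are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base: (source id, text).
# A real system would store embedded chunks in a vector DB.
DOCS = [
    ("pricing.md#rates", "The hourly rate for client Acme is 120 EUR."),
    ("machines.md#y200", "Machine Y200 runs at 400 V and 50 Hz."),
    ("hr.md#vacation", "Employees receive 25 days of vacation per year."),
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs, k=2):
    # Score every document against the question and keep the top k with any overlap.
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), src, text) for src, text in docs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(src, text) for score, src, text in scored[:k] if score > 0]

def build_prompt(question, chunks):
    # The source id travels with each chunk so the model can cite it.
    context = "\n".join(f"[{src}] {text}" for src, text in chunks)
    return (
        "Answer using only the context below and cite the source id in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "What is the hourly rate for client Acme?"
print(build_prompt(question, retrieve(question, DOCS)))
```

The source id is carried into the prompt, which is what makes a cited, auditable answer possible; a fine-tuned model has no equivalent handle on where a fact came from.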
Test: which is your task?
Answer these four questions:
1. Are you looking up FACTS in the data, or teaching STYLE?
- **Facts** ("What is our hourly rate for client X?", "What are the parameters of machine Y?") → **RAG**. The fact must be loaded precisely from an authoritative source. A fine-tuned model may invent the fact, because it only holds a statistical reflection of the training data and its hallucinations are hard to predict.
- **Style** ("Write in formal legal language", "Answer in the structured format of our technical reports") → **fine-tuning** can help. Note that RAG with a well-written system prompt often achieves 80–90% of the same result.
2. How often does the data change?
- **Daily / weekly** → **RAG**. Retraining the model on every data change costs $50–500 and 2–8 hours per run. Re-indexing a RAG knowledge base takes about 5 minutes and costs about €0.50.
- **Monthly / quarterly** → either works. RAG handles this cadence just as comfortably.
- **Once every 2+ years** → fine-tuning can be considered if domain knowledge is stable (medical protocols, legal codes, technical standards).
3. Must the answer be auditable?
- **Yes (regulated sectors)** → **RAG is almost mandatory**. The client must be able to demonstrate: "The model said X because it saw Y in document Z." A fine-tuned model "said X" with no way to show where that knowledge came from.
- **No** → fine-tuning becomes an option.
4. What volume of data do you have?
- **< 100k tokens** → neither RAG nor fine-tuning. Put the data directly into the prompt of a model with a context window of 200k tokens or more (Claude Sonnet 4.6, Gemini 2.5 Pro). This is the simplest and fastest option.
- **100k – 10M tokens** → **RAG** is optimal. A vector index over 1–10M tokens takes on the order of 200 MB of memory and serves queries in under 100 ms.
- **10M – 1B tokens** → RAG works but needs a more sophisticated architecture (multi-stage retrieval, hybrid search, reranking). Fine-tuning is a supplement here, not a replacement.
- **> 1B tokens** → fine-tuning as a continued pre-training step, with RAG on top.
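The four questions above can be condensed into a rough decision helper. This is a sketch, not a rule engine: the function name, the parameter names, and the exact cut-offs (7 days for "changes quickly", 730 days for "stable") are assumptions that mirror the thresholds in the text.

```python
def recommend(needs_facts, must_audit, change_interval_days, corpus_tokens):
    """Coarse recommendation from the four workshop questions (illustrative only)."""
    # Question 4: tiny corpora fit straight into the context window.
    if corpus_tokens < 100_000:
        return "context window"
    # Question 4: very large corpora call for both techniques.
    if corpus_tokens > 1_000_000_000:
        return "fine-tuning + RAG"
    # Questions 1-3: facts, auditability, or fast-changing data all point to RAG.
    if needs_facts or must_audit or change_interval_days <= 7:
        return "RAG"
    # Stable domain knowledge (2+ years) makes fine-tuning worth considering.
    if change_interval_days >= 730:
        return "fine-tuning"
    return "either"

print(recommend(needs_facts=True, must_audit=True,
                change_interval_days=1, corpus_tokens=5_000_000))
```

In a workshop the value is less in the return value than in forcing the client to answer each question explicitly before any architecture is discussed.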
When fine-tuning clearly wins
1. Domain language / terminology
Slovak case law, medical Latin, technical abbreviations in your company ("PVRZ" = name of a production protocol that even Google can't guess). The base model doesn't know it. Fine-tuning teaches it.
Example: Mistral 7B fine-tuned on 5,000 examples of Slovak legal documentation → answers in the correct legal language, knows the terminology "odporca", "navrhovateľ", "dohodárenstvo", "zmiernenie sankcie" in the correct context. The base model writes in Wikipedia style.
Cost: SFT on 5,000 examples, RTX 4090, ~6 hours, ~€10 electricity. Realistic in practice.
2. Structured outputs with strict format
"Always answer in JSON with this schema." A system prompt achieves roughly 95% accuracy. Fine-tuning achieves 99.5% or more. In production systems the gap is critical: at 95%, one response in twenty fails to parse, and those failures propagate through the entire downstream pipeline.
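A small sketch of why the gap matters. It assumes a hypothetical two-key schema (`answer`, `sources`) and, for the compounding calculation, a request that chains several model calls; both are illustrative assumptions, not part of any specific deployment.

```python
import json

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical schema for illustration

def parse_or_none(raw):
    """Return the parsed object if it is valid JSON with the required keys, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    return obj

def pipeline_success_rate(per_call_accuracy, calls_per_request):
    """Probability that every model call in one request returns parseable output."""
    return per_call_accuracy ** calls_per_request

# A chatty preamble before the JSON is a typical prompt-only failure mode.
print(parse_or_none('Sure! Here is the JSON: {"answer": "42"}'))
print(pipeline_success_rate(0.95, 3), pipeline_success_rate(0.995, 3))
```

Under the assumption of three chained calls per request, 95% per-call accuracy leaves about 86% of requests fully parseable, while 99.5% leaves about 98.5%. Whatever the accuracy, a validator like `parse_or_none` at the boundary keeps malformed output from reaching downstream systems.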
3. Speed (latency + cost) at high throughput
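RAG = embedding (50 ms) + retrieval (100 ms) + an LLM call with an enlarged prompt (8,000 tokens per request; at 100 RPS that is 800,000 input tokens per second, which is expensive). A fine-tuned model needs only a short prompt (500 tokens per request, or 50,000 input tokens per second at 100 RPS).
At workloads above 100 RPS, fine-tuning can be 5–10× cheaper, with the exact factor depending on how long the outputs are. Below 10 RPS the difference rarely matters.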
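The cost ratio can be checked with simple arithmetic. The sketch below assumes per-token pricing of $3 per million input tokens and $15 per million output tokens (the figures quoted later in this article) and the same price for both variants; the function names are hypothetical.

```python
def request_cost(input_tokens, output_tokens, usd_per_m_input=3.0, usd_per_m_output=15.0):
    """Cost of one request in USD under per-million-token pricing (assumed prices)."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

def rag_vs_finetuned_ratio(output_tokens, rag_input=8_000, ft_input=500):
    """How many times more a RAG request costs than a short-prompt fine-tuned request."""
    return request_cost(rag_input, output_tokens) / request_cost(ft_input, output_tokens)

for out in (50, 200, 500):
    print(out, round(rag_vs_finetuned_ratio(out), 1))
```

With these assumptions the ratio is about 11× for 50-token answers, 6× for 200-token answers, and 3.5× for 500-token answers. The shorter the output, the more the enlarged prompt dominates the bill.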
4. Off-line / on-device deployment
A mobile client that is offline can't call a RAG knowledge base. A fine-tuned 1B–4B model running on the device (CoreML, ExecuTorch, llama.cpp) has the domain knowledge baked in and needs no internet connection.
When RAG clearly wins
1. Data changes quickly
Customer support knowledge base, FAQ, product documentation, internal wikis. Adding a new document = re-index (seconds). Fine-tuning would mean a new training run every day.
2. Citations are mandatory
Compliance, law, medicine, financial advisory. The client must see: "The model thinks X because article 12 paragraph 3 of law Y says so." Fine-tuning doesn't produce citations — it produces a paraphrased answer without an audit trail.
3. Per-user personalization
User A sees their data, user B sees theirs. The model is the same, but the knowledge base is filtered per-user. A fine-tuned model can't change what it knows according to the user.
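A minimal sketch of per-user filtering, assuming each chunk carries access-control metadata. The `Chunk` type, the `allowed_users` field, and the keyword-overlap ranking are hypothetical stand-ins for a vector database with metadata filters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_users: frozenset  # hypothetical access-control metadata

def visible_chunks(chunks, user_id):
    """Keep only the chunks this user is allowed to see."""
    return [c for c in chunks if user_id in c.allowed_users]

def search(chunks, user_id, query):
    # Filter first, then rank: restricted chunks never reach ranking or the prompt.
    terms = set(query.lower().split())
    candidates = visible_chunks(chunks, user_id)
    return sorted(
        candidates,
        key=lambda c: len(terms & set(c.text.lower().split())),
        reverse=True,
    )

chunks = [
    Chunk("invoice for user a", frozenset({"a"})),
    Chunk("invoice for user b", frozenset({"b"})),
    Chunk("public holiday calendar", frozenset({"a", "b"})),
]
print([c.text for c in search(chunks, "a", "invoice")])
```

Filtering before ranking, rather than after generation, means data belonging to another user never enters the model's context at all.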
4. Multi-language / multi-domain
The client has documentation in SK, EN, DE and wants to answer in the language of the question. RAG: one model, 3 knowledge bases (or 1 base with language metadata). Fine-tuning: 3 models, or more complex multi-task training.
Hybrid approach — the most common production reality
In real deployments in 2026, the typical combination is:
1. **Base model:** Claude Sonnet 4.6 or Llama 3.3 70B (open-weight)
2. **Light fine-tuning (LoRA):** on 1–5k domain-specific Q&A examples, teaches the model "how to answer" in your company's style and format
3. **RAG:** over live data (documents, database, ticket system)
4. **System prompt:** summarizes context, identity, guardrails
5. **Reranker:** BGE-Reranker, Cohere Rerank; after retrieval, reorders chunks so the most relevant are on top
This stack solves: the model knows "how to answer" (fine-tune), knows "current data" (RAG), knows "who we are and what the rules are" (system prompt). Plus source citations, plus auditability.
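The order in which these pieces run can be sketched as a small pipeline. Everything here is an assumption for illustration: `retrieve`, `rerank`, and `generate` are caller-supplied callables standing in for a vector search, a reranker model, and the (optionally fine-tuned) LLM.

```python
def answer_pipeline(question, retrieve, rerank, generate, system_prompt,
                    wide_k=20, final_k=5):
    """Retrieve broadly, rerank precisely, then prompt the model with the best chunks."""
    # Stage 1: cheap, recall-oriented retrieval of many candidates.
    candidates = retrieve(question, wide_k)
    # Stage 2: the reranker scores each (question, chunk) pair; keep the best few.
    ranked = sorted(candidates, key=lambda chunk: rerank(question, chunk), reverse=True)
    top = ranked[:final_k]
    # Number the chunks so the answer can cite them.
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(top))
    prompt = f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}"
    return generate(prompt), top

# Stub components so the sketch runs without any external service.
answer, sources = answer_pipeline(
    "What is the rate?",
    retrieve=lambda q, k: ["short", "a longer chunk about rates", "mid chunk"][:k],
    rerank=lambda q, chunk: len(chunk),
    generate=lambda prompt: "stub answer",
    system_prompt="You are the company assistant. Cite sources.",
    final_k=2,
)
print(answer, sources)
```

Returning the selected chunks alongside the answer is what provides the audit trail: the caller can log exactly which sources the model saw.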
Specific tooling 2026
RAG stack — our default choice
- **Vector DB:** Qdrant (self-hosted) or Weaviate. PostgreSQL + pgvector for small use cases (< 1M vectors).
- **Embedding model:** BGE-M3 (open, SK/EN/DE multilingual) or OpenAI text-embedding-3-large for cloud-only setups.
- **Reranker:** BGE-Reranker-Large or Cohere Rerank 3.
- **Orchestration:** LangChain or LlamaIndex for quick PoC, custom Python code for production (LangChain's layer of abstraction becomes a tax in larger systems).
- **Document parsing:** Docling (IBM, open) or Unstructured.io for PDF/DOCX/HTML.
- **Chunking strategy:** semantic chunking (250–500 tokens per chunk), 10–20% overlap, metadata-rich.
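To illustrate the chunk size and overlap parameters, here is a sketch of fixed-size chunking with overlap, plus a rough index size estimate. Note that this uses plain sliding windows, not the semantic chunking recommended above, and that the 1024-dimension float32 vectors are an assumption (it matches BGE-M3's dense output, but check your model).

```python
def chunk_tokens(tokens, chunk_size=400, overlap_ratio=0.15):
    """Split a token list into windows of chunk_size that overlap by overlap_ratio."""
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

def index_size_mb(n_chunks, dim=1024, bytes_per_value=4):
    """Raw vector storage in MB, excluding index overhead and payload metadata."""
    return n_chunks * dim * bytes_per_value / 1_000_000

doc = list(range(1000))  # stand-in for 1,000 token ids
chunks = chunk_tokens(doc)
print(len(chunks), [len(c) for c in chunks])
print(index_size_mb(25_000))
```

With 400-token chunks and 15% overlap, 10M tokens produce roughly 29,000 chunks, or about 120 MB of raw float32 vectors before index overhead and stored payloads are added.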
Fine-tuning stack — when we use it
- **Framework:** Unsloth (2–5× faster than vanilla TRL), HuggingFace TRL for standard workflows.
- **Method:** LoRA (rank 16–32) or QLoRA for VRAM-constrained setups. Full fine-tuning only at >100k examples.
- **Base model:** Llama 3.3 70B, Mistral Small 3 (24B), Qwen 2.5 32B depending on license + language.
- **Eval:** Custom eval set with 200+ questions + standard benchmarks (MMLU, HellaSwag) to detect regression.
- **Serving:** vLLM or SGLang for throughput, llama.cpp for local / on-device.
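To give a feel for what a LoRA rank means in practice, the sketch below counts trainable adapter parameters. LoRA adds two low-rank matrices per adapted weight, which amounts to `rank * (d_in + d_out)` parameters. The model shape used here (hidden size 4096, 32 layers, two adapted projections per layer) is a hypothetical example, not any of the models listed above.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one weight matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def adapter_params(layer_shapes, rank, n_layers):
    """Total adapter parameters when the same matrices are adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in layer_shapes)
    return n_layers * per_layer

# Hypothetical model: hidden size 4096, 32 layers, adapters on two 4096x4096 projections.
shapes = [(4096, 4096), (4096, 4096)]
for rank in (16, 32):
    print(rank, adapter_params(shapes, rank, n_layers=32))
```

At rank 16 this hypothetical setup trains about 8.4M parameters, a tiny fraction of the base model, which is why LoRA runs fit on modest hardware and the adapter can be merged into the weights afterwards.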
Costs — real numbers 2026
RAG deployment (typical B2B knowledge base)
- 50k documents, 10M tokens, 500 RPS peak
- Vector DB: Qdrant on a 32GB VPS, $80/month
- Embedding (BGE-M3 self-hosted): RTX 4090 server, $200/month amortization
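- LLM (Claude Sonnet 4.6): ~$3/M input tokens, ~$15/M output tokens. With an average of 8k input + 500 output tokens, one request costs about $0.03, so **$4,500–6,000 monthly** covers roughly 140,000–190,000 requests per month. That is an average load far below the 500 RPS peak; sustaining 500 RPS around the clock would cost orders of magnitude more.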
- Total: **~$4,800–6,300/month** plus one-time initialization $5–15k
Or fully local stack with Llama 3.3 70B on 2× H100: hardware $80–120k one-time, operation $300/month electricity + maintenance. Payback vs. cloud-only: 12–18 months.
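The LLM line item above can be sanity-checked with a few lines of arithmetic. This sketch assumes the quoted per-million-token prices and the stated average request size; the function names are hypothetical.

```python
def per_request_usd(input_tokens=8_000, output_tokens=500,
                    usd_per_m_input=3.0, usd_per_m_output=15.0):
    """Cost of one average request at per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

def monthly_llm_cost(requests_per_month):
    return requests_per_month * per_request_usd()

def requests_for_budget(budget_usd):
    """How many average requests a monthly budget pays for."""
    return budget_usd / per_request_usd()

def sustained_rps_to_requests(rps, days=30):
    """Requests per month if the given rate were held around the clock."""
    return rps * 86_400 * days

print(per_request_usd())
print(monthly_llm_cost(150_000))
print(int(requests_for_budget(4_500)), int(requests_for_budget(6_000)))
print(monthly_llm_cost(sustained_rps_to_requests(1)))
```

At these prices one request costs about $0.0315, so 150,000 requests per month come to about $4,725. Even 1 RPS held around the clock would be roughly 2.6M requests and over $80,000 per month, so budget from expected monthly request volume, not from peak RPS.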
Fine-tuning deployment
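- One-time training (LoRA, 5,000 examples, Llama 3.3 70B): $30–80 on cloud GPUs. On an RTX 4090 you already own, the electricity costs about $5, but only for base models that fit in its 24 GB of VRAM; a 70B model does not fit on a single card, even with 4-bit QLoRA.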
- Eval + iteration cycle: 3–6 iterations × $50 = $150–300
- Hosting fine-tuned model: same as base (LoRA premium is zero with merged weights)
- Maintenance: retrain every 3–6 months when the domain changes
Real fine-tuning cost in a production system: **< $1,000 annually**, if you have a team capable of maintaining it. The hidden cost is "the person who can do eval and interpret results" — not GPU.
When to deploy neither
- Data is small (< 50 documents) → use a hosted LLM product directly (Claude Projects, a custom GPT in ChatGPT, Gemini in Google Workspace), with no custom infrastructure.
- The team has no MLOps capacity and you're not willing to invest in a data engineer for 6+ months.
- The domain changes rapidly (startup MVP, product experimentation) → wait until the data stabilizes.
- Client data is highly regulated and you don't have a completed DPIA (GDPR impact assessment) — first solve compliance, then deploy.
---
*We do both RAG and fine-tuning as part of AI integrations. If you're considering deploying an LLM over a corporate knowledge base, the first consultation (90 minutes) walks through these four decision questions against your real use case and gives you an indicative architecture and budget before you commit to one path or the other.*