Most RAG (Retrieval-Augmented Generation) debugging in 2026 still concentrates on the model's prompts and system instructions. A client rewrites the prompt 12 times over 4 days, and answer precision improves from 67% to 70%. Meanwhile, three settings in the retrieval layer (chunking, embedding model, reranking) could push precision to 84% in half a day of work. This article is about those three settings.
Why retrieval decides, not the LLM
The basic asymmetry of RAG architecture: if retrieval returns **the right pieces of documentation**, even an average 7B model gives a quality answer. If retrieval returns irrelevant or incomplete pieces, even Claude Opus 4.6 won't save the answer, because the model simply doesn't have the facts in its context. In RAG, "garbage in, garbage out" is about as close to a physical law as it gets.
A concrete example: a legal RAG over Slovak legislation, 12,000 documents, 8M tokens. With default chunking (500 tokens fixed-size) and OpenAI text-embedding-ada-002 we measured precision@5 = 0.61 and recall@10 = 0.72. After changing three settings (document-aware chunking, embedding upgrade to BGE-M3, Cohere Rerank 3), precision@5 rose to 0.84 and recall@10 to 0.93. **No LLM change, no prompt change.**
Setting 1: chunk size + chunking strategy
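The most common configuration in PoC projects is `RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)`, copied from a LangChain example. Note that `chunk_size` counts characters by default, not tokens, unless the splitter is built with a token-based length function. It works for about 60% of use cases. For the rest it is wrong.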
**Two decision dimensions:**
A) Chunk size (tokens)
- **128–256 tokens:** high precision, retrieval finds exactly the paragraph that answers the question. Risk: the context around the paragraph is missing, so the model doesn't see the whole thread of thought. Suitable for FAQ, code snippets, structured data.
- **256–512 tokens:** the most common compromise. A paragraph with its surrounding context. The default choice for most B2B knowledge base deployments.
- **512–1024 tokens:** broader context, the model gets more connections, but retrieval precision drops (the embedding of a 1,024-token chunk "dilutes" the main theme). Suitable for longer narrative documents (legal decisions, research papers, technical manuals).
- **> 1,024 tokens:** rarely correct. Many embedding models have an effective length of roughly 512 tokens, and most information past that threshold "dissolves" into one vector. The exception is long-context embedding models (Voyage-3-large, BGE-M3 in full mode, NV-Embed-v2), which are built for inputs of 8,192 tokens or more.
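To get a feel for how chunk size and overlap interact, here is a minimal sliding-window sketch. It assumes whitespace-separated words as a stand-in for real tokenizer tokens, and `chunk_fixed` is a hypothetical helper, not a LangChain API.

```python
from typing import List


def chunk_fixed(tokens: List[str], chunk_size: int = 256, overlap: int = 32) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace splitting is only an approximation of a model tokenizer (assumption).
sample_tokens = ("word " * 1000).split()
for size in (128, 256, 512):
    windows = chunk_fixed(sample_tokens, chunk_size=size, overlap=size // 8)
    print(f"chunk_size={size}: {len(windows)} chunks")
```

Smaller windows produce more, narrower vectors to search. Larger windows produce fewer vectors, and each one has to summarise more text.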
B) Splitting strategy (fixed vs semantic vs document-aware)
- **Fixed-size chunking:** simple and deterministic, but it breaks paragraphs, sentences, even words at the boundary. Context loss on ~15% of chunks, which translates to an 8–12% drop in precision@5.
- **Semantic chunking:** uses embeddings to detect boundaries of meaningful segments (e.g. `SemanticSplitterNodeParser` in LlamaIndex). Preserves context better, but chunks are of variable size, so when assembling the LLM context you have to budget for chunks ranging from 200 to 800 tokens. Improvement in precision@5 by 5–8%.
- **Document-aware chunking:** exploits document structure (markdown headings, HTML `<section>`, PDF sections via Docling or Unstructured.io). Chunks correspond to the author's logical units. **The best choice for 80% of B2B use cases**, improvement in precision@5 by 12–18% over fixed-size.
A practical configuration for a legal RAG: Docling-parsed PDF → split at each `<heading>` in document order → if a heading section exceeds 800 tokens, sub-split by paragraphs (`\n\n`) → attach metadata to each chunk: `{doc_id, section, page, jurisdiction, paragraph_number}`. Keep 10–20% overlap between adjacent chunks for continuity.
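The heading-then-paragraph split described above can be sketched as follows. This is an illustration under assumptions, not the production pipeline: it takes markdown text as a stand-in for parser output, counts whitespace-separated words instead of tokens, omits the overlap step, and fills only part of the metadata (`page`, `jurisdiction` and `paragraph_number` would have to come from the parser). `chunk_by_structure` and the `part` field are hypothetical names.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def n_tokens(text: str) -> int:
    # Whitespace word count as a rough stand-in for a tokenizer (assumption).
    return len(text.split())


def chunk_by_structure(markdown: str, doc_id: str, max_tokens: int = 800) -> List[Dict]:
    # 1) Group lines under the heading that precedes them.
    sections: List[Tuple[str, str]] = []
    title, buf = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if any(l.strip() for l in buf):
                sections.append((title, "\n".join(buf).strip()))
            title, buf = match.group(1).strip(), []
        else:
            buf.append(line)
    if any(l.strip() for l in buf):
        sections.append((title, "\n".join(buf).strip()))

    # 2) Keep short sections whole; pack paragraphs of long sections up to max_tokens.
    #    A single paragraph longer than max_tokens is kept as one oversized piece.
    chunks: List[Dict] = []
    for title, body in sections:
        if n_tokens(body) <= max_tokens:
            pieces = [body]
        else:
            pieces, current = [], []
            for para in re.split(r"\n\s*\n", body):
                if current and n_tokens("\n\n".join(current + [para])) > max_tokens:
                    pieces.append("\n\n".join(current))
                    current = []
                current.append(para)
            if current:
                pieces.append("\n\n".join(current))
        for part, text in enumerate(pieces, start=1):
            chunks.append({"doc_id": doc_id, "section": title, "part": part, "text": text})
    return chunks
```

Because every chunk carries its section title, a retrieved chunk can be cited back to the exact place in the source document.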
**A concrete benchmark:** 12,000 legal documents, fixed 500-token vs document-aware chunking.
- Fixed: 38,400 chunks, avg 487 tokens, precision@5 = 0.61
- Document-aware: 24,200 chunks, avg 612 tokens, precision@5 = 0.73
Setting 2: embedding model
The most underestimated decision. The client picks OpenAI `text-embedding-ada-002` from 2022 because "it's in the LangChain quickstart," and loses 15–20% of the precision they'd gain from a more modern model.
Top picks in 2026 — cost/quality/latency tradeoff
**OpenAI text-embedding-3-large**
- Dimension: 3,072 (reducible via Matryoshka representation to 256/512/1,024)
- MTEB score: ~64.6
- Price: 0.13 USD / M tokens
- Multilingual: good, but not the best for SK/CZ; in tests with Slovak legal texts precision@5 = 0.77
- Latency: ~80 ms per request (API)
- **When:** all-in cloud setup, English + common EU language mix

**Cohere embed-multilingual-v3.0**
- Dimension: 1,024
- MTEB score: ~63.8 (higher on the multilingual benchmark)
- Price: 0.10 USD / M tokens
- Multilingual: excellent for 100+ languages, especially strong for Eastern European languages (SK, CZ, HU, RO)
- Latency: ~60 ms per request
- **When:** multilingual knowledge base, EU compliance (Cohere has an EU region endpoint)

**sentence-transformers/all-mpnet-base-v2**
- Dimension: 768
- MTEB score: ~57.8
- Price: self-hosted (CPU/GPU)
- Multilingual: weak (primarily EN)
- Latency: ~10 ms on CPU, ~3 ms on GPU
- **When:** budget setup, EN-only, off-line

**BGE-M3 (BAAI/bge-m3)**
- Dimension: 1,024 (dense), plus sparse + colBERT-style multivector
- MTEB score: ~66.1 (multilingual benchmark)
- Price: self-hosted
- Multilingual: excellent for 100+ languages including SK
- Latency: ~15 ms on RTX 4090
- **When:** SOTA choice for multilingual + self-hosted. **Our default in 2026 for EU clients.**

**Voyage AI voyage-3-large**
- Dimension: 1,024
- MTEB score: ~68.2 (one of the highest in 2026)
- Price: 0.18 USD / M tokens
- Multilingual: excellent
- **When:** premium cloud, where every 1% of precision counts
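The Matryoshka reduction mentioned for text-embedding-3-large amounts to keeping the leading components of the vector and re-normalising. The sketch below shows that operation on a plain list of floats. It only makes sense for models trained with Matryoshka representation learning, and `truncate_embedding` is a hypothetical helper (the OpenAI API can also do this server-side through its `dimensions` request parameter).

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalised
    return [x / norm for x in head]


# Toy 4-dimensional vector; a real one would have e.g. 3,072 components.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)
print(len(short), round(sum(x * x for x in short), 6))
```

Re-normalising matters because cosine or dot-product search assumes unit-length vectors, and a truncated prefix is shorter than the original.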
**A concrete benchmark:** the same legal corpus (8M tokens), document-aware chunking, no reranker.
- OpenAI text-embedding-3-large: precision@5 = 0.77, latency 80 ms
- Cohere embed-multilingual-v3: precision@5 = 0.79, latency 60 ms
- BGE-M3 (self-hosted): precision@5 = 0.81, latency 15 ms
- Voyage-3-large: precision@5 = 0.82, latency 95 ms
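For Slovak legal content the gap between the `text-embedding-ada-002` baseline (0.61, fixed-size chunking) and BGE-M3 (0.81, document-aware chunking) is **20 percentage points of precision**. Assuming the chunking benchmark above was run with ada-002, about 12 of those points come from chunking (0.61 → 0.73) and 8 from the embedding swap (0.73 → 0.81). The model swap is still one of the simplest changes with the highest impact.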
Setting 3: reranker
The most underrated pipeline component. The architecture: retrieval returns top-K candidates (typically K = 20–50) via embedding similarity, the **reranker** rescores them with a cross-encoder model (slower but more accurate) and picks top-N (typically N = 5) for the LLM context.
Why it works: a bi-encoder (fast retrieval) is "lossy" because it compresses each document into a single vector before it ever sees the query. A cross-encoder reranker (slower) reads the query and the document together and scores their relevance with attention across the tokens of both. Reranking the top 20 adds 50–200 ms of latency, but precision@5 typically rises by 12–18%.
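The two-stage flow can be sketched with the scoring functions left pluggable. Everything here is illustrative: `retrieve_then_rerank` is a hypothetical helper, the vectors are toy values, and `word_overlap` is only a placeholder where a real cross-encoder call (Cohere Rerank, BGE reranker) would go.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: List[str],
    doc_vecs: List[Sequence[float]],
    rerank_score: Callable[[str, str], float],
    k: int = 20,
    n: int = 5,
) -> List[str]:
    # Stage 1: cheap vector similarity over the whole corpus, keep the top K.
    by_similarity = sorted(
        range(len(docs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
    )
    candidates = by_similarity[:k]
    # Stage 2: expensive query+document scoring on the K candidates only, keep the top N.
    rescored = sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in rescored[:n]]


def word_overlap(query: str, doc: str) -> float:
    # Toy placeholder for a cross-encoder: counts shared lowercase words.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = ["notice period for termination", "termination of lease", "holiday pay"]
doc_vecs = [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
print(retrieve_then_rerank("notice period termination", [1.0, 0.0], docs, doc_vecs,
                           word_overlap, k=2, n=1))
```

The reranker never sees documents outside the top K, so recall at stage 1 caps what stage 2 can achieve. That is why K is set well above N.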
Top picks in 2026
**Cohere Rerank 3**
- Price: 2.00 USD / 1,000 search units (1 unit = 1 query + up to 100 documents)
- Multilingual: excellent
- Latency: ~100–150 ms per query (top-20 rerank)
- API: simple, EU endpoint available
- **When:** cloud setup, multilingual content, premium quality

**BAAI/bge-reranker-v2-m3 (BGE Reranker v2 M3)**
- Price: self-hosted
- Multilingual: excellent (same training as BGE-M3)
- Latency: ~80–120 ms on RTX 4090 for top-20
- **When:** self-hosted setup, SOTA open-source choice for multilingual

**cross-encoder/ms-marco-MiniLM-L-6-v2**
- Price: self-hosted, small model (~22M parameters)
- Multilingual: weak (EN-trained)
- Latency: ~30 ms on GPU
- **When:** budget setup, EN-only

**No reranker (baseline)**
- Price: zero
- **When:** PoC, low volume, latency below 200 ms is a hard requirement
A concrete benchmark — adding a reranker to an existing pipeline
Legal RAG, BGE-M3 retrieval, top-20 → top-5:
- Without reranker: precision@5 = 0.81, latency 250 ms
- Cohere Rerank 3: precision@5 = 0.88, latency 380 ms
- BGE-Reranker-v2-m3 (self-hosted): precision@5 = 0.89, latency 340 ms

**A reranker lifts precision@5 by 7–8 percentage points at the price of 90–130 ms of added latency.** For regulated knowledge bases (law, medicine, finance) it always pays off. For low-stakes chatbots (FAQ) a reranker can be unnecessary; it depends on the cost of one wrong answer.
Hybrid retrieval — when both keywords and semantics matter
For content where **keywords carry weight by themselves** (statute numbers, paragraphs, GUIDs, standard numbers, exact product names), a purely dense embedding retrieval can fail. The embedding captures semantic similarity, but a literal match for "§ 271 paragraph 2" can get lost in semantic noise.
**Solution: hybrid retrieval.** Run BM25 (sparse, keyword-based) and dense embedding retrieval in parallel, then merge the two result lists (typically with `reciprocal_rank_fusion`).
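Reciprocal rank fusion itself is small enough to show in full. The sketch below is a generic implementation, not the code of any particular vector database: each document scores the sum of 1 / (k + rank) over the lists it appears in, with k = 60 as the commonly used constant. The document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists; a higher fused score means a better final rank."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by document ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["s271", "s12", "s63"]    # keyword ranking (hypothetical IDs)
dense_hits = ["s63", "s271", "s48"]   # embedding ranking (hypothetical IDs)
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

RRF uses only ranks, so it needs no score normalisation between BM25 and cosine similarity, which live on very different scales.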
The 2026 stack: Qdrant 1.10+ with native sparse and dense vectors, OpenSearch with its k-NN vector search, or the Weaviate hybrid search API. Where the engine fuses by weighted score rather than by rank, an alpha parameter sets the weight (0 = pure BM25, 1 = pure dense). 0.5 is a neutral starting point, and engine defaults differ.
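What alpha does can be illustrated with a weighted score fusion. This is one common scheme (min-max normalisation followed by a convex combination), written as a generic sketch; it is not claimed to match how any specific engine computes its hybrid score. `weighted_fusion` and the scores are hypothetical.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine values become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(bm25: Dict[str, float], dense: Dict[str, float], alpha: float = 0.5) -> List[str]:
    """alpha = 0 ranks by BM25 only, alpha = 1 by dense similarity only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    b, d = min_max(bm25), min_max(dense)
    fused = {
        doc: (1.0 - alpha) * b.get(doc, 0.0) + alpha * d.get(doc, 0.0)
        for doc in set(b) | set(d)
    }
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


bm25_scores = {"s271": 10.0, "s12": 2.0}
dense_scores = {"s12": 0.9, "s48": 0.1}
for alpha in (0.0, 0.5, 1.0):
    print(alpha, weighted_fusion(bm25_scores, dense_scores, alpha))
```

Sweeping alpha on the eval set is a cheap experiment, since the two result lists only need to be computed once.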
**Hybrid retrieval benchmark on the legal corpus:**
- BM25 only: precision@5 = 0.68 (great for "§ 271", weak for "what applies on termination for health reasons")
- Dense (BGE-M3) only: precision@5 = 0.81
- Hybrid (alpha=0.5): precision@5 = 0.87
For legal / medical / technical standards content hybrid is significantly better. For conversational FAQ content the gap shrinks to 1–2%.
A practical decision framework
1. **What kind of content?** Conversational FAQ / blog posts → dense embedding is enough. Structured document with sections / paragraphs → document-aware chunking is mandatory.
2. **What language?** EN-only → MTEB top performer (Voyage, BGE-M3, OpenAI). Multilingual EU languages → BGE-M3, Cohere multilingual-v3.
3. **What budget?** Cloud-only → OpenAI 3-large + Cohere Rerank 3. Self-hosted → BGE-M3 + BGE-Reranker-v2-m3.
4. **What latency do you tolerate?** < 200 ms → no reranker or a small cross-encoder. 200–500 ms → full pipeline with reranker.
5. **Are keywords critical?** Yes → hybrid BM25 + dense. No → pure dense.
6. **What query volume?** Above 100 RPS a self-hosted stack pays back in 4–8 months.
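The first five questions can be encoded as a small lookup, which is handy as a starting point in a design review. The sketch below is one possible reading of the framework, not a definitive rule set: `Requirements` and `recommend` are hypothetical names, the query-volume question is left out, and the way the language and budget answers are combined (cloud plus multilingual maps to the Cohere embedding model) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Requirements:
    structured_docs: bool      # sections / paragraphs rather than FAQ or blog text
    multilingual: bool         # EU languages beyond English
    self_hosted: bool          # False means cloud-only
    latency_budget_ms: int     # end-to-end retrieval budget
    keywords_critical: bool    # statute numbers, GUIDs, exact product names


def recommend(req: Requirements) -> Dict[str, str]:
    if req.self_hosted:
        embedding, reranker = "BGE-M3", "BGE-Reranker-v2-m3"
    elif req.multilingual:
        # Assumption: combines the language and budget answers.
        embedding, reranker = "Cohere embed-multilingual-v3.0", "Cohere Rerank 3"
    else:
        embedding, reranker = "OpenAI text-embedding-3-large", "Cohere Rerank 3"
    if req.latency_budget_ms < 200:
        reranker = "none or a small cross-encoder"
    return {
        "chunking": "document-aware" if req.structured_docs else "fixed-size or semantic",
        "embedding": embedding,
        "reranker": reranker,
        "retrieval": "hybrid BM25 + dense" if req.keywords_critical else "pure dense",
    }


legal_rag = Requirements(structured_docs=True, multilingual=True, self_hosted=True,
                         latency_budget_ms=400, keywords_critical=True)
print(recommend(legal_rag))
```

Treat the output as a first configuration to benchmark, since the right answer still depends on measurements against your own eval set.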
Practical iteration
Tuning RAG isn't a "set once, forget" project. The iteration stack that has worked for us:
1. **Eval set:** 200+ realistic queries + manually annotated "ideal" source documents. Without this you can't measure improvement.
2. **Baseline:** start with a simple pipeline (fixed chunking + one embedding model + no reranker). Its score is the baseline.
3. **Change one parameter at a time:** chunking → embedding → reranker → hybrid. After each change measure precision@5, recall@10, latency.
4. **Stop when marginal cost > marginal benefit.** Typically, pushing precision@5 above 0.90 needs disproportionate effort: fine-tuned embeddings or a custom reranker, which do not pay off for use cases that tolerate an occasional wrong answer.

In a legal RAG over 12,000 documents we lifted precision@5 from 67% to 88% in 4 weeks of work. Another 4 weeks would probably have brought us to 90%. The client chose to stop: 88% was operationally sufficient, and the extra GPU and engineer time would not have paid for itself.
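The measurement step needs very little code. Below is a minimal sketch of precision@k and recall@k averaged over an eval set. It assumes the eval set is a list of (query, set of relevant document IDs) pairs and that `search` is whatever function wraps your pipeline and returns ranked, de-duplicated IDs. Both names are placeholders.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k results that are annotated as relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the annotated relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(
    eval_set: List[Tuple[str, Set[str]]],
    search: Callable[[str], List[str]],
    k_precision: int = 5,
    k_recall: int = 10,
) -> Dict[str, float]:
    if not eval_set:
        raise ValueError("eval_set is empty")
    p_sum = r_sum = 0.0
    for query, relevant in eval_set:
        retrieved = search(query)
        p_sum += precision_at_k(retrieved, relevant, k_precision)
        r_sum += recall_at_k(retrieved, relevant, k_recall)
    n = len(eval_set)
    return {"precision": p_sum / n, "recall": r_sum / n}
```

Run `evaluate` once for the baseline and again after every single-parameter change, and log latency alongside, so each step of the iteration has a number attached.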
---
*We do RAG architecture + tuning for B2B knowledge bases with 1k–10M documents. If you have an existing pipeline and precision@5 has stalled below 75%, we'll walk through these three settings on your content in a 90-minute audit and give you a numerical baseline for a tuning roadmap.*