In RAG (Retrieval-Augmented Generation), most debugging effort in 2026 still goes into the model's prompts and system instructions. The client rewrites the prompt 12 times over 4 days, and answer precision improves from 67% to 70%. Meanwhile, three settings in the retrieval layer (chunking, embedding model, reranking) would push precision to 84% in half a day of work. This article is about those three settings.

Why retrieval decides, not the LLM

The basic asymmetry of RAG architecture: if retrieval returns the right pieces of documentation, even an average 7B model gives a good answer. If retrieval returns irrelevant or incomplete pieces, even Claude Opus 4.6 won't save the answer, because the model simply doesn't have the relevant facts in context. In RAG, "garbage in, garbage out" holds almost as strictly as a physical law.

A concrete example: a legal RAG over Slovak legislation, 12,000 documents, 8M tokens. With default chunking (500 tokens, fixed-size) and OpenAI text-embedding-ada-002 we measured precision@5 = 0.61 and recall@10 = 0.72. After changing three settings (document-aware chunking, embedding upgrade to BGE-M3, Cohere Rerank 3) we measured precision@5 = 0.84 and recall@10 = 0.93. No LLM change, no prompt change.

Setting 1: chunk size + chunking strategy

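The most common configuration in PoC projects is RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50), taken from LangChain examples. Note that chunk_size counts characters by default, so 500-token chunks require a token-based length function. This configuration works for about 60% of use cases. For the rest it's wrong.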
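What a size of 500 with an overlap of 50 means in practice can be illustrated with a simplified sliding window. This is only a sketch, not how RecursiveCharacterTextSplitter works internally (that class splits recursively on separators); the function name and the use of a pre-tokenized list are assumptions made for illustration.

```python
def fixed_size_chunks(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each window starts `step` tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Hypothetical usage: a 1,200-token document yields windows starting at 0, 450 and 900.
chunks = fixed_size_chunks(list(range(1200)))
print([len(c) for c in chunks])
```

The window boundaries ignore sentences and paragraphs entirely, which is exactly the weakness of fixed-size chunking.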

Two decision dimensions:

A) Chunk size (tokens)

128–256 tokens: high precision, retrieval finds exactly the paragraph that answers. Risk: the context around the paragraph is missing, so the model doesn't see the whole thread of thought. Suitable for FAQ, code snippets, structured data.
256–512 tokens: the most common compromise, a paragraph with surrounding context. The default choice for most B2B knowledge base deployments.
512–1,024 tokens: broader context, the model gets more connections, but retrieval precision drops (the embedding of a 1,024-token chunk "dilutes" the main theme). Suitable for longer narrative documents (legal decisions, research papers, technical manuals).
> 1,024 tokens: rarely correct. Many embedding models have an effective length of ~512 tokens, and most information past that threshold "dissolves" into one vector. The exception is long-context embedding models (Voyage-3-large, BGE-M3 in full mode, NV-Embed-v2), which handle inputs of 8,192 tokens or more.

B) Splitting strategy (fixed vs semantic vs document-aware)

Fixed-size chunking: simple and deterministic, but it breaks paragraphs, sentences, even words at the boundary. Context is lost on ~15% of chunks, which translates to an 8–12% drop in precision@5.
Semantic chunking: uses embeddings to detect boundaries of meaningful segments (e.g. SemanticSplitterNodeParser in LlamaIndex). Preserves context better, but chunks vary in size, so at inference you have to budget for 200–800 tokens per chunk. Improves precision@5 by 5–8% over fixed-size.
Document-aware chunking: exploits document structure (markdown headings, HTML <section>, PDF sections via Docling or Unstructured.io). Chunks correspond to the author's logical units. The best choice for 80% of B2B use cases, improving precision@5 by 12–18% over fixed-size.

A practical configuration for a legal RAG: Docling-parsed PDF → split by <heading> in document order → if a heading section exceeds 800 tokens, sub-split by paragraphs (\n\n) → attach metadata per chunk: {doc_id, section, page, jurisdiction, paragraph_number}. Use 10–20% overlap between adjacent chunks for continuity.
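The heading-then-paragraph split can be sketched in plain Python. The sketch assumes the parser output has already been exported as Markdown-style text with `#` headings, and it approximates token counts with whitespace-separated words. The function names are hypothetical, and the page and jurisdiction metadata as well as the overlap are left out for brevity.

```python
import re


def count_tokens(text):
    # Assumption: whitespace-separated words as a rough stand-in for model tokens.
    return len(text.split())


def document_aware_chunks(markdown, doc_id, max_tokens=800):
    """Split on headings; sub-split sections above max_tokens on blank lines."""
    chunks = []
    section, lines = "preamble", []

    def flush():
        body = "\n".join(lines).strip()
        if not body:
            return
        if count_tokens(body) <= max_tokens:
            parts = [body]
        else:
            # Oversized section: fall back to paragraph boundaries.
            parts = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
        for number, part in enumerate(parts, start=1):
            chunks.append({"doc_id": doc_id, "section": section,
                           "paragraph_number": number, "text": part})

    for line in markdown.splitlines():
        heading = re.match(r"^#{1,6}\s+(.*)$", line)
        if heading:
            flush()  # close the previous section before starting a new one
            section, lines = heading.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

A single paragraph longer than max_tokens stays in one piece here; a production version would need a further fallback (for example sentence splitting) for that case.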

A concrete benchmark: 12,000 legal documents, fixed 500-token vs document-aware chunking.
- Fixed: 38,400 chunks, avg 487 tokens, precision@5 = 0.61
- Document-aware: 24,200 chunks, avg 612 tokens, precision@5 = 0.73

Setting 2: embedding model

The most underestimated decision. The client picks OpenAI text-embedding-ada-002 from 2022 because "it's in the LangChain quickstart," and loses 15–20% of the precision they'd gain from a more modern model.

Top picks in 2026 — cost/quality/latency tradeoff

OpenAI text-embedding-3-large
- Dimension: 3,072 (reducible via Matryoshka representation to 256/512/1,024)
- MTEB score: ~64.6
- Price: 0.13 USD / M tokens
- Multilingual: good, but not the best for SK/CZ; in tests with Slovak legal texts precision@5 = 0.77
- Latency: ~80 ms per request (API)
- When: all-in cloud setup, English + common EU language mix

Cohere embed-multilingual-v3.0
- Dimension: 1,024
- MTEB score: ~63.8 (higher on the multilingual benchmark)
- Price: 0.10 USD / M tokens
- Multilingual: excellent for 100+ languages, especially strong for Eastern European languages (SK, CZ, HU, RO)
- Latency: ~60 ms per request
- When: multilingual knowledge base, EU compliance (Cohere has an EU region endpoint)

sentence-transformers/all-mpnet-base-v2
- Dimension: 768
- MTEB score: ~57.8
- Price: self-hosted (CPU/GPU)
- Multilingual: weak (primarily EN)
- Latency: ~10 ms on CPU, ~3 ms on GPU
- When: budget setup, EN-only, offline

BGE-M3 (BAAI/bge-m3)
- Dimension: 1,024 (dense), plus sparse + ColBERT-style multivector
- MTEB score: ~66.1 (multilingual benchmark)
- Price: self-hosted
- Multilingual: excellent for 100+ languages including SK
- Latency: ~15 ms on RTX 4090
- When: SOTA choice for multilingual + self-hosted. Our default in 2026 for EU clients.

Voyage AI voyage-3-large
- Dimension: 1,024
- MTEB score: ~68.2 (one of the highest in 2026)
- Price: 0.18 USD / M tokens
- Multilingual: excellent
- When: premium cloud, where every 1% of precision counts
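The Matryoshka reduction mentioned for text-embedding-3-large amounts to keeping the leading dimensions of the vector and re-normalizing it. A minimal sketch, assuming the embedding is available as a plain list of floats (the function name is hypothetical; hosted APIs may also offer this reduction as a request parameter):

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize in an all-zero vector
    return [x / norm for x in head]


# Toy 3-dimensional example; a real vector would go from 3,072 to e.g. 256 dims.
print(truncate_embedding([3.0, 4.0, 12.0], 2))
```

Re-normalizing matters because cosine or dot-product search assumes unit-length vectors. Truncation only preserves quality for models trained with a Matryoshka objective.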

A concrete benchmark: the same legal corpus (8M tokens), document-aware chunking, no reranker.
- OpenAI text-embedding-3-large: precision@5 = 0.77, latency 80 ms
- Cohere embed-multilingual-v3: precision@5 = 0.79, latency 60 ms
- BGE-M3 (self-hosted): precision@5 = 0.81, latency 15 ms
- Voyage-3-large: precision@5 = 0.82, latency 95 ms

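For Slovak legal content the gap between the original setup (text-embedding-ada-002 with fixed chunking, 0.61) and BGE-M3 with document-aware chunking (0.81) is 20 percentage points of precision@5. Since the chunking benchmark puts document-aware chunking alone at 0.73, roughly 12 of those points come from chunking and 8 from the embedding swap, which is still the simplest change relative to its impact.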

Setting 3: reranker

The most underrated pipeline component. The architecture: retrieval returns top-K candidates (typically K = 20–50) via embedding similarity, the reranker rescores them with a cross-encoder model (slower but more accurate) and picks top-N (typically N = 5) for the LLM context.

Why it works: bi-encoder embedding (fast retrieval) is "lossy" because it compresses each document into a single vector, independently of the query. A cross-encoder reranker (slower) reads the query and the document together and scores their relevance from token-level interactions. Reranking the top 20 adds 50–200 ms of latency, but precision@5 typically rises by 12–18%.
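The two-stage shape of the pipeline can be sketched without any model at all. In the sketch below a bag-of-words cosine stands in for the bi-encoder, and the reranker is an arbitrary callable that sees the query and the document together. Both are toy assumptions chosen so the example runs on its own; in a real pipeline they would be an embedding index and a cross-encoder.

```python
import math
from collections import Counter


def bow_vector(text):
    # Toy stand-in for an embedding: word counts, computed independently of the query.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query, documents, rerank_score, top_k=20, top_n=5):
    """Stage 1: cheap similarity picks top_k. Stage 2: rerank_score(query, doc) picks top_n."""
    q = bow_vector(query)
    candidates = sorted(documents, key=lambda d: cosine(q, bow_vector(d)), reverse=True)[:top_k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]


# Hypothetical reranker: rewards an exact phrase match, which word counts cannot see.
def phrase_match(query, doc):
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = ["period notice", "the notice period is two months", "income tax rates"]
print(retrieve_then_rerank("notice period", docs, phrase_match, top_k=2, top_n=1))
```

The first stage ranks "period notice" highest because word order is lost in the vector. The second stage corrects this because it compares query and document jointly.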

Top picks in 2026

Cohere Rerank 3
- Price: 2.00 USD / 1,000 search units (1 unit = 1 query + up to 100 documents)
- Multilingual: excellent
- Latency: ~100–150 ms per query (top-20 rerank)
- API: simple, EU endpoint available
- When: cloud setup, multilingual content, premium quality

BAAI/bge-reranker-v2-m3 (BGE Reranker v2 M3)
- Price: self-hosted
- Multilingual: excellent (same training as BGE-M3)
- Latency: ~80–120 ms on RTX 4090 for top-20
- When: self-hosted setup, SOTA open-source choice for multilingual

cross-encoder/ms-marco-MiniLM-L-6-v2
- Price: self-hosted, small model (about 22M parameters)
- Multilingual: weak (EN-trained)
- Latency: ~30 ms on GPU
- When: budget setup, EN-only

No reranker (baseline)
- Price: zero
- When: PoC, low volume, latency below 200 ms is a hard requirement
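To put the per-unit price of a hosted reranker into monthly terms, the arithmetic can be written down directly. The sketch uses the Cohere Rerank 3 figures quoted above and assumes that a query with more than 100 candidate documents is billed as several search units; that billing rule and the function name are assumptions to be checked against the current price list.

```python
import math


def monthly_rerank_cost(queries_per_month, docs_per_query,
                        usd_per_1000_units=2.00, docs_per_unit=100):
    """Estimated monthly cost in USD for reranking every query."""
    # Assumption: each started block of `docs_per_unit` documents counts as one unit.
    units_per_query = max(1, math.ceil(docs_per_query / docs_per_unit))
    return queries_per_month * units_per_query * usd_per_1000_units / 1000


# One million queries per month with a top-20 rerank.
print(monthly_rerank_cost(1_000_000, 20))
```

With top-20 reranking every query fits into a single unit, so the cost scales linearly with query volume only.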

A concrete benchmark — adding a reranker to an existing pipeline

Legal RAG, BGE-M3 retrieval, top-20 → top-5:
- Without reranker: precision@5 = 0.81, latency 250 ms
- Cohere Rerank 3: precision@5 = 0.88, latency 380 ms
- BGE-Reranker-v2-m3 (self-hosted): precision@5 = 0.89, latency 340 ms

A reranker lifts precision@5 by 7–8 percentage points at the price of 90–130 ms of added latency. For regulated knowledge bases (law, medicine, finance) it always pays off. For low-stakes chatbots (FAQ) a reranker can be unnecessary; it depends on the cost of one wrong answer.

Hybrid retrieval — when both keywords and semantics matter

For content where keywords carry weight by themselves (statute numbers, paragraphs, GUIDs, standard numbers, exact product names), a purely dense embedding retrieval can fail. The embedding captures semantic similarity, but a literal match for "§ 271 paragraph 2" can get lost in semantic noise.

Solution: hybrid retrieval runs BM25 (sparse, keyword-based) and dense embedding retrieval in parallel and merges the results, typically with Reciprocal Rank Fusion (RRF).
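Reciprocal rank fusion itself is small enough to write out. Each document receives 1 / (k + rank) from every list it appears in, and the sums decide the final order. The sketch below assumes each retriever returns a ranked list of document IDs; k = 60 is the commonly used constant, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["a", "b", "c"]    # hypothetical keyword ranking
dense_hits = ["b", "c", "d"]   # hypothetical embedding ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # ['b', 'c', 'a', 'd']
```

Documents found by both retrievers ("b" and "c") rise above documents found by only one, and no score normalization is needed because only ranks are used.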

The 2026 stack: Qdrant 1.10+ with native sparse + dense vectors, OpenSearch with its k-NN vector search, or the Weaviate hybrid search API. Where the engine exposes weighted fusion, an alpha parameter sets the balance (0 = pure BM25, 1 = pure dense); 0.5 is a sensible starting point.
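How an alpha weight blends the two retrievers can be sketched as a weighted sum of normalized scores. This is a generic illustration, not the exact formula of any particular engine: the min-max normalization, the treatment of missing documents as zero, and the function names are all assumptions.

```python
def min_max(scores):
    """Rescale raw scores to [0, 1] so BM25 and cosine scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def alpha_fusion(bm25_scores, dense_scores, alpha=0.5):
    """alpha = 0 is pure BM25, alpha = 1 is pure dense, as in the text above."""
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * bm25_n.get(doc, 0.0)
        for doc in set(bm25_n) | set(dense_n)
    }
    # Sort by fused score, breaking ties by doc ID for deterministic output.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


bm25 = {"a": 10.0, "b": 2.0}   # hypothetical raw BM25 scores
dense = {"b": 0.9, "c": 0.5}   # hypothetical cosine similarities
print(alpha_fusion(bm25, dense, alpha=0.0), alpha_fusion(bm25, dense, alpha=1.0))
```

Sweeping alpha over the eval set is usually a better guide than any fixed value, since the right balance depends on how keyword-heavy the queries are.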

Hybrid retrieval benchmark on the legal corpus:
- BM25 only: precision@5 = 0.68 (great for "§ 271", weak for "what applies on termination for health reasons")
- Dense (BGE-M3) only: precision@5 = 0.81
- Hybrid (alpha=0.5): precision@5 = 0.87

For legal / medical / technical standards content hybrid is significantly better. For conversational FAQ content the gap shrinks to 1–2%.

A practical decision framework

1. What kind of content? Conversational FAQ / blog posts → dense embedding is enough. Structured document with sections / paragraphs → document-aware chunking is mandatory.
2. What language? EN-only → MTEB top performer (Voyage, BGE-M3, OpenAI). Multilingual EU languages → BGE-M3, Cohere multilingual-v3.
3. What budget? Cloud-only → OpenAI 3-large + Cohere Rerank 3. Self-hosted → BGE-M3 + BGE-Reranker-v2-m3.
4. What latency do you tolerate? < 200 ms → no reranker or a small cross-encoder. 200–500 ms → full pipeline with reranker.
5. Are keywords critical? Yes → hybrid BM25 + dense. No → pure dense.
6. What query volume? Above 100 RPS a self-hosted stack pays back in 4–8 months.

Practical iteration

Tuning RAG isn't a "set once, forget" project. The iteration stack that has worked for us:

1. Eval set: 200+ realistic queries with manually annotated "ideal" source documents. Without this you can't measure improvement.
2. Baseline: start with a simple pipeline (fixed chunking + one embedding model + no reranker) and record its score as the baseline.
3. Change one parameter at a time: chunking → embedding → reranker → hybrid. After each change measure precision@5, recall@10, latency.
4. Stop when marginal cost > marginal benefit. Typically, getting precision@5 above 0.90 needs disproportionate effort (fine-tuned embeddings or a custom reranker), which doesn't pay off for use cases that tolerate occasional misses.
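The measurement step in this loop needs only a few lines. The sketch below assumes the eval set is a list of (query, set of relevant document IDs) pairs and that `search` is any callable returning a ranked list of IDs; both shapes and all names are assumptions for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k results that are annotated as relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Share of the annotated relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def evaluate(eval_set, search, k_precision=5, k_recall=10):
    """Average precision@k and recall@k over all annotated queries."""
    precisions, recalls = [], []
    for query, relevant in eval_set:
        retrieved = search(query)
        precisions.append(precision_at_k(retrieved, relevant, k_precision))
        recalls.append(recall_at_k(retrieved, relevant, k_recall))
    n = len(eval_set)
    return {"precision": sum(precisions) / n, "recall": sum(recalls) / n}


# Hypothetical two-query eval set with a canned "search engine".
canned = {"q1": ["d1", "d2", "d3", "d4", "d5"], "q2": ["d7", "d8"]}
eval_set = [("q1", {"d1", "d3", "d9"}), ("q2", {"d7"})]
print(evaluate(eval_set, canned.get))
```

Running the same `evaluate` call after every single-parameter change keeps the comparisons honest, because the queries and annotations stay fixed while the pipeline varies.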

In a legal RAG over 12,000 documents we lifted precision@5 from 61% to 88% in 4 weeks of work. Another 4 weeks would probably have brought us to 90%. The client chose to stop: 88% was operationally sufficient, and the extra GPU and engineer time would not have paid for itself.

---

*We do RAG architecture + tuning for B2B knowledge bases with 1k–10M documents. If you have an existing pipeline and precision@5 has stalled below 75%, we'll walk through these three settings on your content in a 90-minute audit and give you a numerical baseline for a tuning roadmap.*


RAG Pipeline — 3 Settings That Decide Quality
