Vector retrieval is the de facto standard for RAG deployments in 2026. Most teams set up an embedding model, populate Qdrant or pgvector, and ship to production. It works, until a query like "§ 271 par. 2 of the Labour Code" or "HMI model 7NF-420-C" comes in. Dense embeddings can't handle this: they compress a document into a single vector and the exact string gets lost in the semantic average. This is exactly where hybrid search comes in: a combination of keyword retrieval (BM25) and vector search, with a reranker on top.
This article goes deeper than RAG pipeline — 3 quality settings, where we mentioned the reranker only in the context of the full pipeline. Here we focus exclusively on the search layer: why vectors alone are not enough, how hybrid search works mechanically, what RRF fusion is, when a reranker genuinely helps, and how to configure it all in practice.
Why vectors alone are not enough
A dense embedding model (e.g. BGE-M3 or text-embedding-3-large) converts text into a vector in a space of hundreds or thousands of dimensions. Vectors that are close together share similar semantic content. This works superbly for questions such as "what are the conditions for terminating employment", where meaning matters.
The problem arises with lexically specific queries:
- Standard and legislative numbers: § 63, EN ISO 12100, EU Regulation 2016/679
- Product and part codes: SKF-6204-2RS, Siemens SIMATIC S7-1500
- Names, abbreviations, acronyms: OHS, NBS, FFT analysis
- Exact values: 230 V AC, IP67, UL 508A
In these cases, embedding fails for a single reason: the model was not trained to recognise that "SKF-6204-2RS" and "SKF-6205-2RS" are fundamentally different items. The vectors may be close because the two strings share a prefix. The exact match is lost.
BM25 (Best Matching 25) does not have this problem. It is a classic statistical ranking function, a TF-IDF variant dating from 1994, that explicitly scores the occurrence of each query token in a document against that token's frequency across the corpus. For exact strings, BM25 is still better than any embedding, and that holds just as true in 2026 as it did in 2014.
From indicative benchmark tests on synthetic corpora:
- BM25 alone: ~58% precision
- Dense vectors alone: ~79%
- Hybrid (BM25 + dense): ~85–88%
- Hybrid + cross-encoder reranker: ~88–91%
Real-world improvement depends on the dataset, language and query type. For industrial documentation with codes and standards, in practice we see hybrid retrieval outperform dense retrieval by around 8–12 percentage points in precision@5.
How BM25 works
BM25 is a statistical model. For each document and query it computes a score by summing, over the tokens in the query, a saturating term-frequency component weighted by the token's inverse document frequency (IDF). Parameters:
- k1: controls term-frequency saturation (typically 1.2–2.0). Higher = more linear relationship between frequency and score.
- b: document length normalisation (0 = none, 1 = full). Default 0.75.
BM25 requires no GPU or special hardware: it is deterministic, fast, and runs on CPU. It is available through the rank_bm25 Python library, or natively in Weaviate, OpenSearch, Elasticsearch, and Qdrant (via sparse vectors).
Limitation: BM25 does not understand synonyms or paraphrases. The query "termination of employment" will not find a document that only talks about "dismissal from work". That is exactly what vectors are for.
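The scoring described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the whitespace tokenizer, the sample documents, and the particular IDF variant (with +1 inside the logarithm to keep it non-negative) are assumptions chosen for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption); it keeps codes such as
    # "SKF-6204-2RS" as a single token.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents in which each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls saturation, b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "Deep groove ball bearing SKF-6204-2RS sealed on both sides",
    "Deep groove ball bearing SKF-6205-2RS sealed on both sides",
    "Rules on dismissal from work and notice periods",
]
code_scores = bm25_scores("SKF-6204-2RS", docs)
synonym_scores = bm25_scores("termination of employment", docs)
```

With these sample documents, only the first one receives a non-zero score for the part code, while the paraphrased query scores zero everywhere, which mirrors both the strength and the limitation described above.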
Reciprocal Rank Fusion — how to merge two result sets
Once we have two separate rankings (BM25 and dense vectors), they need to be merged into one. The most common approach in 2026 is RRF (Reciprocal Rank Fusion).
The mechanics are simple. For each document we compute:
RRF_score(d) = sum_i( 1 / (k + rank_i(d)) )
where rank_i(d) is the position of the document in the i-th ranking and k is a smoothing parameter (typically 60). Documents that appear near the top in both rankings receive the highest score. A document that one system does not know but the other ranks first still gets a respectable score.
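As a rough sketch of the formula, the fusion can be written as a short function over lists of document IDs. The function name and the sample rankings are hypothetical; ranks are 1-based and a document missing from a ranking simply contributes nothing for that ranking.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here doc_b (ranks 2 and 1) narrowly beats doc_a (ranks 1 and 3), and both outrank the documents that appear in only one list.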
Why not a weighted average of scores? BM25 scores and embedding cosine similarities are on different scales — a straightforward weighting would favour one system over the other. RRF works only with rank positions, so scale is irrelevant.
An alternative to RRF: DBSF (Distribution-Based Score Fusion) — normalises scores before fusion, suitable when both systems are calibrated. In practice most implementations start with RRF and try DBSF only when they encounter distribution issues.
The alpha parameter in some frameworks (e.g. Weaviate) simply controls the BM25 vs. dense ratio — alpha=0 = pure BM25, alpha=1 = pure dense, alpha=0.5 = equal contribution from both. The default 0.5 is a good starting point, but for content with a high density of codes and standards try alpha=0.3–0.4 (greater BM25 contribution).
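To make the effect of alpha concrete, here is a simplified sketch of score-based weighting. It assumes min-max normalisation of each score set before mixing, which is one plausible way to put BM25 scores and cosine similarities on a common scale; it is not a reproduction of any framework's internal algorithm, and the function names and sample scores are invented for the illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def alpha_fuse(keyword_scores, dense_scores, alpha=0.5):
    """Mix normalised scores: alpha=0 is pure BM25, alpha=1 is pure dense."""
    keyword_n, dense_n = min_max(keyword_scores), min_max(dense_scores)
    doc_ids = set(keyword_n) | set(dense_n)
    fused = {
        d: (1 - alpha) * keyword_n.get(d, 0.0) + alpha * dense_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw scores for a query containing the code "SKF-6204-2RS".
bm25_raw = {"datasheet_6204": 12.4, "datasheet_6205": 3.1, "overview": 0.9}
dense_raw = {"datasheet_6204": 0.81, "datasheet_6205": 0.83, "overview": 0.78}

top_keyword_heavy = alpha_fuse(bm25_raw, dense_raw, alpha=0.35)[0][0]
top_pure_dense = alpha_fuse(bm25_raw, dense_raw, alpha=1.0)[0][0]
```

In this toy data the dense scores slightly prefer the wrong datasheet; lowering alpha lets the exact keyword match decide.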
Reranker — why retrieval is not enough
Hybrid search significantly improves recall — the probability that the right document is among the top-K candidates. But precision@5 (how many of the first 5 are genuinely relevant) depends on how well those candidates are ordered.
This is where the reranker (cross-encoder model) comes in. The mechanics:
1. Hybrid retrieval returns the top-K documents (typically K = 20–50). Fast.
2. The reranker receives each of the K documents together with the query, processes them jointly, and assigns a precise relevance score. Slower, but more accurate.
3. The top-N (typically N = 5) with the highest reranker scores go into the LLM context.
Why is a reranker more accurate than an embedding? A bi-encoder embedding compresses the document and the query separately into vectors — their interaction is evaluated only through cosine distance. A cross-encoder sees the query and document simultaneously, which allows it to capture fine-grained token-level interactions that the bi-encoder loses.
Typical precision@5 increase after adding a reranker: 7–12 percentage points. Additional latency: 50–200 ms for top-20 reranking.
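The two-stage flow can be sketched independently of any particular model. In the snippet below, `toy_score_pair` is a deliberately crude stand-in for a cross-encoder (it only counts shared tokens) so that the example runs without model downloads; in a real deployment that function would wrap a call to the reranker of your choice. All names here are hypothetical.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Score each (query, document) pair jointly and keep the top_n documents."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]

def toy_score_pair(query, doc):
    # Stand-in for a cross-encoder: fraction of query tokens found in the document.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

# Candidates as they might come back from hybrid retrieval (top-K).
candidates = [
    "bearing catalogue overview",
    "SKF-6204-2RS bearing datasheet",
    "SKF-6205-2RS bearing datasheet",
]
top = rerank("SKF-6204-2RS bearing datasheet", candidates, toy_score_pair, top_n=2)
```

Passing the scoring function as a parameter keeps the retrieval code unchanged when you swap a managed API for a self-hosted model.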
Current best-in-class options:
- Cohere Rerank 3.5 — managed API, excellent for multilingual content including Central Europe, easy integration
- BGE-reranker-v2-m3 — open-weight, self-hosted, multilingual, outstanding performance-to-cost ratio
- ms-marco-MiniLM-L-6-v2 — small, fast, English prototyping
The same principle that applies to choosing an embedding model applies here: multilingual EU content deserves a multilingual reranker.
When hybrid search helps significantly (and when it doesn't)
Hybrid search helps significantly when:
- The content contains codes, numbers, abbreviations, standards, product identifiers
- Users submit exact queries (copy-pasted from a document, part code from a purchase order)
- The language is not exclusively English — BM25 needs no language model and works just as well for other languages
- Recall is the problem — "the relevant document wasn't found" is a common complaint
- The corpus contains thematically very similar documents (e.g. 5,000 similarly structured technical datasheets)
Hybrid search helps less when:
- Queries are exclusively conversational and paraphrase-based (FAQ chatbot, HR responses)
- The corpus is small (under 500 documents) — the added latency and complexity are not worth it
- Latency is critical — BM25 scoring adds time on every call (typically +10–30 ms), and a reranker potentially adds +50–200 ms
- Content is primarily in English and the embeddings are high quality — dense vectors will approach hybrid retrieval results
We observed this at a spare-parts manufacturing client: a corpus of 18,000 technical datasheets with SKF, NSK, and FAG codes. Pure dense retrieval, precision@5 = 0.63. After switching to hybrid with alpha=0.35 and BGE-reranker-v2-m3: precision@5 = 0.84. The main reason — exact match on the part code in the customer's query.
Practical setup — 2026 stack
Qdrant (recommended for B2B industrial deployments)
Qdrant supports native hybrid search via sparse + dense vectors in a single collection. Sparse vectors for BM25 are generated with the SparseTextEmbedding class from the fastembed library, using the Qdrant/bm25 model.
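# initialise sparse encoder (BM25 term weights) and import the Qdrant models
from fastembed import SparseTextEmbedding
from qdrant_client import models
sparse_model = SparseTextEmbedding("Qdrant/bm25")
# `client` (a QdrantClient) and `embedding_model` (your dense encoder) are assumed to exist
# at indexing time: embed() takes a list of texts and yields one sparse embedding per text
sparse_vec = next(iter(sparse_model.embed([text])))
dense_vec = embedding_model.embed(text)
# at query time: encode the query string (query_text), not the document
sparse_query = next(iter(sparse_model.query_embed(query_text)))
dense_query = embedding_model.embed(query_text)
# query: hybrid search with RRF fusion
results = client.query_points(
    collection_name="documents",
    prefetch=[
        models.Prefetch(
            query=models.SparseVector(
                indices=sparse_query.indices.tolist(),
                values=sparse_query.values.tolist(),
            ),
            using="sparse",
            limit=20,
        ),
        models.Prefetch(query=dense_query, using="dense", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
)
Collection configuration: two vector indexes, dense (dimension according to the embedding model, cosine) and sparse (sparse index, dot). When using the Qdrant/bm25 model, enable the IDF modifier on the sparse vector configuration so that Qdrant applies the inverse-document-frequency weighting. Qdrant handles both in a single pass.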
Weaviate
Weaviate has hybrid search as a first-class feature — BM25 + dense vectors + RRF fusion in a single API call. Configure alpha directly in the query:
results = collection.query.hybrid(
    query="§ 63 Labour Code dismissal",
    alpha=0.4,  # greater BM25 contribution
    limit=20
)
Then apply a reranker via Cohere or a local cross-encoder on the top-20 results.
Reranking layer
Regardless of the vector database, the reranker is a separate step. Recommended setup:
- K = 20–30 candidates from hybrid retrieval
- Reranker on K candidates → N = 5 results for the LLM
- For self-hosted: BGE-reranker-v2-m3 on GPU; on CPU, expect roughly 500 ms per query for top-20
- For cloud: Cohere Rerank 3.5 API, EU endpoint
Adding a reranker to an existing hybrid retrieval setup is typically 20–40 lines of code. When evaluating a RAG pipeline with RAGAS, give this step its own precision@5 metric, measured before and after the reranker is added.
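A before/after precision@5 comparison needs very little code. The sketch below assumes you have, per query, a ranked list of document IDs and a hand-labelled set of relevant IDs; the IDs and rankings shown are made up for illustration.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the first k results that are labelled relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

relevant = {"d1", "d2", "d3", "d4"}
# Same candidates, ordered by hybrid retrieval alone and then by the reranker.
before_rerank = ["d1", "x1", "d2", "x2", "x3", "d3", "d4"]
after_rerank = ["d1", "d2", "d3", "x1", "d4", "x2", "x3"]

p_before = precision_at_k(before_rerank, relevant)
p_after = precision_at_k(after_rerank, relevant)
```

Averaging this value over a fixed set of labelled queries gives a single number to track each time the retrieval configuration changes.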
Latency: what each layer adds
Typical latencies at 10M vectors, RTX 4090 GPU (indicative):
- BM25 scoring on CPU: 10–30 ms
- Dense vector search (Qdrant): 5–15 ms
- RRF fusion: under 5 ms
- Reranker cross-encoder top-20 (GPU): 80–150 ms
- Reranker cross-encoder top-20 (CPU): 400–800 ms
- Cohere Rerank 3.5 API (network): 100–200 ms
Full pipeline: hybrid retrieval + GPU reranker = typically 200–400 ms end-to-end for the search layer. For interactive chatbots this is usually acceptable. For systems requiring under 150 ms, consider the reranker only when network latency is low, or replace it with a simpler bi-encoder re-sorting step.
A more detailed comparison of vector databases and their QPS parameters is available in Vector databases — Qdrant, Weaviate, pgvector, Milvus.
Frequently asked questions
Do I need to design the system with hybrid retrieval from the start, or can I add BM25 later?
The BM25 index is independent of the embedding index and can be added to an existing corpus at any time. In Qdrant it is sufficient to add a sparse vector index to the collection and reindex the documents — the dense indexes remain untouched. Weaviate, Milvus and pgvector follow a similar procedure. Reindexing time depends on corpus size, but for a typical B2B knowledge base (up to 1M documents) it takes hours, not days.
What is better for non-English content: BM25 or dense embedding?
It depends on the nature of the queries. BM25 works just as well for any language as it does for English — it is a statistical model with no language dependency. For semantic questions (paraphrases, synonyms), a multilingual dense embedding such as BGE-M3 or Qwen3-Embedding-8B is better. In practice for non-English B2B corpora: hybrid with alpha=0.4–0.5 gives the best results.
When does a reranker not help?
A reranker only reorders the documents that retrieval returned. If the right document is not in the top-K candidates, the reranker cannot rescue it. This means: fix recall first (hybrid retrieval, larger K), then add the reranker. If precision@5 stagnates even after reranking, the problem is usually recall — the right document simply is not among the candidates.
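One way to check whether recall is the bottleneck is to measure how often the known-correct document appears among the candidates at all. The following sketch assumes one gold document ID per test query; the function name and sample data are hypothetical.

```python
def hit_rate_at_k(candidate_lists, gold_ids, k):
    """Share of queries whose gold document appears in the first k candidates."""
    hits = sum(
        1 for candidates, gold in zip(candidate_lists, gold_ids)
        if gold in candidates[:k]
    )
    return hits / len(gold_ids)

# Retrieval output for three test queries, and the correct document for each.
candidate_lists = [
    ["a", "b", "c", "d"],
    ["e", "f", "g", "h"],
    ["i", "j", "k", "l"],
]
gold_ids = ["b", "h", "z"]

hit_rate_small_k = hit_rate_at_k(candidate_lists, gold_ids, k=2)
hit_rate_large_k = hit_rate_at_k(candidate_lists, gold_ids, k=4)
```

In this toy data a larger K recovers the second query, but the third query's document is never retrieved, so no reranker could surface it; that case needs a fix at the retrieval layer.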
What is the difference between hybrid search and agentic RAG?
Hybrid search is a technique at the retrieval layer — it improves which documents end up in the LLM context. Agentic RAG is an architectural pattern where an agent actively decides what to search for and how — iterative querying, multiple sources, sub-queries. Hybrid search and agentic RAG are not mutually exclusive: a well-built agent uses hybrid retrieval under the hood.
Can hybrid search be deployed on-prem without cloud dependencies?
Yes. Both Qdrant and Weaviate are open-source and run self-hosted. BGE-reranker-v2-m3 is an open-weight model that runs locally on GPU. BM25 requires no external service. The entire stack — hybrid retrieval + reranker — can run completely on-prem, which matters for regulated industries. More on the topic: on-prem LLM for regulated industries.
---
*If your RAG system is struggling to reliably find exact codes, standards or product identifiers, the problem is almost always a missing keyword layer. We help companies design and deploy hybrid retrieval stacks — from selecting the database and embedding model to configuring the reranker and measuring the improvement.*
