When a client comes to us with "deploy a RAG system over our technical documentation," the first debate almost always circles around the large language model: Claude or GPT? Llama or Mistral? Local or cloud? The embedding model gets left out of the conversation entirely — most teams grab OpenAI text-embedding-ada-002 because it appeared in the first quickstart they read, or whatever the framework they're using surfaces in its docs.
The reality from production: the embedding model is where 15–20% of total retrieval pipeline quality is decided. A poorly chosen model means relevant documents end up in the top-20 results but not the top-5 — and the LLM answers from irrelevant context. For non-English content this effect is more pronounced than for English, because most popular models are trained predominantly on EN data. This article gives a concrete selection framework, including what works and what doesn't for languages outside English.
What an embedding model does — and why the choice matters
An embedding model converts text (a document, a question, a sentence) into a vector — a list of numbers where similar texts have geometrically close vectors. Retrieval then finds vectors close to the query. Everything else in the RAG pipeline depends on whether "close in space" actually means "semantically relevant."
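To make "geometrically close" concrete, here is a minimal sketch of the usual closeness measure, cosine similarity. The three-dimensional vectors and document names are made up for illustration; a real model produces vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for this example.
query = [0.9, 0.1, 0.0]
documents = {
    "pump maintenance manual": [0.8, 0.2, 0.1],
    "holiday policy": [0.0, 0.1, 0.9],
}

# Retrieval = rank documents by similarity to the query vector.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

Whether the top-ranked document is actually the relevant one depends entirely on how the model placed the vectors, which is why the model choice matters.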
Two dimensions along which models differ:
Vector dimension (768, 1,024, 3,072, 4,096) — higher dimensions let the model capture more semantic information, but increase memory requirements, storage costs, and similarity search latency. Modern Matryoshka models allow dimension reduction after training (e.g. from 3,072 to 768) with minimal quality loss — relevant when scaling to tens of millions of vectors.
Context window — how many tokens the model effectively processes when embedding a single chunk. Older models had an effective window of around 512 tokens even at a nominal 8,192 limit; modern models (BGE-M3, Qwen3-Embedding, NV-Embed-v2) handle long documents without significant quality degradation. For a RAG pipeline with document-aware chunking this is directly relevant — if your chunks are 600–800 tokens, a model with an effective window of 512 tokens will truncate them.
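A quick sanity check for the truncation risk might look like the sketch below. The token estimate is a deliberate simplification: the 1.5 tokens-per-word ratio is an assumed placeholder, and for a real check you would count tokens with the tokenizer of the model you are evaluating (morphologically rich languages such as Slovak typically produce more tokens per word than English).

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.5) -> int:
    """Rough token estimate. tokens_per_word is an ASSUMED ratio, not a measured one."""
    return round(len(text.split()) * tokens_per_word)

def oversized_chunks(chunks: list[str], effective_window: int) -> list[int]:
    """Return the indices of chunks whose estimated length exceeds the effective window."""
    return [
        i for i, chunk in enumerate(chunks)
        if estimate_tokens(chunk) > effective_window
    ]

# Two synthetic chunks of 300 and 500 words (about 450 and 750 estimated tokens).
chunks = ["word " * 300, "word " * 500]
print(oversized_chunks(chunks, effective_window=512))
```

Any index returned here marks a chunk whose tail the model would likely ignore or truncate.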
MTEB: a useful reference, not dogma
MTEB (Massive Text Embedding Benchmark) is the most widely used benchmark for embedding models. It measures performance across dozens of tasks: retrieval, clustering, classification, semantic similarity. Results are publicly available on the Hugging Face leaderboard and are a good starting point.
Three limitations to keep in mind:
- MTEB is primarily English. A multilingual track exists, but it covers only certain languages; Slovak, for example, does not have its own standard MTEB language track. MTEB multilingual results are therefore indicative, not a guarantee of non-English performance.
- Benchmark data differs from your data. A model with an MTEB score of 70 can perform far worse on your specific domain content (technical documentation, legal texts, service manuals) than a model scoring 65 that was trained on similar content.
- Benchmark scores do not measure latency or cost. A model with MTEB 70 and 400 ms latency is a worse choice for a real-time application than a model with MTEB 67 and 15 ms.
MTEB is therefore a useful tool for building a shortlist of 3–5 candidates. Make the final decision based on tests against your own data.
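One way to turn the three limitations into a shortlisting step is to filter on your latency budget first and only then sort by benchmark score. The sketch below is illustrative: the model names are placeholders, and the scores and latencies reuse the example figures from the list above rather than measured values.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    mteb_score: float
    latency_ms: float

def shortlist(candidates: list[Candidate], max_latency_ms: float, size: int = 5) -> list[Candidate]:
    """Drop models over the latency budget, then keep the best-scoring remainder."""
    eligible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    return sorted(eligible, key=lambda c: c.mteb_score, reverse=True)[:size]

# Placeholder candidates; names and numbers are illustrative only.
candidates = [
    Candidate("model-a", 70.0, 400.0),
    Candidate("model-b", 67.0, 15.0),
    Candidate("model-c", 65.0, 30.0),
]
real_time = shortlist(candidates, max_latency_ms=50)
print([c.name for c in real_time])
```

The shortlist is the input to testing on your own data, not the decision itself.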
Open-weight models: when and why
For EU firms, the argument for a self-hosted embedding model is stronger than for the LLM itself. An embedding model:
- Runs on a standard GPU server. BGE-M3 runs comfortably on a typical GPU server, with latency in the tens of milliseconds per request. This is not the same hardware requirement as a 70B LLM.
- No data leakage. Documents stay in your infrastructure — relevant for regulated industries and GDPR compliance.
- Predictable cost. The amortised cost of a GPU server is fixed; cloud API cost scales linearly with volume.
- Customisation. Given enough domain data, you can fine-tune the model on your own texts — not possible with a cloud API.
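For the cost argument, a back-of-the-envelope break-even calculation helps. In the sketch below the server cost of 600 USD per month is a made-up placeholder, not a quote; the API price of 0.13 USD per million tokens is the text-embedding-3-large figure cited elsewhere in this article.

```python
def breakeven_tokens_per_month(
    server_cost_per_month: float,
    api_price_per_million_tokens: float,
) -> float:
    """Monthly token volume at which a fixed-cost server matches a pay-per-token API."""
    return server_cost_per_month / api_price_per_million_tokens * 1_000_000

# ASSUMPTION: 600 USD/month is a hypothetical amortised server cost for illustration.
volume = breakeven_tokens_per_month(
    server_cost_per_month=600.0,
    api_price_per_million_tokens=0.13,
)
print(f"Break-even at roughly {volume / 1e9:.1f} billion tokens per month")
```

Plug in your own server quote and expected volume; the comparison is only as good as those two inputs.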
BGE-M3 (BAAI/FlagEmbedding) is the production standard for open-weight multilingual deployments in 2026. It combines three retrieval modes in a single pass: dense (semantic), sparse (keyword-based BM25-style), and multi-vector (ColBERT-style, more precise). 100+ languages. Context window of 8,192 tokens. Dimension 1,024 (dense). This is our internal default for EU clients with on-premises deployments.
Qwen3-Embedding (a model family from Alibaba, including an 8B variant) achieves some of the highest scores on the MTEB multilingual leaderboard in 2026: around 70.58 for Qwen3-Embedding-8B. Flexible Matryoshka dimension (32–4,096), long context window of 32,768 tokens. For non-English retrieval, this is currently the strongest open-weight candidate if you have sufficient hardware (the 8B model requires on the order of 16 GB VRAM for the weights at 16-bit precision, less with quantisation).
Llama-Embed-Nemotron-8B (NVIDIA) sits at the top of the multilingual MTEB leaderboard (250+ languages, open-weight, free). If you have NVIDIA hardware and need maximum scores in the open-weight category, this is a strong candidate.
For rapid prototyping or low-cost deployments, smaller models from the sentence-transformers family are sufficient: all-mpnet-base-v2 for English content, or paraphrase-multilingual-mpnet-base-v2 for multilingual content. The non-English performance of the multilingual variant is significantly lower than that of BGE-M3.
Cloud API models: when they make sense
Cloud embedding APIs (OpenAI, Google, Cohere, Voyage AI) make sense in three situations:
1. You don't have your own GPU. Deploying on top of a cloud API is simpler, with no hardware management.
2. Calls are intermittent and volume is low. With a few thousand requests per day, owning a server is hard to amortise.
3. Multimodal requirements. If you are embedding a mix of text and images (e.g. catalogues with technical drawings), cloud models like Cohere Embed v4 are ahead in this area.
OpenAI text-embedding-3-large (3,072 dimensions, Matryoshka, ~$0.13/1M tokens) is a reliable, well-documented choice for English content. For non-English content, performance is somewhat lower than multilingual-optimised models.
OpenAI text-embedding-3-small (~$0.02/1M tokens) is attractive on cost — for English it offers a good performance-to-price ratio, but for multilingual use we recommend 3-large or switching to Cohere.
Cohere Embed v4 stands out for two features: a context window of 128,000 tokens (extremely long documents without chunking) and native multimodal support (text + images). Price ~$0.12/1M tokens. For companies embedding technical documentation that includes images or diagrams, this is a relevant combination.
Gemini Embedding 001 (Google) holds one of the highest MTEB English scores in 2026 (~68), with Matryoshka support from 768 to 3,072 dimensions. Price ~$0.15/1M tokens. For English retrieval this is a strong cloud choice; for non-English content the same caveat as for OpenAI models applies.
Non-English content: what works and what doesn't
Slovak does not have a standalone language track in the standard MTEB benchmark, and verified Slovak-specific benchmarks for embedding models are not publicly available. What we know from production experience and related benchmarks (MIRACL, MKQA):
- Models trained primarily on English (older ada-002, all-MiniLM) perform significantly below their EN benchmark score on Slovak texts.
- BGE-M3, Qwen3-Embedding, and Llama-Embed-Nemotron cover Slovak as part of their multilingual training data; their performance is close to that on related Slavic languages (Czech, Polish), which works well in practice.
- For Slovak technical documentation (engineering manuals, electrical designs, ČSN/STN standards) we ran internal tests on BGE-M3 vs. OpenAI text-embedding-3-large — BGE-M3 consistently showed 8–12% higher precision@5. It's not a dramatic difference, but it compounds with more complex content.
- If you have enough Slovak domain data (~5,000+ documents), you can fine-tune an embedding model on your content using the sentence-transformers library. For regulated industries (law, medicine) this can push precision another 5–10%.
For hybrid search (BM25 + vectors), the BM25 keyword layer is more important for Slovak content with precise terminology (standard numbers, legal paragraph references, part codes) than it is for English — the embedding model can normalise morphological forms ("pohonu" vs "pohon"), but BM25 captures exact text strings more reliably.
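The article does not prescribe how to merge the two result lists. One common option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as our fixed method. The document ids are invented, and k = 60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for one query from the keyword layer and the vector layer.
bm25_ranking = ["stn-33-2000", "pump-manual", "safety-sheet"]
dense_ranking = ["pump-manual", "drive-overview", "stn-33-2000"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF only looks at ranks, so it needs no score normalisation between BM25 and cosine similarity; documents found by both layers rise to the top.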
Dimension vs. quality vs. cost: a practical framework
Higher dimension does not equal better performance. Matryoshka models enable training at full dimension (3,072 or 4,096) and inference at reduced dimension (256, 512, 768) — quality loss is minimal and the gain in speed and storage cost is real.
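Mechanically, Matryoshka reduction is just truncation followed by re-normalisation, as in the sketch below. This only preserves quality for models trained with Matryoshka representation learning; truncating the vectors of an ordinary model this way will usually hurt retrieval. The input vector is a toy example.

```python
import math

def truncate_embedding(vector: list[float], target_dim: int) -> list[float]:
    """Keep the first target_dim components and L2-normalise the result."""
    if not 0 < target_dim <= len(vector):
        raise ValueError("target_dim must be between 1 and the original dimension")
    head = vector[:target_dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Re-normalise so cosine and dot-product similarity stay comparable after truncation.
    return [x / norm for x in head] if norm > 0 else head

full = [3.0, 4.0, 12.0]            # toy 3-dimensional vector
reduced = truncate_embedding(full, 2)
print(reduced)
```

Queries and documents must be truncated to the same dimension, and the vector index has to be built at that reduced dimension.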
Indicative recommendations for different scenarios:
- Quick PoC, English content, cloud: text-embedding-3-small (1,536 dim, low cost) or text-embedding-3-large reduced to 512 dim via Matryoshka.
- Production cloud, multilingual EU content: Cohere Embed v4 (multimodal + long context) or Gemini Embedding 001.
- Self-hosted, Slovak/Czech/Polish content: BGE-M3 is the default. For a larger model with a higher score: Qwen3-Embedding-8B or Llama-Embed-Nemotron-8B.
- Scaling to tens of millions of vectors: consider Matryoshka dimension reduction (e.g. to 768) — at 50M vectors the storage savings are substantial.
- Multimodal content (text + images): Cohere Embed v4 or Voyage AI voyage-multimodal-3.5.
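To put a number on the storage argument for the scaling scenario, the sketch below estimates raw vector storage. It assumes float32 values (4 bytes each) and ignores index structures, metadata, and replication, all of which add to the real footprint.

```python
def raw_vector_storage_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in GB (10^9 bytes). Assumes float32 by default."""
    return num_vectors * dim * bytes_per_value / 1e9

full_size = raw_vector_storage_gb(50_000_000, 3072)
reduced_size = raw_vector_storage_gb(50_000_000, 768)
print(f"3,072 dim: {full_size:.1f} GB, 768 dim: {reduced_size:.1f} GB")
```

At 50M vectors, reducing from 3,072 to 768 dimensions cuts raw vector storage to a quarter, before any quantisation the vector database may apply on top.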
For a comparison of the vector databases where you will store these embeddings, see Vector databases — comparing Qdrant, Weaviate, pgvector, Milvus.
Domain fit: the most commonly overlooked factor
MTEB benchmark scores reflect average performance across diverse test sets. Your real-world content is narrow and specific:
- Engineering documentation (drawing notes, service manuals, ISO standards) — technical language with precise terminology, abbreviations, part numbers. Dense embeddings capture semantics; the BM25 layer captures exact codes. BGE-M3 hybrid mode is an advantage here.
- Legal texts (labour code, contracts, standards) — formal language, paragraph references, emphasis on exact wording. Tests show that a domain-fine-tuned model (trained on Slovak legal texts) outperforms a generic model by 10–15% precision.
- Internal company KB (emails, meeting notes, process documents) — variable language, mixed writing styles. A generic model works well here; fine-tuning only makes sense at high volume (50k+ documents).
- Product catalogue (SKUs, descriptions, technical parameters) — short texts, exact matches. For e-commerce or distributor catalogues BM25 carries significant weight; the embedding model adds semantics ("blue metric screw" = "M6 blue screw DIN 912").
Before selecting a model, answer this question: what proportion of your queries require semantic understanding vs. exact lexical matching? For industrial companies with technical documentation, hybrid retrieval is almost always the right choice — and with hybrid retrieval you need an embedding model that natively supports sparse vectors (BGE-M3), or you must be willing to manage a separate BM25 index.
Evaluation: how to test before deployment
Don't rely solely on MTEB — build a mini eval set:
1. Select 100–200 real queries that reflect your production use case.
2. For each query, manually identify the "ideal" source documents (ground truth).
3. Run retrieval (top-5 or top-10) for each candidate model.
4. Measure precision@5 (the share of the top-5 results that are relevant) and recall@10 (the share of all relevant documents that appear in the top-10).
5. Also compare latency and the cost of embedding your entire corpus.
This test will show you the real difference between models on your data in 2–3 days of work. From production experience: on English content, candidates differ by 3–8% precision. On Slovak technical content the differences are more pronounced — we have seen gaps of 15–20% between a weaker EN-primary model and BGE-M3.
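The two metrics can be computed in a few lines. The sketch below uses an invented two-query eval set with placeholder document ids; in practice the ranked lists come from each candidate model and the ground truth from your manual labelling.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Hypothetical eval set: query id -> (ranked ids returned by a model, ground-truth ids).
eval_set = {
    "q1": (["d1", "d7", "d3", "d9", "d2"], {"d1", "d2", "d4"}),
    "q2": (["d5", "d6", "d8", "d1", "d0"], {"d5"}),
}

mean_p5 = sum(precision_at_k(r, g, 5) for r, g in eval_set.values()) / len(eval_set)
mean_r10 = sum(recall_at_k(r, g, 10) for r, g in eval_set.values()) / len(eval_set)
print(f"precision@5 = {mean_p5:.2f}, recall@10 = {mean_r10:.2f}")
```

Run the same eval set against every candidate and compare the averaged numbers side by side, together with latency and embedding cost.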
For evaluation of the full RAG pipeline (not just retrieval, but generation quality as well) see How to evaluate RAG (RAGAS).
Frequently asked questions
Is BGE-M3 still a current choice in 2026?
Yes. BGE-M3 remains the production standard for open-weight multilingual deployments precisely because of its unique combination of dense + sparse + multi-vector retrieval in a single pass — no other open-weight model offers this in one model. Qwen3-Embedding-8B achieves higher MTEB scores, but requires more hardware and does not provide native sparse retrieval. For most EU clients with an existing GPU server, BGE-M3 remains a solid default.
Do I need a special model for non-English content?
Not necessarily. BGE-M3, Qwen3-Embedding, and Llama-Embed-Nemotron cover Slovak as part of their training data and work well in practice. A dedicated Slovak-trained embedding model does not exist as a public SOTA open-weight model in 2026. If you have a large volume of Slovak domain data (10k+ documents), fine-tuning a generic multilingual model on your content can yield better results — but that is a project in itself, not an out-of-the-box solution.
Can I use one embedding model for both retrieval and reranking?
No — an embedding model (bi-encoder) and a reranker (cross-encoder) are architecturally different. An embedding model encodes documents and queries independently into vectors (fast); a reranker scores a (query + document) pair jointly (more precise, slower). A complete pipeline needs both — more detail in the article RAG pipeline — 3 quality configurations.
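The division of labour can be sketched as a two-stage function. The two scoring functions below are toy stand-ins invented for this example (word overlap in place of vector similarity, a phrase bonus in place of a cross-encoder); in a real pipeline, stage 1 would compare precomputed document vectors and stage 2 would call a reranker model.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(query: str, documents: list[str], fast_score: Scorer,
                         precise_score: Scorer, candidates: int = 20, final: int = 5) -> list[str]:
    # Stage 1: cheap scoring over the whole corpus (the bi-encoder's job).
    shortlisted = sorted(documents, key=lambda d: fast_score(query, d), reverse=True)[:candidates]
    # Stage 2: expensive joint scoring on the shortlist only (the cross-encoder's job).
    return sorted(shortlisted, key=lambda d: precise_score(query, d), reverse=True)[:final]

def overlap_score(query: str, doc: str) -> float:
    """TOY stand-in for vector similarity: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_score(query: str, doc: str) -> float:
    """TOY stand-in for a cross-encoder: rewards the exact query phrase."""
    return overlap_score(query, doc) + (2.0 if query.lower() in doc.lower() else 0.0)

docs = [
    "seal the pump housing before transport",
    "pump seal replacement procedure",
    "holiday policy",
    "pump overview",
]
result = retrieve_then_rerank("pump seal", docs, overlap_score, phrase_score, candidates=3, final=2)
print(result)
```

The first stage keeps latency low by scoring everything cheaply; the second stage spends its budget only on the shortlist, which is where the precision gain comes from.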
How much does it cost to embed an entire company knowledge base?
It depends on volume. As a rough guide: 1 million tokens with OpenAI text-embedding-3-large costs ~$0.13; with 3-small ~$0.02. For a 10,000-page PDF corpus (~5M tokens), the one-time embedding cost with a cloud API is therefore well under a dollar (about $0.65 with 3-large, $0.10 with 3-small); only corpora in the hundreds of millions of tokens reach tens of dollars. With self-hosted BGE-M3, the marginal cost is essentially zero once the GPU server is paid for. Re-embedding (when changing the model or chunking strategy) costs the same again, which is why it pays to choose the right model from the start.
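The arithmetic is simple enough to check yourself. The sketch below applies the per-million-token prices quoted above to a ~5M-token corpus; the prices are approximate list prices and the corpus size is the rough estimate from the example, so treat the output as an order of magnitude.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_tokens: float) -> float:
    """One-time API cost of embedding a corpus of total_tokens tokens."""
    return total_tokens / 1_000_000 * price_per_million_tokens

corpus_tokens = 5_000_000  # ~10,000 PDF pages, as estimated in the text

cost_large = embedding_cost_usd(corpus_tokens, 0.13)  # text-embedding-3-large
cost_small = embedding_cost_usd(corpus_tokens, 0.02)  # text-embedding-3-small
print(f"3-large: ${cost_large:.2f}, 3-small: ${cost_small:.2f}")
```

Re-run the calculation with your own token count; every re-embedding after a model or chunking change incurs the same amount again.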
When does fine-tuning an embedding model make sense?
When you have domain content where a generic model systematically fails (precision below 70%), you have enough data (typically 5,000+ relevant query-document pairs), and a production system where every percentage point of precision has business value. Regulated industries (law, medicine) are the classic example. For a typical internal knowledge base, fine-tuning is beyond what is needed — BGE-M3 or Qwen3-Embedding will suffice.
*MP Industrial Solutions helps companies design and deploy RAG architecture — from embedding model selection through chunking strategy to vector database and evaluation harness. If you are facing a selection decision or want to measure the performance of an existing deployment, we are happy to conduct a free 90-minute audit of your pipeline.*
