When a company decides to deploy RAG (Retrieval-Augmented Generation) over internal documentation, the first technical question is: where do we store the vectors? That choice is far from neutral — it will affect response latency, monthly infrastructure costs, maintenance complexity, and what will or won't be feasible when data volume doubles. In practice, we see teams reaching most often for one of four solutions: pgvector, Qdrant, Weaviate, and Milvus. Each has a different scope, different strengths, and a different context where it makes sense.
This article is not a benchmark shootout (publicly available benchmarks are inherently dependent on hardware, dimensionality, and workload pattern). It is a decision framework — how to choose based on what your company actually needs.
What all four have in common
All four solutions handle approximate nearest neighbor (ANN) search with an HNSW index, metadata filtering, and integration with standard embedding models. All are open-source under permissive licences (Apache 2.0 for Qdrant and Milvus, BSD-3-Clause for Weaviate, the PostgreSQL License for pgvector), with self-host or managed cloud options. A common embedding dimension of 1,024 or 1,536 poses no challenge for any of them.
The differences emerge when you hit scale, consistency requirements, existing infrastructure, or the need for hybrid search (vector + keyword). That is where the four paths diverge.
pgvector — when you already have Postgres
pgvector is a PostgreSQL extension. It does not exist as a standalone database — it runs inside your existing PG instance. This is simultaneously its strongest argument and its primary limitation.
When it makes sense:
If your company already runs PostgreSQL and vector collection size stays in the tens of millions, pgvector is a legitimate production choice. The advantage is fundamental: vectors, structured data, and metadata live in one database, under one transaction, with one backup process. No synchronisation between two systems, no additional infrastructure.
Recent pgvector releases have added sparse vector support (the sparsevec type) and improved index build and filtered-query performance. With an HNSW index and appropriate configuration, a database on a standard server can indicatively handle 5,000 to 15,000 QPS against a collection of around 10 million vectors at 1,024 dimensions. That is sufficient for the vast majority of B2B RAG deployments over internal documentation.
Where it has limits:
With a collection exceeding 50 million vectors, Postgres starts hitting architectural constraints: it is a row store optimised for transactional workloads, not for pure ANN throughput. Performance degrades, and scaling requires either sharding or migration to a different solution. Another limitation is that pgvector has no native hybrid search. Combining vector and keyword search requires assembling the pipeline manually, for example via tsvector and a hand-rolled RRF (Reciprocal Rank Fusion).
Typical project profile: internal knowledge base or customer support, collection of 1–5 million vectors, a team already managing an existing Postgres instance, preference for not expanding infrastructure.
Qdrant — performance and single-node self-host
Qdrant is a database built from the ground up for vector search. It is written in Rust and focuses on performance and straightforward self-hosting: a single binary or Docker container, with no external dependencies.
The key capability that sets Qdrant apart in 2026 is native ColBERT-style multi-vector (late interaction) support. You can store several vectors per document, typically one per token, and score documents by how those vectors interact with the query's token vectors. This enables high-quality retrieval without a separate reranker server, because Qdrant computes the late-interaction score itself. For projects where recall quality matters and you do not want to operate a separate reranker service, this is a significant advantage.
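Late interaction is usually scored with MaxSim, as introduced by ColBERT:

1. For each query token vector, take the highest similarity against any of the document's token vectors.
2. Sum those maxima.

The sketch below computes this in plain Python using dot products on toy, unit-length 2-dimensional vectors. It illustrates the scoring rule only, not how Qdrant implements it internally.

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: list[Vector], doc_tokens: list[Vector]) -> float:
    """ColBERT-style late interaction: for each query token, its best match, summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical token embeddings (unit length), two tokens per text.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]   # has a good match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # only matches the first query token
print(maxsim(query, doc_a), maxsim(query, doc_b))   # doc_a scores higher: 1.8 vs 1.0
```

A single pooled vector per document would blur this distinction. Keeping per-token vectors is what lets the score reward a document that covers every part of the query, at the cost of storing many vectors per document.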
Performance profile: at 10 million vectors, 1,024 dimensions, and a load of around 1,000 QPS, P99 latency sits at approximately 12 ms. It scales well to hundreds of millions of vectors on a single node with sufficient RAM, or via the Qdrant Cloud managed service for larger installations.
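"Sufficient RAM" is easy to estimate, at least for the raw vectors. The sketch below multiplies vector count by dimensionality by bytes per component (4 for float32). It then adds a rough allowance for HNSW graph links. The link estimate assumes about 2 × m neighbour IDs of 4 bytes each per vector on the base layer, with m = 16; that is an assumption for illustration. Real engines add further overhead for payloads, write-ahead logs and upper graph layers.

```python
def estimate_index_ram_gb(
    num_vectors: int,
    dims: int,
    bytes_per_component: int = 4,   # float32; 2 for float16, 1 for int8 quantisation
    hnsw_m: int = 16,               # assumed HNSW "m" parameter
    bytes_per_link: int = 4,        # assumed 32-bit neighbour IDs
) -> tuple[float, float]:
    """Return (raw vector GB, rough HNSW base-layer link GB) in decimal gigabytes."""
    raw_bytes = num_vectors * dims * bytes_per_component
    link_bytes = num_vectors * 2 * hnsw_m * bytes_per_link   # ~2*m links per node at layer 0
    return raw_bytes / 1e9, link_bytes / 1e9


for n in (10_000_000, 100_000_000):
    vectors_gb, graph_gb = estimate_index_ram_gb(n, 1024)
    print(f"{n:>11,} vectors x 1,024 dims: ~{vectors_gb:.1f} GB vectors + ~{graph_gb:.1f} GB graph links")
```

At 100 million float32 vectors of 1,024 dimensions, the raw data alone is around 410 GB. This is why single-node deployments at that scale typically lean on quantisation or on-disk vector storage rather than keeping everything in RAM.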
Hybrid search: Qdrant supports sparse vectors (BM25-style) as a first-class feature. Dense and sparse vectors can be queried in a single request, with the results fused server-side, so you do not have to merge them manually. For a typical enterprise RAG scenario (technical documentation, product catalogues, service manuals) this is ideal. Exact keywords (part numbers, codes, specific terms) are captured by the sparse branch, and semantics by the dense branch.
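The sparse branch catches exact identifiers because of how sparse vectors are scored. A sparse vector is just a map from term (or token) IDs to weights, and scoring is a dot product over the IDs both vectors share. The sketch below uses hypothetical term IDs and hand-picked weights standing in for BM25-style values. It shows the scoring rule, not Qdrant's client API.

```python
SparseVector = dict[int, float]   # term ID -> weight; absent IDs count as zero


def sparse_dot(query: SparseVector, doc: SparseVector) -> float:
    """Dot product over the term IDs present in both sparse vectors."""
    # Iterate over the smaller vector, so the cost depends on its size, not the vocabulary.
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(weight * large[term] for term, weight in small.items() if term in large)


# Hypothetical vocabulary: 101 = "pump", 202 = "seal", 9001 = "PX-4471" (a part number).
query = {101: 0.4, 9001: 2.5}
manual_with_part = {101: 0.3, 202: 0.5, 9001: 2.1}
manual_without_part = {101: 0.9, 202: 0.7}
print(sparse_dot(query, manual_with_part), sparse_dot(query, manual_without_part))
```

A dense embedding may place both manuals close to the query, since both are about pump seals. The rare part-number term is what separates them, and that is the signal the sparse branch contributes to the hybrid result.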
When it makes sense:
The team wants a self-hosted solution, predictable latency, native hybrid search, and does not want to manage a distributed cluster. Collection size can range from millions to hundreds of millions of vectors. Single-person infra operators appreciate the simple deployment.
Where it has limits:
For truly large collections (billions of vectors), native distribution is more limited than Milvus. The ecosystem of integration connectors is smaller than Weaviate's.
Weaviate — modular hybrid search
Weaviate has a different philosophy from Qdrant: every capability is a module, whether embedding generation, keyword indexing, reranking or multimodal input. This trades simplicity for flexibility, because many capabilities can be switched on by configuration instead of custom code.
Key capability: native hybrid search (BM25 + dense vector + metadata filter) as a first-class citizen in the query API. The BM25 index and vector index are integrated into a single query planner. Keyword and vector results are fused automatically, using either ranked fusion (RRF-style) or relative-score fusion. Weaviate also offers a rich GraphQL query API, which teams appreciate when they need complex filtering (multi-tenant, access rights, hierarchical metadata).
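Weaviate's hybrid queries take an alpha parameter that weights the vector side against the keyword side. Alpha = 1 means pure vector search and alpha = 0 means pure keyword search. The sketch below shows the idea behind relative-score fusion under that convention:

1. Min-max normalise each result set's scores to the 0–1 range.
2. Take the alpha-weighted sum.

This is a plain-Python illustration with made-up scores, not Weaviate's code.

```python
def min_max_normalise(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, every hit gets 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def relative_score_fusion(
    vector_scores: dict[str, float], keyword_scores: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """alpha weights the vector side; a document missing from one side gets 0 there."""
    v = min_max_normalise(vector_scores)
    k = min_max_normalise(keyword_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in v.keys() | k.keys()
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Made-up raw scores: cosine similarities and BM25 scores live on different scales.
vector_scores = {"d1": 0.91, "d2": 0.88, "d3": 0.70}
keyword_scores = {"d2": 12.4, "d3": 3.1, "d4": 9.8}
print([doc for doc, _ in relative_score_fusion(vector_scores, keyword_scores, alpha=0.5)])
```

Unlike rank-based fusion, this keeps information about how far apart the scores are, not just their order. For the same reason, it is more sensitive to outliers in either result set.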
Performance: indicatively 25,000 to 50,000 QPS at 10 million vectors, P99 latency around 16 ms. Costs for the cloud variant (Weaviate Cloud) are higher than Qdrant, but the team gets managed reranking, monitoring, and an SLA.
When it makes sense:
Multi-tenant deployments (multiple customers or departments in one instance with isolated data), need for advanced filtering, teams that value a managed service with a richer ecosystem. Also a good fit for projects with multimodal content, where the Weaviate image-embedding module saves custom code.
Where it has limits:
The modular architecture adds operational complexity — more configuration, more failure surface. For a straightforward RAG use case, Weaviate is over-engineered. The schema-first approach (defining types before indexing) can slow down prototyping.
Milvus — extreme scale
Milvus is distributed-first from the first line of code. It runs on Kubernetes, with etcd for coordination, S3-compatible storage for persistence, and separate components for ingestion, indexing, and querying. This architecture is overhead for small projects — and an advantage for large ones.
At billion-vector scale, Milvus achieves 100,000+ QPS throughput with horizontal scaling. GPU-accelerated ANN search further reduces latency on large indexes. The managed version — Zilliz Cloud — removes the operational burden.
When it makes sense:
Collections exceeding 100 million vectors, high-throughput workloads (e-commerce personalisation, recommendation systems, vector search over billion-item catalogues), teams with DevOps capacity to operate a Kubernetes stack.
Where it has limits:
For a typical enterprise RAG with millions of vectors, Milvus is unnecessary overhead. A minimal production installation requires multiple components — etcd cluster, MinIO/S3, query nodes, data nodes. For a single-person team, that is too much infrastructure to maintain. Hybrid search exists but is less fluid than in Qdrant or Weaviate.
Decision framework: three questions before choosing
Rather than a feature comparison table (where every cell requires context), we recommend answering three questions in order:
1. What is the collection size today and in two years?
- Up to 5 million vectors and existing Postgres → try pgvector first. You avoid new infrastructure.
- 5 million to hundreds of millions → Qdrant or Weaviate, based on the criteria below.
- Billions of vectors or 100,000+ QPS → Milvus / Zilliz Cloud.
2. How many people manage the infrastructure?
- One developer or a small team → Qdrant (simple self-host, binary/Docker, minimal dependencies).
- Team with DevOps capacity that prefers a managed service → Weaviate Cloud or Zilliz Cloud.
- Kubernetes-native team experienced with distributed systems → Milvus.
3. What is the query structure?
- Pure semantic search → any of these databases.
- Hybrid (keyword precision + semantics) → Qdrant (sparse + dense natively) or Weaviate (BM25 + dense natively). Using pgvector requires assembling hybrid search manually.
- Multi-tenant with isolated data, complex filtering → Weaviate has the most mature query planning.
- ColBERT / multi-vector late interaction → Qdrant offers native support. In the other three, multi-vector support is newer or absent, so check it against their current releases.
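Some teams like to make such rules explicit, for example for an internal architecture review. For them, the three questions can be encoded as a small function. The sketch below is one possible reading. The thresholds come from the list above, but three things are our own assumptions: the order in which the questions override one another, the infra_team labels, and the Workload fields. Treat the output as a shortlist to benchmark, not a verdict.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    vectors_in_two_years: int
    peak_qps: int
    has_postgres: bool
    infra_team: str                  # hypothetical labels: "solo", "managed", "kubernetes"
    needs_multi_tenant: bool = False
    needs_late_interaction: bool = False


def shortlist(w: Workload) -> list[str]:
    """First-pass shortlist from the three questions; benchmark before deciding."""
    # Question 1: extreme scale or throughput goes to Milvus or its managed version.
    if w.vectors_in_two_years >= 1_000_000_000 or w.peak_qps >= 100_000:
        return ["Milvus"] if w.infra_team == "kubernetes" else ["Zilliz Cloud"]
    # Question 1: a small collection plus existing Postgres means trying pgvector first,
    # unless the query structure (question 3) needs late interaction.
    if w.vectors_in_two_years <= 5_000_000 and w.has_postgres and not w.needs_late_interaction:
        return ["pgvector"]   # hybrid search would be hand-built (tsvector + RRF)
    # Question 3: query structures with a clear favourite.
    if w.needs_late_interaction:
        return ["Qdrant"]
    if w.needs_multi_tenant:
        return ["Weaviate Cloud"] if w.infra_team == "managed" else ["Weaviate"]
    # Question 2: who runs the infrastructure.
    if w.infra_team == "solo":
        return ["Qdrant"]
    if w.infra_team == "managed":
        return ["Weaviate Cloud", "Qdrant Cloud"]
    return ["Qdrant", "Weaviate"]


print(shortlist(Workload(3_000_000, 200, has_postgres=True, infra_team="solo")))     # ['pgvector']
print(shortlist(Workload(40_000_000, 1_000, has_postgres=True, infra_team="solo")))  # ['Qdrant']
```

Writing the rules down this way mainly forces the team to agree on the two-year size estimate and the operating model. In practice, those two inputs decide most cases.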
Performance numbers in context
Publicly available benchmarks (such as ann-benchmarks.com and other public comparisons) show indicatively:
On a collection of 10 million vectors at 1,024 dimensions, the throughput figures are maximum QPS and the P99 latencies are measured at a load of around 1,000 QPS:
- pgvector: 5,000–15,000 QPS maximum; P99 latency varies with hardware
- Qdrant: 30,000–80,000 QPS; P99 around 12 ms
- Weaviate: 25,000–50,000 QPS; P99 around 16 ms
- Milvus: 100,000+ QPS with horizontal scaling; P99 around 18 ms (distributed setup)
These figures are indicative only. Real results depend on vector dimensionality, indexing strategy (HNSW parameters), hardware (memory, CPU/GPU), workload pattern, and whether the deployment is single-tenant or multi-tenant. A benchmark from another project is only a starting point for your own measurement.
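A first measurement on your own corpus needs two numbers:

- recall@k against exact (brute-force) search;
- tail latency.

The sketch below is a minimal, library-free harness. The search argument stands in for your database client call, a hypothetical callable you supply. The percentile uses the nearest-rank method. Latency is measured sequentially; to reproduce a load such as 1,000 QPS, you would drive the same calls from several concurrent workers.

```python
import math
import random
import time
from typing import Callable

Vector = list[float]
SearchFn = Callable[[Vector, int], list[int]]   # (query, k) -> IDs of the k nearest


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 gives P99."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def benchmark(search: SearchFn, exact: SearchFn, queries: list[Vector], k: int = 10) -> dict[str, float]:
    """Mean recall@k of `search` against `exact`, plus sequential latency percentiles in ms."""
    latencies_ms: list[float] = []
    recalls: list[float] = []
    for q in queries:
        start = time.perf_counter()
        found = search(q, k)                      # only the candidate search is timed
        latencies_ms.append((time.perf_counter() - start) * 1000)
        truth = set(exact(q, k))
        recalls.append(len(truth.intersection(found)) / k)
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Demo on random data; in practice the corpus is a sample of your own embeddings.
random.seed(0)
DIMS = 16
corpus = [[random.gauss(0, 1) for _ in range(DIMS)] for _ in range(1_000)]


def brute_force(q: Vector, k: int) -> list[int]:
    """Exact k-NN ground truth; use the same distance metric as your index (here squared L2)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(q, v)), i) for i, v in enumerate(corpus))
    return [i for _, i in dists[:k]]


queries = [[random.gauss(0, 1) for _ in range(DIMS)] for _ in range(20)]
# Replace the first argument with a function that calls your vector database.
print(benchmark(brute_force, brute_force, queries))
```

Recall matters as much as latency. An index tuned for a low P99 can silently return worse context chunks, which then show up as weaker RAG answers rather than as a database error.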
Monthly costs for cloud managed variants at a typical enterprise RAG workload (10 million vectors, ~1,000 QPS) are broadly in the low thousands of euros — with significant variation by provider and region. Self-hosting on your own server reduces variable cost but adds fixed operational costs.
Where pgvector positively surprises
One of the most common misconceptions among SK/EU companies: "pgvector is just a transitional solution for a PoC." In practice, that is not the case. We have seen manufacturing and logistics clients running RAG deployments over 2–3 million vectors in production on pgvector without issues, where the sole motivation was an existing PostgreSQL cluster and a team with no bandwidth for new infrastructure.
The keys to good pgvector performance in production: HNSW index (not IVFFlat), correct maintenance_work_mem configuration for index build, and partitioning as the collection grows. With these settings, pgvector handles the majority of enterprise internal RAG scenarios without migration.
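The settings named above translate into a handful of SQL statements. The sketch below only builds the statements as strings, so it runs without a database.

These names are illustrative assumptions to tune on your own hardware:

- the table name doc_chunks and its columns;
- maintenance_work_mem = 8GB and four build workers;
- ef_search = 100.

These are real pgvector/PostgreSQL names:

- the vector_cosine_ops operator class, the <=> cosine-distance operator and the hnsw.ef_search setting;
- the m and ef_construction index options (shown at pgvector's defaults of 16 and 64);
- max_parallel_maintenance_workers.

Partitioning is not shown.

```python
def pgvector_setup_sql(
    table: str = "doc_chunks",        # hypothetical table name
    dims: int = 1024,
    m: int = 16,                      # pgvector's default HNSW m
    ef_construction: int = 64,        # pgvector's default ef_construction
    build_mem: str = "8GB",           # assumption: large enough to build the graph in memory
    build_workers: int = 4,           # parallel HNSW builds need pgvector 0.6.0 or later
    ef_search: int = 100,             # query-time recall/latency trade-off (pgvector default: 40)
) -> list[str]:
    """Return the DDL and session settings for an HNSW-indexed chunk table, in order."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        f"id bigserial PRIMARY KEY, content text NOT NULL, metadata jsonb, embedding vector({dims}));",
        # Session settings that speed up the index build below.
        f"SET maintenance_work_mem = '{build_mem}';",
        f"SET max_parallel_maintenance_workers = {build_workers};",
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw ON {table} "
        f"USING hnsw (embedding vector_cosine_ops) WITH (m = {m}, ef_construction = {ef_construction});",
        # Query-time setting, issued in each session that runs searches.
        f"SET hnsw.ef_search = {ef_search};",
    ]


def top_k_query_sql(table: str = "doc_chunks", k: int = 10) -> str:
    """Top-k cosine search; %s is a psycopg-style placeholder for the query vector."""
    return f"SELECT id, content, metadata FROM {table} ORDER BY embedding <=> %s::vector LIMIT {k};"


for statement in pgvector_setup_sql():
    print(statement)
print(top_k_query_sql())
```

The operator class and the query operator must match. An index built with vector_cosine_ops is only used by queries ordering by <=>, so the build settings and the query template belong together.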
Where pgvector genuinely falls short: advanced hybrid search, multi-vector / ColBERT retrieval, and collections in the tens of millions where retrieval layer performance is critical for UX (sub-20 ms latency).
Integration with RAG frameworks
All four databases have first-class integrations in both LlamaIndex and LangChain. LlamaIndex provides QdrantVectorStore, WeaviateVectorStore, PGVectorStore, and MilvusVectorStore behind a consistent interface. Switching databases in code is therefore a small configuration change rather than a full pipeline rewrite, although the vectors themselves still have to be re-ingested.
For framework selection and RAG pipeline quality evaluation, see RAG pipeline — 3 quality settings — it covers chunking strategies, embedding upgrades, and reranking in detail, all of which are independent of the vector database choice.
The embedding model is a separate decision — for model selection see How to choose an embedding model and for hybrid search combining BM25 and vectors see Hybrid search (BM25 + vectors + reranking).
Frequently asked questions
Can I start with pgvector and migrate to Qdrant later?
Yes, and it is a common pattern. The LlamaIndex vector store abstraction is a thin enough layer — migration requires re-ingesting vectors (reindexing documents) and changing the store configuration. Pipeline logic, chunking, and the embedding model remain unchanged. As long as you store the original texts and metadata in a relational database, re-ingestion is typically a script, not months of work.
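When texts and metadata live in a relational table, re-ingestion is essentially a batched loop:

1. Read the rows.
2. Embed them with the model you already use.
3. Upsert them into the new store.

The sketch below uses an in-memory SQLite table to stay self-contained. Three names are hypothetical stand-ins, not a real library API: embed_batch stands in for your embedding model, and VectorStore and InMemoryStore stand in for the target database client.

```python
import sqlite3
from typing import Iterator, Protocol


class VectorStore(Protocol):
    """Hypothetical minimal interface of the target store."""

    def upsert(self, ids: list[int], vectors: list[list[float]], payloads: list[dict]) -> None: ...


class InMemoryStore:
    """Stand-in for the real target client; it just records what it receives."""

    def __init__(self) -> None:
        self.points: dict[int, tuple[list[float], dict]] = {}

    def upsert(self, ids: list[int], vectors: list[list[float]], payloads: list[dict]) -> None:
        for point_id, vector, payload in zip(ids, vectors, payloads):
            self.points[point_id] = (vector, payload)


def embed_batch(texts: list[str]) -> list[list[float]]:
    """Hypothetical embedding call; normally the same model as before the migration."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]   # toy 2-D vectors


def read_batches(conn: sqlite3.Connection, size: int) -> Iterator[list[tuple[int, str, str]]]:
    cur = conn.execute("SELECT id, content, source FROM chunks ORDER BY id")
    while rows := cur.fetchmany(size):
        yield rows


def reingest(conn: sqlite3.Connection, store: VectorStore, batch_size: int = 256) -> int:
    """Copy every chunk into the target store; returns the number of rows processed."""
    count = 0
    for rows in read_batches(conn, batch_size):
        ids = [row[0] for row in rows]
        vectors = embed_batch([row[1] for row in rows])
        payloads = [{"content": row[1], "source": row[2]} for row in rows]
        store.upsert(ids, vectors, payloads)   # upsert by ID, so a rerun overwrites instead of duplicating
        count += len(rows)
    return count


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [(i, f"chunk {i} of the service manual", "manual.pdf") for i in range(1, 6)],
)
store = InMemoryStore()
print(reingest(conn, store, batch_size=2), len(store.points))   # 5 5
```

Reusing the relational IDs as point IDs keeps the migration idempotent. If the script fails halfway, you can rerun it without cleaning up the target store first.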
Is Qdrant actually faster than Weaviate for my workload?
It depends on the workload pattern. Qdrant shows an advantage in single-tenant scenarios with uniform queries. Weaviate can be comparatively better for complex multi-tenant filtering, where its query planner optimises the intersection of vector and keyword filters. We recommend running your own benchmark on a representative sample of your data — one hour of measurement on your own corpus is more valuable than any public benchmark.
Do we need Milvus if we have 50 million vectors?
Probably not. Qdrant and Weaviate handle hundreds of millions of vectors on a well-configured single-node server. Milvus is justified for billion-vector collections or when 100,000+ QPS is required, where a distributed architecture genuinely delivers savings. For 50 million vectors, Milvus is more overhead than solution.
Does the choice of vector database affect RAG answer quality?
Directly, very little — answer quality depends primarily on chunking quality, the embedding model, and the reranker, not on which database stores the vectors. Indirectly, yes: if the vector database has high latency or low recall on filtered searches, worse context chunks reach the generative model. The database choice primarily affects performance, cost, and operational burden — not the retrieval pipeline logic itself.
Does pgvector work for non-English text?
Yes — a vector database is language-agnostic. It stores and searches vectors regardless of language. Language-specific properties are the responsibility of the embedding model, not the database. For multilingual content we recommend models such as BGE-M3 or Qwen3-Embedding — more in How to choose an embedding model.
*MP Industrial Solutions helps companies design and deploy RAG infrastructure suited to their actual workload — from assessing whether pgvector is sufficient to production deployment of Qdrant or Weaviate on on-premises servers. If you are considering a new deployment or troubleshooting performance issues in an existing vector layer, we are happy to schedule an initial no-obligation call.*
