Multimodal RAG: documents with tables, images and diagrams

You receive the technical documentation for a production line: 1,200 pages of PDF. Roughly a third are parameter tables, another third hydraulic and electrical schematics, and the rest is continuous prose. You set up a RAG pipeline, index the documents, test the first twenty questions — and three quarters of them come back with wrong or incomplete answers. A technician asks about the current load of a specific drive listed in a table on page 847. The model answers confidently and incorrectly, because it never saw the table — the naive pipeline parsed it as scrambled text and, at the chunking step, discarded the relationship between values and column headers.

This is not a theoretical problem. In industrial environments (manufacturing, energy, mechanical engineering), an estimated 40–60 % of the information value in documents is tied to non-text content: tables, drawings, P&ID diagrams, wiring schematics, production tolerances in charts. Text-only RAG systematically loses this information. Multimodal RAG addresses the problem, but not for free, and not in the same way for every content type. This article helps you decide when to invest in it, which tools to use, and where the real challenges lie.

Where text-only RAG fails and why

Before moving on to solutions, it is important to understand precisely why a naive pipeline fails on non-text content. There are three failure mechanisms:

Tables: Classic token-count chunking slices a table into fragments. Column headers end up in one chunk, values in the next. Merged cells disappear. The result: retrieval finds a fragment containing a number, but with no context explaining what that number means. The model either hallucinates or correctly says "I don't know" — both outcomes are unacceptable for technical documentation.

Images and schematics: A PDF parser extracts text and images separately. If an image is not described by surrounding text, the pipeline ignores it. Most industrial drawings have very sparse textual context — component numbers, a legend — while the informational content is carried by the visual layout. A text-only model cannot see that.

Scanned documents: Older production documents often exist only as scanned images. OCR converts them to text but loses the 2-D layout. A table rendered as continuous text looks like a meaningless sequence of numbers. Drawings remain images without any text, so the pipeline ignores them.

Three multimodal RAG architectures

There is no single solution. Three approaches have established themselves in practice, each with a different trade-off between quality, cost and complexity.

1. Caption-and-index

The simplest path to multimodal RAG. During the ingestion pipeline you use a vision-language model (VLM) — a model that understands image input — to automatically generate a textual description of every image and table. This description is stored alongside the textual page content and indexed with standard vector retrieval.

Implementation: Unstructured or Docling extracts images and tables from the PDF. For each image you call a VLM (e.g. GPT-4o, Gemini 2.5 Pro, or locally Qwen2.5-VL) with a prompt such as "describe exactly what you see in this technical drawing, including all numbers, codes and labels". The resulting text is then indexed.
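The ingestion step described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `ExtractedImage`, `CaptionRecord` and `caption_fn` are hypothetical names, and `caption_fn` stands in for whatever adapter you write around your VLM (cloud API or local model).

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Prompt based on the example in the text; tune it per document type.
CAPTION_PROMPT = (
    "Describe exactly what you see in this technical drawing, "
    "including all numbers, codes and labels."
)


@dataclass
class ExtractedImage:
    """Hypothetical container for an image the parser pulled from a PDF."""
    doc_id: str
    page: int
    image_bytes: bytes


@dataclass
class CaptionRecord:
    """Text record that is embedded and indexed like any other chunk."""
    doc_id: str
    page: int
    text: str
    source: str = "vlm_caption"


def caption_images(
    images: Iterable[ExtractedImage],
    caption_fn: Callable[[bytes, str], str],
) -> list[CaptionRecord]:
    """Turn extracted images into indexable text records.

    caption_fn is an assumed adapter: it receives image bytes and a prompt
    and returns the VLM's description as a string.
    """
    records = []
    for img in images:
        description = caption_fn(img.image_bytes, CAPTION_PROMPT).strip()
        if not description:
            continue  # skip empty captions instead of indexing noise
        records.append(CaptionRecord(img.doc_id, img.page, description))
    return records
```

Keeping the VLM behind a plain callable makes it easy to swap a cloud API for a local model later and to test the pipeline with a stub. Marking the record with `source="vlm_caption"` lets you filter or re-generate captions separately from parsed text.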

Advantages: retrieval works with standard embeddings, the pipeline is straightforward, and storage requirements are moderate — on the order of tens of GB for millions of pages. Disadvantage: quality depends on how accurately the VLM describes the image. Complex P&ID diagrams with dozens of components and ISA-5.1 symbology can be described imprecisely. For critical documents we recommend a human-in-the-loop review when generating descriptions.

2. Page-as-image with multi-vector retrieval (ColPali style)

Instead of extracting text from a PDF, each page is rendered as an image and embedded directly through a vision-language embedding model — typically models from the ColPali family, currently ColQwen2.5, which holds top positions on the ViDoRe V2 leaderboard for visual document retrieval.

The ColPali architecture generates hundreds of patch vectors per page rather than a single vector, which captures finer detail. Retrieval uses late interaction: each query token vector is compared against every patch vector on the page, the best match per query token is kept, and these maxima are summed (the same scheme ColBERT uses for text). The result is significantly higher precision on pages with mixed text-table-image content.
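To make the late-interaction scoring concrete, here is a small pure-Python sketch of the ColBERT-style "MaxSim" aggregation. The toy two-dimensional vectors and the function names are illustrative assumptions; real systems run this inside the vector database on much larger matrices.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query token vector, take the best
    matching patch vector on the page, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """pages maps page id -> list of patch vectors; returns ids, best first."""
    scored = {pid: maxsim_score(query_vecs, vecs) for pid, vecs in pages.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy example: two query token vectors, two pages with two patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "p1": [[1.0, 0.0], [0.0, 0.2]],
    "p2": [[0.5, 0.5], [0.4, 0.6]],
}
ranking = rank_pages(query, pages)
```

Because every query token picks its own best patch, a page can score well when one region matches a part code and another region matches a parameter name, which a single pooled vector would blur together.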

The disadvantage is clear-cut: storage. Where caption-and-index requires 1 vector per page, ColPali requires 100–1,000 vectors. On large corpora (tens of millions of pages) this means single-digit terabytes of vector storage. Qdrant has native support for multi-vector retrieval and ColBERT-style late interaction, which simplifies implementation. For most industrial corpora (10,000–500,000 pages) this overhead is acceptable — very large deployments must calculate storage costs explicitly.
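A back-of-the-envelope calculation helps to size this. The sketch below assumes 128-dimensional patch vectors stored as 16-bit floats for the multi-vector case and a 1,024-dimensional 32-bit vector for the single-vector case; these are illustrative assumptions, so substitute the values of your own model. Index overhead and payload metadata are not included.

```python
def vector_storage_gb(pages, vectors_per_page, dim, bytes_per_value=2):
    """Raw vector payload in GB (10**9 bytes), excluding index overhead."""
    return pages * vectors_per_page * dim * bytes_per_value / 1e9


# Single-vector baseline: 500,000 pages, one 1,024-dim float32 vector each.
single = vector_storage_gb(500_000, 1, 1024, bytes_per_value=4)

# Multi-vector, mid-sized corpus: 500,000 pages, 1,000 patch vectors each.
multi_mid = vector_storage_gb(500_000, 1000, 128)

# Multi-vector, very large corpus: 20 million pages.
multi_large = vector_storage_gb(20_000_000, 1000, 128)
```

Under these assumptions the single-vector index is about 2 GB, the mid-sized multi-vector index about 128 GB, and the 20-million-page corpus about 5 TB, which matches the orders of magnitude quoted above.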

3. Unified multimodal embeddings

The third approach: an embedding model that natively processes both text and images in a single space. Examples: Cohere Embed v4 (128,000-token context window, text + images, enterprise API), voyage-multimodal-3.5 (Voyage AI, supports video frames). The entire page is embedded directly without generating text, producing one vector per page.

Advantages: simplicity, moderate storage requirements (comparable to single-vector text embeddings), no dependency on the quality of VLM-generated descriptions. Disadvantage: these models are currently available only as cloud APIs — they are not suitable for on-premises or air-gapped environments. Retrieval precision is lower than ColPali on the most demanding visual documents, but for most enterprise corpora it is sufficient and significantly better than text-only.

Parsing: Docling and Unstructured

While the embedding architecture determines retrieval quality, parsing determines what enters the index in the first place. Two open-source libraries cover most needs:

`Docling` (IBM, MIT licence) converts PDFs into structured JSON. It recognises tables, preserves document hierarchy (chapter → subsection → content), and extracts images with references to their position on the page. It is fast even on CPU and does not require a GPU for ordinary text documents. For most industrial PDFs with printed tables it is the starting-point tool.

`Unstructured` has a broader scope: it handles more formats (DOCX, PPTX, HTML, emails, Excel spreadsheets) and can classify page elements with a layout-detection model. For scanned documents, use the `hi_res` strategy, which enables OCR and analyses page layout via a vision model. Disadvantage: `hi_res` is slow without a GPU and adds a dependency on the layout and OCR models, typically shipped as a Docker image.
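A minimal routing sketch for the two parsers might look like this. The `is_scanned` flag is an assumption: how you detect scanned PDFs (for example by checking for a text layer) is left to your pipeline. The library imports are done lazily inside the functions so the module loads even when only one parser is installed.

```python
def choose_parser(is_scanned: bool) -> str:
    """Routing rule: Docling for digital PDFs, Unstructured hi_res for scans."""
    return "unstructured_hi_res" if is_scanned else "docling"


def parse_with_docling(path: str) -> str:
    """Convert a digital PDF to Markdown with Docling."""
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()


def parse_with_unstructured(path: str) -> list:
    """Partition a scanned PDF into typed elements with Unstructured."""
    from unstructured.partition.pdf import partition_pdf

    # hi_res runs layout detection and OCR; infer_table_structure asks the
    # library to keep a structured representation of detected tables.
    return partition_pdf(
        filename=path,
        strategy="hi_res",
        infer_table_structure=True,
    )


def parse_pdf(path: str, is_scanned: bool):
    """Dispatch a PDF to the parser chosen by choose_parser."""
    if choose_parser(is_scanned) == "docling":
        return parse_with_docling(path)
    return parse_with_unstructured(path)
```

Note that the two functions return different shapes (a Markdown string versus a list of elements), so a real pipeline would normalise both into one chunk format before indexing.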

For industrial drawings (P&ID, electrical schematics), neither tool provides semantic understanding — they extract pixels or text, not the logic of the circuit. If you need Q&A over the logical content of a drawing (not just over its labels), you need a specialised VLM prompt with domain knowledge, or manual annotations. We cover this topic in more depth in the article LLM over industrial documentation.

Tables: the most common problem in practice

Tables are a special case that deserves its own section. In practice we see three types of tables in industrial documents, each requiring a different approach:

Parameter tables (device types × values) — well handled by Docling or the LlamaIndex table parser. Convert to Markdown or JSON representation before indexing. Preserve the header as part of every row during the chunking step — parent-child chunking where the parent is the full table and the child is a row with its header is a proven pattern.

Scanned tables — require a VLM. Send the image of the table to the model with an explicit prompt: "Extract the content of this table into JSON format, preserving column headers, all values including units and notes." Verify the result — a VLM can make mistakes with handwritten values or non-standard symbols.

Tables with cross-references (e.g. a part code referencing a drawing) — here a text-only approach is insufficient even with a good parser. You need explicit entity linking: table record id ↔ drawing id, stored in metadata. An agentic RAG approach, where the agent can perform a second retrieval based on a found reference, significantly improves answers here. More on agentic RAG in the article Agentic RAG.
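The parent-child pattern recommended for parameter tables can be sketched as follows. The dictionary layout and the `table_to_chunks` name are illustrative assumptions; the key point is that every child chunk repeats the column headers, so a retrieved row never loses its meaning.

```python
def table_to_chunks(table_id, headers, rows):
    """Build one parent chunk (the full Markdown table) and one child per row.

    Each child pairs every value with its column header, so a retrieved row
    is self-explanatory even without the rest of the table.
    """
    header_line = "| " + " | ".join(headers) + " |"
    separator = "| " + " | ".join("---" for _ in headers) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    parent = {
        "id": table_id,
        "text": "\n".join([header_line, separator, *body]),
    }
    children = [
        {
            "id": f"{table_id}:row{i}",
            "parent_id": table_id,
            "text": "; ".join(f"{h}: {v}" for h, v in zip(headers, row)),
        }
        for i, row in enumerate(rows)
    ]
    return parent, children


# Hypothetical example table; identifiers and values are made up.
parent, children = table_to_chunks(
    "T-847",
    ["Drive", "Rated current [A]"],
    [["M12", 14.5], ["M13", 9.8]],
)
```

At query time you embed and search the children, then pass the parent table (looked up via `parent_id`) to the LLM when the surrounding rows matter for the answer.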

When you actually need multimodal RAG

Multimodal RAG adds complexity and cost. Before deploying it, answer these questions honestly:

What percentage of the information your users ask about is tied to tables or images? If less than 20 %, address it selectively (manual descriptions of key drawings) rather than building a full multimodal pipeline.
Are your documents primarily printed (digital PDF) or scanned? Scanned documents require a GPU during ingestion, which raises indexing costs.
Do you need on-premises deployment, or is a cloud API acceptable? The most capable unified multimodal embeddings are currently only available via API.
How frequently are documents updated? A caption-and-index pipeline is cheaper on the first run, but whenever the corpus is updated you must regenerate image descriptions — VLM calls are not free.

For most industrial deployments with standard technical manuals (printed, structured tables, labelled drawings), caption-and-index with a `Docling`/`Unstructured` parser and a capable VLM is sufficient and substantially cheaper than ColPali.

ColPali or unified multimodal embeddings are worth the investment when you have visually rich documents where a textual description cannot capture the information — for example, complex hydraulic schematics with multi-level branching, or documents where the page layout itself carries information.

Challenges no tutorial mentions

Dozens of deployments have surfaced problems that almost everyone runs into:

Ingestion cost: If you have 500,000 pages and each contains an average of 2 images, generating descriptions via a cloud VLM API can cost on the order of thousands of euros for a single indexing run. Calculate this upfront. Local VLMs (e.g. Qwen2.5-VL 7B or 72B) reduce this cost to the price of GPU hours, but require sufficient VRAM — the 72B model in 4-bit quantisation needs roughly 40+ GB of VRAM.
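A quick estimate before the first indexing run can be as simple as the sketch below. The price per image and the local throughput are purely hypothetical placeholders, not quotes from any provider or benchmark; replace them with your own measured numbers.

```python
def cloud_captioning_cost_eur(pages, images_per_page, price_per_image_eur):
    """Total API cost for captioning every image once."""
    return pages * images_per_page * price_per_image_eur


def local_captioning_gpu_hours(pages, images_per_page, images_per_second):
    """GPU hours needed to caption every image once on a local VLM."""
    return pages * images_per_page / images_per_second / 3600


# Assumed values for illustration only.
cloud_eur = cloud_captioning_cost_eur(500_000, 2, price_per_image_eur=0.003)
gpu_hours = local_captioning_gpu_hours(500_000, 2, images_per_second=2.0)
```

With these placeholder values the cloud run lands at roughly 3,000 euros and the local run at roughly 139 GPU hours. Remember to multiply by the expected number of re-indexing runs per year.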

Document versioning: Technical documentation has revisions. Rev3 of a drawing may carry different values than Rev1. A multimodal pipeline must preserve the revision as metadata, and retrieval must be able to filter by it. It matters whether you want the latest version or a specific revision.
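The revision logic can be sketched in a few lines. This example filters in application code on hypothetical chunk dictionaries with integer `revision` fields; in production you would normally push the same condition down into the vector store as a metadata filter.

```python
def select_revision(chunks, revision=None):
    """Keep chunks of the requested revision, or of each document's latest one."""
    if revision is not None:
        return [c for c in chunks if c["revision"] == revision]
    latest = {}
    for c in chunks:
        latest[c["doc_id"]] = max(latest.get(c["doc_id"], 0), c["revision"])
    return [c for c in chunks if c["revision"] == latest[c["doc_id"]]]


# Hypothetical retrieved chunks from two drawings.
chunks = [
    {"doc_id": "D1", "revision": 1, "text": "max. pressure 160 bar"},
    {"doc_id": "D1", "revision": 3, "text": "max. pressure 180 bar"},
    {"doc_id": "D2", "revision": 2, "text": "supply voltage 400 V"},
]
current = select_revision(chunks)
historic = select_revision(chunks, revision=1)
```

Defaulting to the latest revision while allowing an explicit override covers both cases mentioned above: everyday questions and audits of a specific revision.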

Evaluation blind spots: RAGAS metrics (Faithfulness, Context Recall, Answer Relevancy) work well for text. For multimodal retrieval there is no standardised benchmark against your own documents. You must manually create a gold dataset — 100–200 questions with verified answers — and measure against it. Without this you cannot tell whether an architectural change helped or hurt.
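A minimal retrieval check against such a gold dataset might look like this. The data layout and the `retrieve` callable are assumptions: `retrieve(question, k)` stands in for your own retrieval function and returns page identifiers. The metric shown is a simple hit rate at k, one of several you could track.

```python
def hit_rate_at_k(gold, retrieve, k=5):
    """Share of gold questions for which at least one relevant page
    appears among the top-k retrieved pages."""
    if not gold:
        raise ValueError("gold dataset is empty")
    hits = 0
    for item in gold:
        retrieved = set(retrieve(item["question"], k))
        if retrieved & item["relevant_pages"]:
            hits += 1
    return hits / len(gold)


# Hypothetical gold items and a stub retriever for illustration.
gold = [
    {"question": "Rated current of drive M12?", "relevant_pages": {847}},
    {"question": "Max. pressure in circuit A?", "relevant_pages": {212, 213}},
]


def stub_retrieve(question, k):
    """Stand-in retriever that returns fixed page ids for two questions."""
    canned = {
        "Rated current of drive M12?": [846, 847, 850],
        "Max. pressure in circuit A?": [12, 99, 400],
    }
    return canned[question][:k]


score = hit_rate_at_k(gold, stub_retrieve, k=3)
```

Running the same script before and after every architectural change gives you the comparison the text asks for; answer correctness still needs manual review on top of this retrieval metric.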

Hallucinations are still present: RAG significantly reduces hallucinations compared to a pure LLM, but does not eliminate them. If the VLM described a table inaccurately during ingestion, the model will answer confidently based on that incorrect description. Importantly: the faithfulness metric measures the consistency of the answer with the provided context — not the factual correctness of the context itself. If the image description is wrong, faithfulness will be high and the answer will still be wrong. More on citations and grounding in the article citations and grounding in RAG.

Recommended architecture for industrial PDFs

If we were to distill experience from deployments in mechanical engineering and the energy sector into a concrete recommendation:

1. Parsing: Docling for printed PDFs, Unstructured hi_res for scanned. Extract tables as standalone entities, images as referenceable blocks.
2. Tables: convert to Markdown or JSON, index with parent-child chunking. The header must be part of every child chunk.
3. Images and drawings: for standard documents, caption-and-index with a domain-oriented VLM prompt (industrial, electrical, hydraulic). For visually intensive documents consider unified multimodal embeddings (Cohere Embed v4 or Voyage) or ColQwen2.5 if the storage overhead is acceptable.
4. Embedding + retrieval: BGE-M3 remains a reliable open-weight choice for SK/CZ/PL industrial texts: it combines dense, sparse and multi-vector in a single model. Hybrid search (BM25 + dense) is almost mandatory for technical documentation due to exact matching of codes and equipment designations.
5. Reranking: BGE-reranker-v2-m3 (self-hosted) or the Cohere Rerank API. It adds latency, but on complex documentation where the same term appears hundreds of times it is decisive.
6. Storage: Qdrant for multi-vector scenarios, pgvector if you have existing PostgreSQL infrastructure and volume under 50 million vectors.
7. Evaluation: a gold dataset of 100–200 questions drawn from real user queries, RAGAS for text metrics, manual annotation for multimodal answers.
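For the hybrid search in step 4, the BM25 and dense result lists have to be merged somehow. The text does not prescribe a method; one common option is reciprocal rank fusion (RRF), sketched below. The constant `k=60` is the value conventionally used with RRF, and the chunk identifiers are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from the two retrievers.
bm25_hits = ["c1", "c2", "c3"]   # exact match on a part code
dense_hits = ["c2", "c4", "c1"]  # semantic match
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF works on ranks only, so it needs no score normalisation between BM25 and cosine similarity. The fused list is then what you hand to the reranker in step 5.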

Frequently asked questions

Is ColPali always better than caption-and-index?

No. ColPali achieves higher recall on visually demanding documents, but at the cost of 100–1,000× more vectors per page. For most industrial corpora with printed tables and labelled drawings, caption-and-index with a good VLM prompt is sufficient at a fraction of the storage cost. ColPali pays off where the page layout itself carries information that a textual description cannot capture.

Can I use local VLM models, or do I have to pay for a cloud API?

Local VLMs are a fully functional alternative. Qwen2.5-VL in the 7B variant runs on a consumer GPU with 16–24 GB of VRAM; the 72B variant in 4-bit quantisation needs roughly 40+ GB of VRAM. For a caption-and-index pipeline where the VLM runs at ingestion time (not at every query), a local model is economically advantageous for larger corpora. Disadvantage: slower throughput during indexing. Cloud APIs (GPT-4o, Gemini) deliver higher description quality, but at 500,000+ pages the cost can be significant.

How should I handle documents in multiple languages — Slovak, English, German?

For embedding we recommend BGE-M3 or Qwen3-Embedding-8B — both models cover Slovak, Czech, Polish, German and English within their multilingual training. For VLM captioning of industrial documents, English remains the recommended language for descriptions (models are considerably stronger in it), but retrieval works in Slovak as well, thanks to multilingual embeddings.

When does it make sense to use a vector database instead of pgvector?

pgvector is a legitimate production choice up to 50 million vectors when you have existing PostgreSQL infrastructure. For multi-vector retrieval (ColPali style) with hundreds of vectors per page and a million pages, Qdrant with native multi-vector and ColBERT-style late-interaction support is significantly more efficient. A comparison of databases can be found in the article [vector databases: a comparison](/en/blog/vector-db-porovnanie).

What is a typical indexing time and cost for an industrial corpus?

It depends on the approach. For caption-and-index on 100,000 pages with a mix of text, tables and images, expect costs on the order of hundreds of euros with a cloud VLM API, or on the order of GPU hours on a single A100-equivalent with a local 7B VLM. The ColPali architecture shortens indexing because no descriptions need to be generated, but its retrieval infrastructure (multi-vector storage) is more costly. Text-only parsing with Docling without a VLM is a pure CPU workload and can handle hundreds of thousands of pages in hours on a standard server.

*Multimodal RAG is not a "fancy feature" — it is a prerequisite for a reliable system wherever tables and drawings carry critical information. In industry, that applies to the majority of technical documentation. If you are facing a similar challenge — documents where a text-only pipeline gives unsatisfactory answers — we are happy to look at your specific case and propose an architecture tailored to your corpus and infrastructure requirements.*