LLM over Industrial Documentation: Manuals, Standards, SOPs


A manufacturing floor, two in the morning. An experienced technician is searching through a 3,400-page equipment manual for the type and quantity of lubricant the manufacturer specifies for a bearing operating above 80 °C. The PDF has inconsistent page numbering, the table with the values is a scanned image, and the standard the manual references is in a separate file. After 40 minutes, he finds what he needs. The same situation repeats dozens of times per shift, across an entire plant.

This is exactly the problem that LLMs over industrial documentation are designed to solve. It is not a buzzword but a concrete, verifiable use case: a properly deployed system saves measurable hours per operator per month. This article draws the line: what works, where the naive approach fails, and what you must resolve before you start.

What you actually want to achieve

Before thinking about architecture, clarify what you expect from the system. In practice we see three use cases that look similar but have very different requirements:

1. Search and Q&A for technicians: a technician asks a natural-language question, and the system answers with a reference to a specific section of a manual or standard. The answer is not a dump of retrieved text but a statement with a citation: "According to section 7.3.2 of the XY equipment manual, the prescribed lubricant is ISO VG 220, to be applied every 2,000 operating hours."

2. SOP navigation and step-by-step assistance: an operator is following a procedure and needs an explanation of a step, an alternative when a tool is unavailable, or a quick check that they are doing the right thing. The system must be deterministic and accurate, because a wrong answer during a production procedure has direct costs.

3. Regulatory compliance and audit support: an engineer needs to quickly identify which parts of the documentation cover the requirements of a specific standard or directive (e.g. ISO 9001, IEC 62443, ATEX), or identify gaps. This requires understanding of both the standard's structure and the internal documentation.

Each of these use cases has different requirements for accuracy, latency, and the method of evaluation. Mixing them in a single pilot is a common mistake.

Why naive RAG fails on industrial documents

RAG — short for retrieval-augmented generation — is the foundational architecture: split documents into chunks, store them in a vector database, retrieve relevant chunks when a question arrives, and pass them to the model. On general documents this works well. On industrial documents you will hit four problems that require active solutions.

Problem 1: Naive chunking shreds tables

An equipment manual contains a table with 30 rows and 8 columns: equipment type, operating temperature, lubricant type, interval, quantity, standard, note, elevation. Naive character-count chunking splits this table into four fragments. Every fragment after the first loses the column headers, so the values in it can no longer be tied to their columns.

Solution: a document parser that recognises tables and preserves them as a whole, or converts them into a structured form (JSON, Markdown table). Tools like LlamaIndex have specialised parsers for exactly these cases. Multimodal models (e.g. Qwen2.5-VL) can extract tables even from scanned PDFs — which is common in older manuals.
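As a minimal sketch of the "preserve the headers" idea, assume the parser has already converted the table to a Markdown table. The function name `chunk_markdown_table`, the `rows_per_chunk` parameter, and the sample rows are illustrative and not taken from any library or real manual.

```python
def chunk_markdown_table(table_md: str, rows_per_chunk: int = 10) -> list[str]:
    """Split a Markdown table into chunks that each repeat the header rows."""
    lines = [ln for ln in table_md.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        # Not a table with header + separator; return it unchanged.
        return [table_md.strip()] if lines else []
    header, separator, rows = lines[0], lines[1], lines[2:]
    if not rows:
        return ["\n".join([header, separator])]
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        body = rows[start:start + rows_per_chunk]
        # Every chunk carries the header, so each row stays interpretable.
        chunks.append("\n".join([header, separator, *body]))
    return chunks


# Made-up sample data for illustration only.
table = "\n".join([
    "| Equipment | Temp (°C) | Lubricant |",
    "|---|---|---|",
    "| Pump A | 60 | ISO VG 100 |",
    "| Pump B | 85 | ISO VG 220 |",
    "| Fan C | 40 | ISO VG 68 |",
])
chunks = chunk_markdown_table(table, rows_per_chunk=2)
for chunk in chunks:
    print(chunk, end="\n\n")
```

Splitting by rows rather than by characters costs a few repeated tokens per chunk, which is usually a small price for keeping each value attached to its column name.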

Problem 2: Drawings and diagrams are invisible to a text-only RAG

Electrical schematics, hydraulic diagrams, P&ID (piping and instrumentation diagrams) — these are critical parts of technical documentation that a text-only pipeline simply skips. If a technician asks "where is valve V-12 located on the hydraulic schematic for line L2", a text-only RAG will either say "the information is not found in the documents" or hallucinate.

The solution depends on available resources. The lighter path: create structured text descriptions for key drawings once, manually or with the help of a VLM, and index them alongside the rest of the text. The more demanding path: a fully multimodal pipeline where a VLM generates an image description at indexing time — this works, but requires a computationally expensive model for every document added.

Problem 3: Standard versions and cross-references

Industrial documentation is full of references: "proceed according to ISO 14119:2013, Annex D" or "see drawing D-04-7812-rev3". Naive RAG does not resolve these references — it loads the fragment that mentions the standard, but has no access to the standard's actual content if it is not in the index. The result: an answer that cites the reference but contains no real information.

Solution: disciplined source management. Before deployment you must explicitly decide which standards and which versions are part of the index, and establish an update process for revisions. This is an organisational problem as much as a technical one.
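One way to make this decision visible is to scan chunks at ingestion time for references to standards that are not in the index. The sketch below is an assumption-laden illustration: the regular expression covers only ISO, IEC, and EN designations, and `find_unresolved_refs` and `indexed_standards` are hypothetical names.

```python
import re

# Matches designations such as "ISO 14119:2013" or "IEC 62443-3-3".
# Illustrative pattern only; other standards bodies would need their own rules.
STANDARD_REF = re.compile(r"\b(ISO|IEC|EN)\s+(\d+(?:-\d+)*)(?::(\d{4}))?")


def find_unresolved_refs(text: str, indexed: set[str]) -> list[str]:
    """Return referenced standards (in order of appearance) missing from the index."""
    missing: list[str] = []
    for match in STANDARD_REF.finditer(text):
        body, number, year = match.groups()
        ref = f"{body} {number}" + (f":{year}" if year else "")
        if ref not in indexed and ref not in missing:
            missing.append(ref)
    return missing


indexed_standards = {"ISO 14119:2013"}  # what the index is known to contain
chunk = "Proceed according to ISO 14119:2013, Annex D, and IEC 62443-3-3."
missing = find_unresolved_refs(chunk, indexed_standards)
print(missing)
```

A report like this does not resolve the references by itself; it gives the document owner a list of standards to either add to the index or explicitly exclude.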

Problem 4: Context window is not unlimited

Even the 1M-token context window of frontier models is not a solution for a 5,000-page manual. At a few hundred tokens per page the manual alone can exceed the window, and models recall details less reliably as inputs grow very long, an effect that long-context evaluations have reported repeatedly. RAG remains relevant even with long-context models, because it selectively loads only the relevant portions instead of the entire document.

How to build a reliable pipeline

A mature pipeline for industrial documentation adds several layers on top of basic RAG.

Document preparation (ingestion)

This is the longest step and the most underestimated. Before indexing:

- Distinguish content types: plain text, tables, images, drawings. Each type deserves different handling.
- Run scanned documents through OCR. For complex technical documents, prefer a specialised VLM or document intelligence API over classic Tesseract, because it understands page context (a number in the top-right corner is a page number; the same number inside a table is a critical value). More on this layer in the article OCR and document intelligence for industry.
- Metadata: for every chunk, store the source document, page number, section, document version, and validity date. Without metadata, citations are impossible.
- Hierarchical structure: chapter → subsection → table/procedure. LlamaIndex has native support for hierarchical chunking.
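The metadata fields listed above could be carried by a small record attached to every chunk. This is a sketch, not a schema from any framework: the class names, the `content_type` labels, and the example values (page 212, "rev3", the date) are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    source_document: str
    page: int
    section: str
    document_version: str
    valid_from: date  # the "validity date" of this document version

    def citation(self) -> str:
        """Render the fields a technician needs to find the source."""
        return (
            f"{self.source_document} ({self.document_version}), "
            f"section {self.section}, p. {self.page}"
        )


@dataclass(frozen=True)
class Chunk:
    text: str
    meta: ChunkMetadata
    content_type: str = "text"  # assumed labels: "text", "table", "image", "drawing"


# Hypothetical example values.
meta = ChunkMetadata(
    source_document="XY equipment manual",
    page=212,
    section="7.3.2",
    document_version="rev3",
    valid_from=date(2024, 1, 15),
)
chunk = Chunk(text="Prescribed lubricant: ISO VG 220.", meta=meta, content_type="table")
print(chunk.meta.citation())
```

Keeping the record immutable (`frozen=True`) is a design choice: a chunk's provenance should change only by re-ingesting the document, not by editing the index in place.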

Retrieval — more than vector search alone

Pure vector search has weaknesses: it captures semantically similar sections but may miss exact keyword matches (part number, alarm code, equipment designation). Hybrid search — combining BM25 (keywords) with vectors — matters more in industrial contexts than in general ones. See hybrid search with BM25 and vectors for a detailed breakdown.
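The article does not prescribe how the two result lists are merged. One commonly used option is reciprocal rank fusion (RRF), sketched here under the assumption that BM25 and vector search each return an ordered list of chunk IDs; the IDs and the constant `k = 60` are illustrative defaults.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list.

    Each appearance contributes 1 / (k + rank), so a chunk ranked well by
    both retrievers outranks one that only a single retriever found.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c7", "c2", "c9"]    # e.g. exact match on a part number
vector_hits = ["c2", "c4", "c7"]  # e.g. semantically similar sections
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

RRF works on ranks alone, which avoids having to normalise BM25 scores and cosine similarities onto a common scale.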

On top of hybrid search, add a reranker — a model that reorders results by relevance to the question. The BGE reranker (freely available) or an API (e.g. Cohere Rerank) significantly improves accuracy on long documents with repeating phrases.

Generation with citations

This is the key difference from a standard chatbot: every answer must include a reference to a specific source. The prompt must explicitly require:

- Which document, section, page.
- Verbatim quotation of critical values rather than paraphrasing.
- An explicit statement when information is not found in the available documents.

The last point is decisive. An answer of "I cannot find this information in the available manuals" is far better than a confidently hallucinated value. Configuring a system prompt that encourages the model to express uncertainty explicitly matters more than the choice of model. More on this in the article citations and grounding in RAG.
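The three prompt requirements can be sketched as a small prompt builder. The wording of the rules, the function name `build_grounded_prompt`, and the chunk dictionary keys are assumptions for illustration; a real deployment would tune the text against its evaluation set.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt that demands citations, verbatim values, and refusals."""
    sources = []
    for i, c in enumerate(chunks, start=1):
        sources.append(
            f"[{i}] {c['document']}, section {c['section']}, page {c['page']}\n{c['text']}"
        )
    rules = (
        "Answer only from the numbered sources below.\n"
        "Cite the document, section, and page for every claim.\n"
        "Quote critical values (quantities, intervals, part numbers) verbatim; "
        "do not paraphrase them.\n"
        "If the sources do not contain the answer, reply exactly: "
        '"I cannot find this information in the available manuals."'
    )
    return f"{rules}\n\nSources:\n" + "\n\n".join(sources) + f"\n\nQuestion: {question}\nAnswer:"


retrieved = [{
    "document": "XY equipment manual",
    "section": "7.3.2",
    "page": 212,  # hypothetical page number
    "text": "Prescribed lubricant: ISO VG 220, applied every 2,000 operating hours.",
}]
prompt = build_grounded_prompt("Which lubricant does the XY bearing need?", retrieved)
print(prompt)
```

Fixing the exact refusal sentence makes "I don't know" answers easy to count during evaluation.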

Evaluation before deployment

Do not expose technicians to a system you have not tested on real questions. A basic evaluation set:

- 50–100 questions from real situations that technicians actually face
- A verified answer with a source reference for each question (gold standard)
- Metrics: faithfulness (is the answer consistent with the source?), answer relevance (does it answer the question?), hallucination rate

Production systems target: faithfulness ≥ 95%, hallucination rate ≤ 2%. Correctly implemented RAG reduces hallucinations by 60–71% compared to a plain LLM without grounding — but does not eliminate them entirely.
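Assuming each evaluation answer has already been judged (by a reviewer or an automated grader) on the three criteria, the metrics and the targets above reduce to simple ratios. The `EvalResult` fields and function names below are hypothetical, and the 100 sample results are synthetic.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    question_id: str
    faithful: bool      # answer is consistent with the cited source
    relevant: bool      # answer addresses the question asked
    hallucinated: bool  # answer states something absent from the sources


def summarise(results: list[EvalResult]) -> dict[str, float]:
    """Compute the three rates over a judged evaluation set."""
    n = len(results)
    if n == 0:
        raise ValueError("evaluation set is empty")
    return {
        "faithfulness": sum(r.faithful for r in results) / n,
        "answer_relevance": sum(r.relevant for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
    }


def meets_targets(summary: dict[str, float],
                  min_faithfulness: float = 0.95,
                  max_hallucination: float = 0.02) -> bool:
    """Check a summary against the targets quoted in the text."""
    return (summary["faithfulness"] >= min_faithfulness
            and summary["hallucination_rate"] <= max_hallucination)


# Synthetic judgments: 96 of 100 faithful, 2 of 100 hallucinated.
results = [
    EvalResult(f"q{i}", faithful=i >= 4, relevant=True, hallucinated=i < 2)
    for i in range(100)
]
summary = summarise(results)
print(summary, meets_targets(summary))
```

Running this after every index update or prompt change turns the evaluation set into a regression test rather than a one-off check.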

Model selection: cloud vs. on-premises

Deployments over industrial documents follow a different logic than public ones. Technical documentation frequently contains internal know-how, design parameters, and safety procedures. Many companies, especially in regulated industries, do not want this data leaving their own infrastructure.

On-premises open-weight models are a realistic choice in 2026:

- Qwen 2.5 and Qwen 3 family (Apache 2.0 licence for most model sizes, suitable for commercial deployment; verify the licence of the specific checkpoint): strong at document understanding, including multimodal variants.
- Mistral Small (~22B): a good balance of performance and hardware requirements.
- DeepSeek R1/V3 (MIT licence): strong reasoning, well-suited for complex questions about standards and their interpretation.

Indicative hardware requirements: a 7B model runs on a card with 12–14 GB VRAM (with quantized inference); a 22B model requires 24 GB+ or multiple GPUs. More on hardware selection in the article custom PC for local LLMs.
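A rough lower bound behind such figures is the memory taken by the weights alone: parameter count times bytes per weight. The sketch below makes that arithmetic explicit. It deliberately ignores the KV cache, activations, and runtime overhead, which come on top and depend on context length and serving stack, so treat the results as a floor rather than a sizing recommendation.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for model weights only, in GB (1 GB = 10**9 bytes)."""
    return params_billion * bits_per_weight / 8


for params in (7, 22):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params}B model at {bits}-bit: {gb:.1f} GB of weights")
```

For example, a 7B model needs 14 GB for 16-bit weights but 7 GB at 8-bit, and a 22B model needs 22 GB at 8-bit or 11 GB at 4-bit before any working memory is added.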

For companies without sensitive data or with a clear data governance policy, frontier models (Claude Sonnet, GPT-4o class) are meaningfully better at complex reasoning over structured documents — at the cost of API charges and data egress.

Organisational prerequisites that technology cannot solve

Across dozens of RAG deployments over documentation, we have found that the technical part is usually the smaller problem compared to the organisational one.

Document version control. If a manual exists in five versions across different shared drives with no clear indication of which is current, you are indexing chaos. Before deploying AI there must be a single source of truth — one authoritative source of current documentation. This is not an AI problem; it is a document management problem.

Ownership of index updates. Who adds a new manual, and when? Who invalidates an old version of a standard? Without a defined process, after six months the system is working with outdated data and technicians stop trusting it.
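Part of that process can be automated once every document record carries a version and a validity date. The following sketch assumes such records exist; the field names, the function `select_current`, and the revision dates are hypothetical.

```python
from datetime import date


def select_current(records: list[dict], today: date) -> dict[str, dict]:
    """For each document ID, keep the newest revision that is already in force."""
    current: dict[str, dict] = {}
    for rec in records:
        if rec["valid_from"] > today:
            continue  # announced but not yet valid
        best = current.get(rec["doc_id"])
        if best is None or rec["valid_from"] > best["valid_from"]:
            current[rec["doc_id"]] = rec
    return current


# Hypothetical revision history for one drawing.
records = [
    {"doc_id": "D-04-7812", "version": "rev2", "valid_from": date(2022, 3, 1)},
    {"doc_id": "D-04-7812", "version": "rev3", "valid_from": date(2024, 6, 1)},
    {"doc_id": "D-04-7812", "version": "rev4", "valid_from": date(2030, 1, 1)},
]
today = date(2026, 1, 1)
current = select_current(records, today)

# Revisions already in force but no longer current should be invalidated.
superseded = [
    r["version"] for r in records
    if r["valid_from"] <= today and r is not current[r["doc_id"]]
]
print(current["D-04-7812"]["version"], superseded)
```

The code only applies a rule; someone still has to own the records it reads, which is the organisational point of this section.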

Realistic expectations from technicians. A system that does not answer 10% of questions and says "I don't know" is a correctly configured system. If technicians expect 100% coverage, the first "I don't know" will be interpreted as a failure. Onboarding is part of the project.

Where the system fails — and what to do about it

Even a well-built system has limits. Honestly defining those limits before go-live protects the project from disappointment.

Procedural safety: The system should never be the sole source of truth for procedures involving live electrical work, work in ATEX zones, or other safety-critical operations. The article RAG vs. fine-tuning for industrial deployments discusses when fine-tuning is better than RAG for exactly these cases, but even a fine-tuned model does not replace certified training and review by a responsible person.

New equipment without documentation: If a piece of equipment is not in the index, the system will not answer. This is a feature, not a bug — the alternative would be hallucinated documentation that does not exist.

Complex diagnostics: Questions such as "why is my machine vibrating more than usual" go beyond document Q&A. This is the territory of predictive analytics and sensor data, not RAG over manuals.

Frequently asked questions

Do we need fine-tuning, or is RAG enough?

For most documentation use cases RAG is sufficient: it requires no dataset preparation, no training, and carries no risk of model degradation. Fine-tuning adds value if you want the model to understand your company's specific terminology or to respond in a particular format. A decision framework is in the article [RAG vs. fine-tuning](/en/blog/rag-vs-fine-tuning-rozhodovanie).

How do we ensure the system won't hallucinate technical values?

A combination of three measures: (1) a prompt that explicitly prohibits answering outside available sources and requires a citation, (2) a reranker that improves the quality of the retrieved context, (3) an evaluation set with regular testing. A hallucination rate below 2–3% is achievable with disciplined implementation — zero never is.

Can the system handle documents in German, English, and Slovak?

Yes, modern embedding models (e.g. BGE-M3) are multilingual and perform well across languages. Questions and answers can be in a different language from the source document. In practice we recommend indexing documents in their original language and letting the model translate in the answer — this preserves the accuracy of terminology.

How long does implementing a pilot take?

A pilot system for one document type (e.g. manuals for a single piece of equipment) with an evaluation set and a basic UI falls in the range of 4–8 weeks. Data preparation (OCR, cleaning, version management) takes longer than the technical implementation itself. A full production system with multiple document types and integration into existing systems: 3–6 months.

Does this work for ISO standards and regulatory documents?

Yes, but with an important caveat: standards are subject to copyright (ISO, EN, STN). You cannot simply upload a purchased standard into an internal system — verify the licensing terms. In practice, companies index their internal documents that implement the standard (technical specifications, checklists) and reference specific requirement numbers rather than quoting normative text verbatim.

*If you are considering the first steps with your specific documentation — validating the use case, assessing available data, proposing an architecture — we are available for a consultation. MP Industrial Solutions implements RAG over industrial documentation from initial data preparation through to production deployment.*