When we first deployed an LLM in a production environment for a manufacturing client, it took us three weeks to understand why the model would occasionally respond in English instead of Slovak, why it ignored formatting, and why it would invent a number out of thin air one time in ten. All three problems had the same solution: a systematic approach to prompts — not intuition, not trial and error, but a proven structure.
Prompt engineering is today one of the most frequently discussed, yet least correctly understood, disciplines in AI projects. When we ask technical directors what they expect from it, most say "better answers." That is true — but only up to a point. This article breaks down what prompt engineering can realistically achieve, where its limits lie, and what belongs to a different part of the toolbox.
What prompt engineering actually is
Prompt engineering is the systematic design and versioning of inputs for a language model with the goal of achieving consistent, predictable, and measurable outputs. The key word is consistent — not "occasionally brilliant."
In a production system it is not enough for a prompt to work 80% of the time. If you deploy it in customer support, customers will remember the 20% of cases that failed. If you deploy it in invoice processing, a 20% error rate is a disaster. Prompt engineering is therefore less about creativity and more about engineering discipline: documentation, versioning, testing, measurement.
What works — proven patterns
1. A clear instruction with an explicit output format
The single biggest influence on output quality is how precisely the model knows what it is supposed to produce. A vague instruction "summarise this document" gives vague results. A specific instruction works dramatically better:
- Tell the model who the target audience of the output is
- Define the exact format: number of bullet points, maximum length, whether the output should include a conclusion or not
- For structured data, always use structured outputs and JSON mode — a model constrained to a schema hallucinates significantly less than a model with free-text output
We have seen with clients that simply moving from "extract the data from the invoice" to "return a JSON object with the fields: supplier (string), registration number (string, 8 characters), amount (number), date (ISO 8601)" cut the extraction error rate by roughly an order of magnitude.
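To make the invoice example concrete, here is a minimal Python sketch of the specific instruction together with a validator for the model's reply. The key name `registration_number`, the prompt wording, and the helper `validate_invoice` are our own illustrative assumptions, not part of any client implementation; a native structured-output feature of your model provider would normally replace the hand-written checks.

```python
import json
from datetime import date

# Field names mirror the example in the text; "registration_number" is an
# assumed snake_case spelling of "registration number".
EXTRACTION_PROMPT = (
    "Return a JSON object with exactly these fields:\n"
    "- supplier (string)\n"
    "- registration_number (string, 8 characters)\n"
    "- amount (number)\n"
    "- date (ISO 8601, e.g. 2024-03-31)\n"
    "Return only the JSON object, with no commentary."
)


def validate_invoice(raw: str) -> list[str]:
    """Return a list of problems found in the model's raw reply (empty = valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if not isinstance(data.get("supplier"), str):
        problems.append("supplier must be a string")

    reg = data.get("registration_number")
    if not isinstance(reg, str) or len(reg) != 8:
        problems.append("registration_number must be a string of 8 characters")

    amount = data.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount must be a number")

    try:
        date.fromisoformat(data.get("date"))
    except (TypeError, ValueError):
        problems.append("date must be an ISO 8601 calendar date")
    return problems
```

An empty list means the reply can be passed downstream; anything else can be logged, retried, or routed to a human.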
2. Few-shot examples — when and how to use them
Few-shot prompting (a handful of input-output examples placed directly in the prompt) is one of the most reliable tools, but only when the examples are representative. The rules:
1. Examples must cover edge cases, not just the happy path
2. The format of examples must be consistent: if you use "Input:" once and "Query:" another time, the model will notice
3. The number of examples depends on task complexity: for simple classification 2–3 is enough, for multi-step reasoning 5–7
4. Choose examples from real data, not invented ones: synthetic examples tend to have different statistical properties than production inputs
Few-shot is particularly powerful for formatting and for tasks with a clear structural pattern. It is less effective for knowledge gaps where the model lacks domain context — the answer there is RAG or fine-tuning.
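One way to enforce the consistency rule is to render every example through the same template rather than writing the prompt by hand. The sketch below assumes a simple "Input:" / "Output:" layout; the `Example` class and `build_few_shot_prompt` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    input_text: str
    output_text: str


def build_few_shot_prompt(instruction: str, examples: list[Example], query: str) -> str:
    """Render every example with the same "Input:" / "Output:" labels."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts += [f"Input: {ex.input_text}", f"Output: {ex.output_text}", ""]
    # The final block repeats the pattern and leaves the output open for the model.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)
```

Because the labels live in one place, an accidental "Query:" in the fifth example cannot happen, and the examples themselves can be loaded from real production data.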
3. Role and persona
Assigning an explicit role works — not because the model "believes" it has become someone else, but because the role shifts the probability distribution of outputs towards more relevant patterns. "You are a senior lawyer specialising in Slovak employment law" activates different linguistic patterns than "You are an assistant."
Important: a role defines tone and vocabulary, but does not replace knowledge. A model given the role "ATEX directive expert" can still hallucinate if it has insufficient training data on ATEX directives. A role is a linguistic filter, not a knowledge database.
4. Chaining steps (Chain-of-Thought)
For complex analytical tasks — comparison, multi-step calculation, conditional decision-making — an explicit instruction to "proceed step by step" or splitting the task into multiple sub-prompts significantly improves reliability. Models make fewer errors when they can "think out loud" before giving a final answer.
In practice this means: do not ask the model for a conclusion directly. Let it first decompose the problem, identify the relevant factors, and only then formulate a conclusion. The result is more reliable and easier to verify — you can see how the model arrived at its conclusion.
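A possible way to put this into practice is to ask for the reasoning first and a clearly marked conclusion last, then split the two when parsing the reply. The `CONCLUSION:` marker and both function names below are assumptions made for this sketch; any unambiguous marker would do.

```python
import re
from typing import Optional

STEP_BY_STEP_SUFFIX = (
    "Proceed step by step:\n"
    "1. Decompose the problem.\n"
    "2. List the relevant factors.\n"
    "3. Only then state the conclusion on a final line starting with 'CONCLUSION:'."
)


def build_reasoning_prompt(task: str) -> str:
    """Append the step-by-step instruction to the task description."""
    return f"{task.strip()}\n\n{STEP_BY_STEP_SUFFIX}"


def split_reasoning(reply: str) -> tuple[str, Optional[str]]:
    """Split a reply into (reasoning, conclusion); conclusion is None if the marker is missing."""
    match = re.search(r"^CONCLUSION:\s*(.+)$", reply, flags=re.MULTILINE)
    if match is None:
        return reply.strip(), None
    return reply[: match.start()].strip(), match.group(1).strip()
```

Keeping the reasoning separate makes it easy to log it for review while passing only the conclusion to downstream code. A missing marker is itself a useful signal that the model ignored the format.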
5. Negative instructions — what not to do
Most prompts tell the model what to do. Equally important is telling it what not to do. "Never begin a response with 'Of course!'", "Do not use jargon where a plain equivalent exists", "If you cannot answer from the provided context, say so explicitly — do not speculate." Negative instructions reduce the frequency of specific recurring errors.
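Whether a negative instruction actually reduces a recurring error is something you can measure. The following sketch counts how often replies still begin with a banned opening; the function name and the default banned phrase are illustrative assumptions.

```python
def violation_rate(replies: list[str], banned_openings: tuple[str, ...] = ("Of course!",)) -> float:
    """Share of replies that start with a banned opening (ignoring leading whitespace)."""
    if not replies:
        return 0.0
    # str.startswith accepts a tuple and returns True if any prefix matches.
    bad = sum(1 for reply in replies if reply.lstrip().startswith(banned_openings))
    return bad / len(replies)
```

Running this on a sample of outputs before and after adding the instruction shows whether the rule had any effect.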
What does not work — myths
Myth 1: "Please" and rewards improve results
Short answer: no. Prompts like "for every correct answer you will receive $100" or a polite "please help me with…" have no measurable effect on production output quality. The model does not respond to social manipulation or emotion — it responds to statistical patterns in training data. Time spent "motivating" the model is time wasted.
Myth 2: A longer prompt = a better result
Length and quality are two different things. We have seen system prompts with thousands of tokens that performed worse than five-line ones. The problem is that models tend to "lose" instructions buried in the middle of a very long context, an effect known as "lost in the middle." Structure and clarity matter more than volume. If your prompt keeps growing, that is more likely a symptom that you are using a prompt to solve a problem that belongs in system design.
Myth 3: A good prompt replaces an eval
This is probably the most dangerous myth. A prompt without evaluation is a hypothesis, not a solution. In practice we see teams that spend a week tuning a prompt, deploy it to production, and discover a month later that 15% of outputs are incorrect — because they never measured systematically.
Eval (automated assessment of outputs on a test set) is a prerequisite for production deployment, not an optional add-on. More in the article on how to measure the quality of an LLM application.
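In its simplest form, an eval is a loop over a test set plus a number at the end. The sketch below is a minimal illustration under stated assumptions: exact-match scoring, a `model` callable that wraps whatever LLM client you use, and a `fake_model` stand-in so the example runs without network access. None of these names come from a specific library.

```python
from typing import Callable

# A case pairs an input with the exact output we expect.
Case = tuple[str, str]


def run_eval(model: Callable[[str], str], cases: list[Case]) -> dict:
    """Run every case through the model and report accuracy plus the failures."""
    failures = []
    for prompt_input, expected in cases:
        actual = model(prompt_input).strip()
        if actual != expected:
            failures.append({"input": prompt_input, "expected": expected, "actual": actual})
    passed = len(cases) - len(failures)
    accuracy = passed / len(cases) if cases else 0.0
    return {"total": len(cases), "passed": passed, "accuracy": accuracy, "failures": failures}


def fake_model(text: str) -> str:
    """Stand-in for a real LLM call; it labels anything mentioning 'invoice' as billing."""
    return "billing" if "invoice" in text.lower() else "other"
```

Exact match suits classification and extraction; free-text tasks need a different scoring function, but the shape of the loop stays the same. The list of failures is usually more valuable than the accuracy figure, because it tells you what to fix.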
Myth 4: Prompt engineering is a permanent solution
A prompt is a fragile artifact. When a provider releases a new version of the model, your prompt may behave differently — even if you have not changed it. We have seen cases where a model update improved general quality but broke specific formatting instructions that were "glued" to the behaviour of the older version.
Prompt engineering is iterative, not one-off work. If you do not realise this before deployment, you will realise it after.
Myth 5: The system prompt is a safe container for sensitive information
The system prompt is an instruction to the model, not a safe. Prompt extraction (extracting the contents of the system prompt) is a well-documented attack — there are techniques by which an attacker can pull the system prompt content out of the model through a normal user interface. The system prompt should never contain: passwords, API keys, internal price lists, personal employee data, or any information whose disclosure would be a problem. System prompts are for behavioural instructions, not secrets.
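A cheap safeguard is to lint system prompts for secret-like content before they are committed. The patterns below are deliberately simple and purely illustrative; they are our assumption of what such a check might look for and are no substitute for a dedicated secret scanner.

```python
import re

# Illustrative patterns only; a real secret scanner needs a far broader rule set.
SUSPICIOUS_PATTERNS = {
    "credential assignment": re.compile(
        r"(?i)\b(password|passwd|api[_-]?key|secret|token)\b\s*[:=]\s*\S+"
    ),
    "long opaque string": re.compile(r"\b[A-Za-z0-9_\-]{32,}\b"),
}


def find_secrets(system_prompt: str) -> list[str]:
    """Return the names of the patterns that match anywhere in the prompt."""
    return [name for name, pattern in SUSPICIOUS_PATTERNS.items() if pattern.search(system_prompt)]
```

A non-empty result would block the commit or the pull request until someone has looked at the prompt.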
Prompt as code — versioning and management
In the teams we work with, the same problem keeps appearing: the prompt is stored in code as a hardcoded string, nobody knows who last changed it or why, and when something goes wrong there is no history to roll back to.
A prompt is code. The same principles apply:
- Versioning in git — every change to a prompt is a commit with a meaningful commit message
- Semantic versioning — major version for a change that alters output behaviour; minor for an addition; patch for a typo fix
- Test suite — every version of a prompt has a set of test cases with expected outputs; a prompt change passes only if the tests are green
- Prompt register: a central place (can be a simple YAML in the repo, or a tool like Langfuse) where it is visible which prompt is deployed where, in which version, and who the owner is
Tools like Promptfoo allow automated prompt testing in a CI/CD pipeline — every pull request that modifies a prompt automatically runs the test suite before merge. This is the practice we recommend to every team that has more than one active prompt in production.
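The semantic versioning rule from the list above can be encoded so that the bump is a decision about the kind of change rather than a number someone types by hand. This is a sketch; the `Change` enum and `bump_version` helper are hypothetical names, and it assumes plain `MAJOR.MINOR.PATCH` version strings.

```python
from enum import Enum


class Change(Enum):
    BEHAVIOUR = "major"  # alters output behaviour
    ADDITION = "minor"   # adds something to the prompt
    TYPO = "patch"       # typo fix


def bump_version(version: str, change: Change) -> str:
    """Return the next version string for the given kind of prompt change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change is Change.BEHAVIOUR:
        return f"{major + 1}.0.0"
    if change is Change.ADDITION:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

Forcing the author to classify the change also makes the commit message easier to write.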
When a prompt is not enough — and what to use instead
Prompt engineering is a powerful tool in a narrow band of problems. It is weak where:
Knowledge is missing — the model does not know what it needs to know because it is not in the training data. A prompt cannot add knowledge. Solution: RAG (retrieval-augmented generation) for factual grounding, or fine-tuning for deep domain adaptation.
A consistent format is missing — the output must always conform to a precise schema. A prompt can help, but the more reliable solution is structured outputs (JSON schema bound to the model). More in the article on structured outputs and JSON mode.
Security is missing — if the system processes untrusted inputs (e.g. customer messages, external documents), prompt injection is a real risk. A prompt cannot prevent its own overwriting. You need architectural guardrails — input validation, action allowlists, sandboxing. More detail in the article on prompt injection and jailbreaks.
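As an illustration of an action allowlist, the sketch below refuses any model-proposed action that is not explicitly permitted, regardless of what the prompt or an injected instruction says. The action names, the shape of the `action` dictionary, and the `dispatch` function are assumptions made for this example.

```python
ALLOWED_ACTIONS = {"lookup_order", "create_ticket"}  # hypothetical action names


class ActionNotAllowed(Exception):
    """Raised when the model proposes an action outside the allowlist."""


def dispatch(action: dict, handlers: dict) -> str:
    """Execute a model-proposed action only if it is on the allowlist."""
    name = action.get("name")
    if name not in ALLOWED_ACTIONS or name not in handlers:
        raise ActionNotAllowed(f"refused action: {name!r}")
    args = action.get("args", {})
    return handlers[name](**args)
```

The check lives in ordinary code outside the model, so a successful prompt injection can at most request an action that was already considered safe.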
Measurement is missing — if you do not know whether a prompt is working, prompt engineering is just guessing. Without an eval, do not trust even an excellent prompt.
Practical prompt versioning — how to start
For teams that do not yet have systematic prompt management, we recommend a simple process:
1. Create a `prompts/` directory in the repository with one YAML file per prompt
2. Each YAML contains: `name`, `version`, `description`, `system_prompt`, `user_prompt_template`, `model`, `temperature`, `test_cases`
3. Changes go through a pull request: prompt review works the same as code review
4. Before deploying a new version: run the test suite against both the current and the new version, compare the metrics
5. After deployment: monitor results in production; anomalies (a sudden spike in error rate, changed output length) are a sign of regression
This is not overkill — it is the minimum. Teams that skip this discover problems in production instead of in testing.
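Two of the steps above lend themselves to a small automated check: confirming that a prompt file carries all the listed fields, and comparing the new version's metric with the current one. The sketch below works on a prompt record that has already been parsed into a dictionary (for example by a YAML library); the helper names and the single accuracy metric are assumptions.

```python
REQUIRED_FIELDS = (
    "name", "version", "description", "system_prompt",
    "user_prompt_template", "model", "temperature", "test_cases",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent from a parsed prompt record."""
    return [field for field in REQUIRED_FIELDS if field not in record]


def is_regression(current_accuracy: float, candidate_accuracy: float, tolerance: float = 0.0) -> bool:
    """True if the candidate version scores worse than the current one beyond the tolerance."""
    return candidate_accuracy < current_accuracy - tolerance
```

Both checks fit naturally into the pull request pipeline: the first fails fast on incomplete files, the second blocks a merge when the candidate prompt scores worse on the same test suite.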
Prompt engineering and models — what changes with the model
An important thing we regularly see ignored: different models respond differently to the same prompt. A prompt optimised for one model may perform significantly worse on another. A few concrete patterns:
- Instruction following: newer frontier models (Claude, GPT-4 class) are significantly better at following explicit instructions than older or smaller models
- Few-shot sensitivity: some models respond strongly to few-shot examples, others less so — it depends on size and training procedure
- Language quality: Slovak is a low-resource language; frontier models handle it well, but smaller open-weight models can degrade — always test in the target language
- Temperature: for deterministic tasks (extraction, classification) set `temperature=0` or very low, and outputs will be more consistent; for generative tasks (writing), set it slightly higher
When you change a model, always run the existing test suite. Do not assume the prompt "works the same."
Frequently asked questions
Is it worth investing time in prompt engineering, or is it better to go straight to fine-tuning?
Prompt engineering is the first step; fine-tuning is the next — not an alternative. If you have a well-designed prompt and are still unhappy with results despite few-shot examples and structure, investigate whether the problem is missing domain knowledge (solution: RAG) or inconsistency in output style (solution: fine-tuning). Fine-tuning without a well-designed prompt and without an eval is wasted time and money.
Can I use prompt engineering to ensure the model never responds on certain topics?
Not reliably. Prompt-only defences (an instruction "never respond on X") can be circumvented — that is the main conclusion of prompt injection and jailbreaking research. For production systems, architectural measures are essential: filtering inputs before they reach the model, validating outputs before displaying them, monitoring. An instruction in the prompt is the first layer, not the only one.
How do I know whether my prompt is good?
Only through measurement. Define a test set of cases (at minimum 20–50 for a simple task, more for complex ones) with expected outputs or evaluation criteria. Run the prompt on the test set and measure the relevant metric (accuracy, formatting, language correctness, factual correctness). Without a number, a "good prompt" is just a feeling.
How many tokens should a system prompt have?
There is no universal answer, but in practice we find that effective system prompts are typically in the 200–800 token range. Shorter ones are insufficiently specific; longer ones begin to suffer from the "lost in the middle" effect — the model loses track of instructions stored in the middle. If your prompt exceeds a thousand tokens without a clear reason, it is time to refactor it or consider whether some information belongs in RAG context rather than in the system prompt.
Does the same apply to open-weight models deployed locally?
Most principles apply, but with an important caveat: smaller open-weight models (7B–13B parameters) are less robust at following instructions than frontier models. Few-shot examples matter more, prompt formatting must be more precise, and the test suite must be tailored to the specific model family. A prompt designed for frontier models may not work equally well on Mistral 7B or similarly sized open-weight models — even after careful optimisation. Slovak also presents an extra challenge for smaller models — always test in the production language.
*Prompt engineering is a craft, not magic — and like any craft it can be learned, systematised, and measured. If you are working on deploying an LLM to production and are not sure where to start, we are happy to assess your specific case and help you establish the foundations to build on.*
