When a company decides to fine-tune its own LLM, the first question is usually not "how?" but "do we have the hardware for it?" The answer depends entirely on which method you choose. The gap between full fine-tuning and QLoRA can exceed a factor of ten in VRAM for a 7B model (roughly 67 GB versus 5 GB), which determines whether training runs on a single RTX 4090 or requires a rented A100 cluster.
This article walks through the three main approaches — LoRA, QLoRA, and full fine-tuning — from a practical deployment perspective. You'll get concrete numbers, a decision framework, and warnings about situations where the cheaper method simply isn't enough.
Key concepts for those just getting started
Before we get to the numbers, a brief grounding:
Full fine-tuning updates all model weights. For a 7B model, that means training 7 billion parameters — and keeping gradients and optimizer states in memory alongside them.
LoRA (Low-Rank Adaptation) freezes the original weights and adds small "adapter" matrices with a low rank. Training only touches these adapters, which are orders of magnitude smaller than the original model.
QLoRA (Quantized LoRA) goes one step further: the frozen base model weights are quantized to 4-bit, radically reducing their memory footprint. The LoRA adapters remain in higher precision (BF16). Training runs on dequantized values — which introduces a slight slowdown, but a massive VRAM saving.
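To make the effect of quantizing the frozen weights concrete, here is a minimal sketch of the weight-only footprint at different precisions. It assumes decimal gigabytes and ignores the small overhead of quantization constants, so treat the numbers as approximate; the function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # 1e9 parameters * N bytes each = N GB, so the billions cancel out.
    return params_billion * bytes_per_weight


for bits in (16, 8, 4):
    print(f"7B weights at {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
```

This covers only the weights. Gradients, optimizer states, and activations come on top during training, which is why the training figures later in the article are higher.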
The numbers that matter: VRAM by method
These are verified figures for dense models (not MoE architectures such as Llama 4 or the Qwen3 MoE variants, where requirements are significantly higher because all experts must be loaded):
For a 7B model in FP16/BF16:
- Full fine-tuning: ~67 GB VRAM
- LoRA: ~15 GB VRAM
- QLoRA 8-bit: ~9 GB VRAM
- QLoRA 4-bit: ~5 GB VRAM
For a 13B model:
- Full fine-tuning: ~125 GB VRAM
- LoRA: ~28 GB VRAM
- QLoRA 8-bit: ~17 GB VRAM
- QLoRA 4-bit: ~9 GB VRAM
For a 70B model:
- Full fine-tuning: ~672 GB VRAM, practically out of reach for a single-node setup
- LoRA: ~146 GB VRAM, minimum 2× A100 80GB
- QLoRA 8-bit: ~88 GB VRAM, slightly more than a single A100 80GB offers, so it does not fit on one card without offloading
- QLoRA 4-bit: ~46 GB VRAM, one A100 80GB comfortably
Rule of thumb for full FT: budget roughly 10–16 GB VRAM per billion parameters. The lower end of the range (~10 GB, matching the figures above) applies with an 8-bit optimizer and gradient checkpointing, the upper end (~16 GB) with a classic 32-bit Adam. The figure includes optimizer states, gradients, and activations.
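One way to see where a 10–16 GB range can come from is to add up the bytes held per parameter. The breakdowns below are assumptions for illustration (mixed-precision training with FP32 master weights, activations left out), not a reproduction of how the figures in this article were measured.

```python
# Bytes held per parameter during full fine-tuning, activations excluded.
# Both breakdowns are illustrative assumptions, not measurements.
CLASSIC_ADAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}
EIGHT_BIT_ADAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "8-bit Adam first moment": 1,
    "8-bit Adam second moment": 1,
}


def gb_per_billion_params(breakdown: dict) -> int:
    # 1e9 parameters * N bytes = N GB (decimal), so the sum is the answer.
    return sum(breakdown.values())


def full_ft_estimate_gb(params_billion: float, breakdown: dict) -> float:
    return params_billion * gb_per_billion_params(breakdown)


print(gb_per_billion_params(CLASSIC_ADAM), "GB per billion with 32-bit Adam")
print(gb_per_billion_params(EIGHT_BIT_ADAM), "GB per billion with 8-bit Adam")
print(full_ft_estimate_gb(7, EIGHT_BIT_ADAM), "GB for a 7B model, 8-bit Adam")
```

Activation memory depends on batch size, sequence length, and whether gradient checkpointing is on, so real runs can land above or below this back-of-the-envelope result.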
One important detail: fine-tuning requires far more memory than inference on the same model. A quantized 7B model runs inference in about 8 GB, but LoRA training on it needs ~15 GB. This is the most common surprise in practice: "the model runs fine, so why does training crash?"
Where each method makes sense
QLoRA 4-bit is the starting point for anyone without a dedicated training server. A 7B model in QLoRA 4-bit runs comfortably on an RTX 3090 or RTX 4090 (24 GB), and with gradient checkpointing (an additional ~30% VRAM saving at the cost of ~2% slower training) it can even manage on an RTX 3060 12 GB. For exploratory work, proof-of-concept, and instruction tuning on small datasets (thousands of examples), QLoRA 4-bit is the sensible choice.
LoRA in FP16 is a step up in both quality and requirements. For a 7B model you need ~15 GB — so an RTX 4090 or A100 40GB. Quality is roughly 90–95% of full fine-tuning. For most domain tasks (classification, extraction, instruction tuning on company documentation) LoRA FP16 is the optimal compromise.
Full fine-tuning is recommended in three specific cases:
1. Reasoning, mathematics, and coding, where a 5% quality difference is significant
2. Continual learning, meaning sequential fine-tuning across multiple domains (where LoRA has systemic weaknesses, see below)
3. Fundamentally changing model behaviour, not just domain adaptation
For enterprise deployments in the "chatbot over internal docs" or "report classification" category, full fine-tuning is usually unnecessary and expensive. We have seen dozens of projects where QLoRA or LoRA achieved the required quality at a fraction of the cost.
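The hardware side of this choice can be sketched as a simple lookup. The VRAM figures are the ones quoted in this article for dense models; the `headroom` parameter (a safety margin for fragmentation and longer sequences) and the function name are assumptions added for illustration.

```python
# Approximate training VRAM in GB for dense models, as quoted in this article.
VRAM_GB = {
    "7B": {"full": 67, "lora": 15, "qlora-8bit": 9, "qlora-4bit": 5},
    "13B": {"full": 125, "lora": 28, "qlora-8bit": 17, "qlora-4bit": 9},
    "70B": {"full": 672, "lora": 146, "qlora-8bit": 88, "qlora-4bit": 46},
}


def methods_that_fit(model_size: str, gpu_vram_gb: float, headroom: float = 0.1) -> list:
    """Return the methods whose estimated VRAM fits on one GPU.

    `headroom` is a hypothetical safety margin: 0.1 keeps 10% of VRAM free.
    """
    budget = gpu_vram_gb * (1 - headroom)
    return [m for m, need in VRAM_GB[model_size].items() if need <= budget]


print(methods_that_fit("7B", 24))   # a 24 GB consumer card
print(methods_that_fit("13B", 24))
print(methods_that_fit("70B", 80))  # a single 80 GB accelerator
```

A lookup like this only answers "does it fit?". Whether the cheaper method is good enough for the task is a separate question.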
A practical decision framework
Answering these questions in order will narrow down your options quickly:
1. Which model are you planning to train? Dense 7B or 13B: you have flexibility. Dense 70B: QLoRA 4-bit on an A100 80GB is the realistic entry point. MoE models (e.g. Llama 4, Qwen3): VRAM requirements are significantly higher than the active parameter count suggests — verify concrete numbers for each specific model before planning hardware.
2. What is your goal? Instruction tuning, domain adaptation, tone adjustment → LoRA or QLoRA. Reasoning, coding, math → consider full FT or at least higher-rank LoRA + DoRA. Sequential continual learning → full FT (LoRA has documented problems here).
3. Do you have enough data? For supervised fine-tuning (SFT) the recommended minimum is thousands of high-quality examples that cover the domain's key topics. Training on insufficient data produces a model that answers confidently even where its training data has gaps. That is a worse outcome than a base model that can at least say "I don't know." The data quality gate matters more than the choice of method.
4. What is your deployment plan? QLoRA 4-bit is used during training, not in production. After training, the adapter is merged into the base model and deployed at normal precision (BF16/FP16), or re-quantized for production serving (GGUF for llama.cpp, AWQ/GPTQ for vLLM). "QLoRA production model" is a misnomer you will encounter in documentation and project plans alike.
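The goal and data questions can be condensed into a small helper. This is a sketch of the framework above, not a rule engine: the goal labels and the `min_examples` threshold of 1000 are assumptions chosen for illustration (the article only says "thousands").

```python
ADAPTER_GOALS = {"instruction_tuning", "domain_adaptation", "tone_adjustment"}
HARD_GOALS = {"reasoning", "coding", "math"}


def recommend_method(goal: str, n_examples: int, min_examples: int = 1000) -> str:
    """Map a training goal to a starting method, checking the data gate first."""
    if n_examples < min_examples:
        # The data quality gate comes before any choice of method.
        return "collect more data first"
    if goal in ADAPTER_GOALS:
        return "LoRA or QLoRA"
    if goal in HARD_GOALS:
        return "full FT, or higher-rank LoRA + DoRA"
    if goal == "continual_learning":
        return "full FT"
    raise ValueError(f"unknown goal: {goal!r}")


print(recommend_method("domain_adaptation", n_examples=5000))
print(recommend_method("coding", n_examples=20000))
print(recommend_method("tone_adjustment", n_examples=200))
```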
LoRA hyperparameters: where the tuning room is
Rank (r) is the main lever. Higher rank = more trainable parameters = potentially better quality, but also more memory and a higher risk of overfitting on small datasets.
Proven recommendations:
- rank=4 to rank=8: simple tasks (classification, templated responses), small datasets
- rank=16 to rank=32: more complex instruction tuning, domain adaptation
- rank=64 and above: only when you have large data and hardware headroom; higher rank also increases the risk of "intruder dimensions" (see below)
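Alpha (α) is typically set equal to rank, or at 2× rank. Standard LoRA scales the adapter update by α/r; the rsLoRA variant uses the theoretically grounded factor α/√r instead, which keeps the update from shrinking as rank grows.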
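A short sketch of how alpha and rank combine into the factor that multiplies the adapter update: standard LoRA uses α/r, rsLoRA uses α/√r. The function names are illustrative.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Scaling factor applied to the adapter update in standard LoRA."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Scaling factor in rank-stabilized LoRA (rsLoRA)."""
    return alpha / math.sqrt(rank)


for rank in (4, 16, 64):
    alpha = 2 * rank  # the common "alpha = 2x rank" convention
    print(rank, lora_scale(alpha, rank), rslora_scale(alpha, rank))
```

With alpha tied to rank, the standard factor stays constant across ranks, while the rsLoRA factor grows with the square root of the rank. If you switch variants, revisit alpha rather than carrying the old value over.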
Target modules: we recommend training on all-linear layers — q_proj, k_proj, v_proj, o_proj plus gate_proj, up_proj, down_proj. Restricting to attention layers only is an older 2023 pattern that newer frameworks have moved past.
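To get a feel for what rank and target modules cost, the sketch below counts trainable LoRA parameters. It assumes Llama-2-7B-style shapes (hidden size 4096, MLP size 11008, 32 layers); other 7B models differ, especially those with grouped-query attention.

```python
# Assumed Llama-2-7B-style dimensions; adjust for your model.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32

# (in_features, out_features) of each linear layer in one transformer block.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}
ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj", "o_proj"]
ALL_LINEAR = list(SHAPES)


def lora_params(rank: int, modules: list) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_block = sum(rank * (SHAPES[m][0] + SHAPES[m][1]) for m in modules)
    return per_block * LAYERS


for rank in (8, 16, 64):
    print(rank, lora_params(rank, ATTENTION_ONLY), lora_params(rank, ALL_LINEAR))
```

Under these assumptions, rank 16 on all linear layers trains about 40 million parameters, well under 1% of the 7 billion in the base model, and the count grows linearly with rank.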
DoRA and other PEFT variants
DoRA (Weight-Decomposed LoRA) decomposes the weight update into magnitude and direction. In practice it closes roughly half the quality gap between LoRA and full FT at an overhead of only +5–10% VRAM. Available in the PEFT library, Unsloth, and Axolotl. For projects where LoRA falls short but full FT is too costly, DoRA is the natural next step.
GaLore optimises training directly in a low-rank gradient space — without an explicit adapter matrix. Results are comparable to full FT at significantly lower memory, but deployment is technically more demanding.
For typical enterprise projects our recommendation is: start with LoRA or QLoRA; if that is not sufficient, try DoRA before jumping to full FT.
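As a rough illustration of the "LoRA first, then DoRA" path, the sketch below builds configuration values as plain dictionaries. The keys are meant to mirror keyword arguments of `peft.LoraConfig` and `transformers.BitsAndBytesConfig` as we understand them; verify them against the library versions you have installed. The dropout value and the helper name are assumptions.

```python
def lora_config_kwargs(rank: int = 16, use_dora: bool = False, use_rslora: bool = False) -> dict:
    """Keyword arguments intended for peft.LoraConfig (check your PEFT version)."""
    return {
        "r": rank,
        "lora_alpha": 2 * rank,          # the "alpha = 2x rank" convention
        "lora_dropout": 0.05,            # assumed value, tune per dataset
        "target_modules": "all-linear",  # adapt every linear layer
        "use_dora": use_dora,            # switch from LoRA to DoRA
        "use_rslora": use_rslora,        # alpha / sqrt(r) scaling
        "task_type": "CAUSAL_LM",
    }


# Keyword arguments intended for transformers.BitsAndBytesConfig (QLoRA 4-bit).
QLORA_QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_use_double_quant": True,
}

baseline = lora_config_kwargs()
dora = lora_config_kwargs(use_dora=True)
print(baseline)
print(dora)
```

In a real project these would be passed along the lines of `LoraConfig(**lora_config_kwargs(use_dora=True))`. Keeping the two variants one flag apart makes it cheap to run them side by side on the same data.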
What we know about quality: intruder dimensions
The research paper "LoRA vs Full Fine-tuning: An Illusion of Equivalence" (2024) raised an important warning: LoRA and full FT achieve superficially similar results, but through different mechanisms.
LoRA creates so-called "intruder dimensions" — new high-ranking singular vectors that do not exist in the base model. These interfere with the original representation space, which manifests during continual learning — that is, when you train the model sequentially on multiple domains. LoRA shows significantly more catastrophic forgetting in this scenario than full FT.
For a one-time fine-tune on a single domain (which covers 90% of enterprise use cases) this limitation is practically irrelevant. However, if you plan to continuously refine the model on new data or new domains, full FT or continual learning with a replay buffer is the safer approach.
Related article: SFT, DPO, GRPO — which method and when
Frameworks in 2026
Four main tools cover the full stack and compete with each other on details:
`Unsloth` is the fastest single-GPU framework — 2–5× faster than the standard HuggingFace pipeline with ~70% VRAM savings compared to full fine-tuning. Triton-fused kernels, gradient checkpointing on by default, support for Qwen3, Llama 4, and MoE architectures. Open-source for single-GPU; multi-GPU distributed training requires an Unsloth Pro subscription.
`TRL` from HuggingFace is the standard for alignment methods — SFT, DPO, GRPO, and ORPO in a single library. Well documented and integrated into the Transformers ecosystem.
`Axolotl` is a YAML-driven production pipeline — well suited for scaling to multiple GPUs and handling complex pipelines (QAT, sequence parallelism, reward modeling). For projects that need a reproducible, versioned training setup.
`LLaMA-Factory` combines a GUI and CLI and supports the widest range of models — well suited for teams where training is not run exclusively by ML engineers.
Practical choice: for an internal proof-of-concept start with Unsloth (lowest barrier to entry, fastest iteration). For a production pipeline with scaling, consider Axolotl. For alignment experiments (DPO/GRPO) reach for TRL.
More on evaluating results: How to measure whether fine-tuning helped
What to do with a trained adapter
A LoRA adapter is a small file (typically a few to tens of MB) that can be shared and versioned independently of the base model. For deployment you have two options:
Merge into the base model: calling merge_and_unload() on the PEFT model folds the adapter matrices into the original weights. The result is a fully self-contained model with no runtime overhead. Recommended for production.
Runtime loading: the adapter is loaded dynamically at inference time. Flexible for experimentation (multiple adapters for the same base model), but adds latency.
After merging you can quantize the model for more efficient serving — GGUF for llama.cpp and Ollama, AWQ or GPTQ for vLLM. This is the standard production flow: train with QLoRA 4-bit → merge adapter → quantize for serving.
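What merging does mathematically can be shown on a toy example: the merged weight is W + (α/r)·B·A, and it produces the same output as running the base layer and the adapter separately. This is a pure-Python sketch with made-up 2×2 numbers, not the PEFT implementation.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply(w, x):
    """Matrix-vector product: one linear layer without bias."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def merge(w, lora_b, lora_a, alpha, rank):
    """Fold the adapter into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


def adapter_forward(w, lora_b, lora_a, alpha, rank, x):
    """Unmerged path: base output plus the scaled low-rank update."""
    base = apply(w, x)
    update = apply(lora_b, apply(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (toy values)
B = [[1.0], [2.0]]             # adapter matrix B, shape 2 x 1
A = [[0.5, -1.0]]              # adapter matrix A, shape 1 x 2
x = [3.0, 1.0]

merged = merge(W, B, A, alpha=2, rank=1)
print(merged)
print(apply(merged, x), adapter_forward(W, B, A, 2, 1, x))
```

Because the two paths give identical outputs, the merged model needs no adapter logic at inference time, which is what makes it a clean input for the quantization step.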
Related topics: LLM Quantization (GGUF, AWQ, GPTQ) and Small fine-tuned model vs large base model
Frequently asked questions
Is QLoRA just a cheaper version of LoRA? Not exactly. QLoRA quantizes the frozen base model weights to 4-bit while the LoRA adapters remain in BF16. The mechanism is different and training is ~20–30% slower due to dequantization overhead. QLoRA is not free — it is a trade-off between VRAM and speed, not simply a discount on quality.
Can I deploy a QLoRA 4-bit model directly to production? Not directly. The "QLoRA 4-bit" description refers to the training process, not the production model. After training, the adapter is merged into the base model and deployed at standard precision, or re-quantized for efficient serving (GGUF, AWQ). "QLoRA production model" is a misnomer.
Why does full fine-tuning need so much more VRAM than inference? Because training keeps far more in memory than just the weights: gradients (same size as the weights), optimizer states (with Adam, 2× the weights on top of that), and activations for backpropagation. A quantized 7B model runs inference in about 8 GB, LoRA training on it needs ~15 GB, and full FT needs ~67 GB.
When is LoRA not enough and full fine-tuning required? In three cases: (1) reasoning, mathematics, coding — where a 5% quality difference is significant; (2) continual learning with sequential fine-tuning across multiple domains (LoRA shows higher catastrophic forgetting here); (3) fundamental changes to model behaviour beyond domain adaptation.
Does a higher LoRA rank always mean better results? Not automatically. Higher rank increases memory and training time, and on small datasets it mostly raises the risk of overfitting. Higher rank also produces more "intruder dimensions," which can increase catastrophic forgetting. For instruction tuning, rank=16 or rank=32 is typically sufficient.
Conclusion
*The choice between LoRA, QLoRA, and full fine-tuning is first and foremost a hardware decision — and only secondarily a question of quality. For the vast majority of enterprise use cases (domain chatbots, classification, document extraction) LoRA or QLoRA will reach the required quality at a fraction of the cost. Reserve full fine-tuning for situations where those five percentage points genuinely matter. At MP Industrial Solutions we help companies make this choice based on concrete data — from dataset analysis and hardware selection through to deployment. If you are considering your first fine-tune or want to check whether your current setup makes sense, we are happy to take a look at your case.*
