Most discussions about LLM customisation revolve around fine-tuning: gather data, run training, wait hours or days, evaluate. Yet there is an entire category of techniques that bypasses that cycle entirely: model merging — combining the weights of several trained models directly in parameter space, with zero training iterations. No GPU training, no gradient descent. Just arithmetic over weights.
It sounds like a shortcut that cannot possibly work. In practice, it works surprisingly well — if you know when and how to use it. This article explains the three main methods (SLERP, TIES, DARE), where merging differs from distillation and ensembling, and what the realistic benefits and limits are for companies considering more advanced work with open-weight models.
What model merging is — and what it is not
Before moving to the methods, an important distinction:
Model merging combines the weights of two or more models that share the same architecture and tokeniser, directly in parameter space. The result is a single model whose weights are some combination of the source models. It requires no training data, no GPU during the merge itself (only RAM to load the weights), and no gradient.
Distillation is different: a larger teacher model generates synthetic responses, on which a smaller student model is trained. Distillation requires training — merging does not. The two approaches can complement each other, but they are not interchangeable. If distillation interests you, it is covered separately in Model Distillation.
Ensembling is different again: multiple models run in parallel at inference time and their outputs are combined by voting or averaging. Ensembling is more expensive to serve, because you are running several models; merging produces a single model with the inference cost of one.
Merging therefore occupies its own category: it combines capabilities without training costs, but at the price of less predictable outcomes.
Why does merging work at all?
The intuition behind merging is that models with the same architecture, fine-tuned from the same (or a closely related) base model, stay close to each other in parameter space. Fine-tuning on domain A shifts the weights in one direction; fine-tuning on domain B shifts them in another. A linear combination of those shifts should, in principle, capture both capabilities, unless the shifts cancel each other out.
That "unless" is the core problem that more advanced methods like TIES and DARE address.
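To make the cancellation problem concrete, here is a minimal sketch with three toy parameters. The helper names `task_vector` and `merge_linear` and all the numbers are invented for illustration; real merges operate on full weight tensors.

```python
import numpy as np

def task_vector(finetuned: np.ndarray, base: np.ndarray) -> np.ndarray:
    """The shift that fine-tuning added on top of the base weights."""
    return finetuned - base

def merge_linear(base, finetuned_models, weights):
    """Base weights plus a weighted sum of the fine-tuning shifts."""
    merged = base.copy()
    for model, w in zip(finetuned_models, weights):
        merged += w * task_vector(model, base)
    return merged

# Toy example: three parameters, all starting at zero in the base model.
base = np.array([0.0, 0.0, 0.0])
model_a = np.array([0.5, 0.0, 0.25])    # fine-tune A moved parameters 0 and 2
model_b = np.array([0.0, 0.5, -0.25])   # fine-tune B moved 1 and 2, with opposite sign on 2

merged = merge_linear(base, [model_a, model_b], weights=[1.0, 1.0])
# Parameters 0 and 1 keep their shifts; parameter 2 cancels out to zero.
print(merged)
```

Parameters that only one fine-tune touched survive the sum, while the parameter both touched in opposite directions ends up back at its base value.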
The three main methods
SLERP — spherical linear interpolation
SLERP (Spherical Linear Interpolation) is the simplest and oldest method. Originally used to interpolate rotations in 3D graphics, it is applied here to parameter vectors.
Instead of a linear average ((weights_A + weights_B) / 2), SLERP interpolates along a geodesic arc on the hyperspherical surface of parameter space. The result preserves differences in weight directions better than a straight linear average.
In practice:
- Works exclusively with two models — not three or more.
- A single parameter t (0.0 = pure model A, 1.0 = pure model B, 0.5 = midpoint) controls how close the result is to either source.
- The result is sensitive to the choice of t — the optimal value varies by model pair.
- Well-suited for gently "softening" the difference between two fine-tunes (e.g. one model is stronger on style, the other on factual accuracy).
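The interpolation itself fits in a few lines. The sketch below is one common formulation, not the code of any particular tool: it measures the angle between the two flattened weight tensors and interpolates along the arc, falling back to a linear blend when the tensors are almost parallel. The function name `slerp` and the fallback threshold are assumptions of this example.

```python
import numpy as np

def slerp(t: float, w_a: np.ndarray, w_b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Spherical interpolation between two weight tensors of the same shape."""
    a = w_a.ravel().astype(np.float64)
    b = w_b.ravel().astype(np.float64)

    # The angle is measured between the normalised directions.
    a_unit = a / max(np.linalg.norm(a), eps)
    b_unit = b / max(np.linalg.norm(b), eps)
    omega = np.arccos(np.clip(np.dot(a_unit, b_unit), -1.0, 1.0))
    sin_omega = np.sin(omega)

    if sin_omega < 1e-6:
        # Nearly parallel (or anti-parallel): the arc is degenerate, blend linearly.
        out = (1.0 - t) * a + t * b
    else:
        out = (np.sin((1.0 - t) * omega) / sin_omega) * a \
            + (np.sin(t * omega) / sin_omega) * b
    return out.reshape(w_a.shape)

# Two orthogonal unit vectors: the SLERP midpoint keeps unit length,
# whereas the plain average (0.5, 0.5) would shrink to about 0.71.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
midpoint = slerp(0.5, a, b)
print(midpoint, np.linalg.norm(midpoint))
```

In a real merge this would be applied tensor by tensor, which is also why `t` can in principle differ between layers.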
TIES — interference handling
TIES (Trim, Elect Sign, Disjoint Merge) addresses a problem that SLERP ignores: when you naively combine several fine-tunes, their changes in parameter space can conflict — some parameters are shifted positively in model A and negatively in model B, so they cancel out on average and the capability is lost.
TIES resolves this in three steps:
1. Trim (prune small changes): for each model, only the largest-magnitude delta parameters (deviations from the base model) are retained. Small changes are mostly noise.
2. Elect Sign (choose direction): for each parameter, the sign with the larger total magnitude across models becomes the elected direction of change. Deltas that point the other way are ignored for that parameter.
3. Disjoint Merge (combine): each parameter is averaged only over the deltas that survived trimming and agree with the elected sign.
TIES works with three or more models, making it suitable for building "polyglot" models from several domain-specific fine-tunes — at the cost of higher configuration complexity.
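The three TIES steps can be sketched on flat parameter vectors as follows. This is a simplified illustration, not mergekit's implementation: `ties_merge`, `density` (the fraction of deltas kept per model) and `lam` (a scaling factor for the merged delta) are names chosen for this example.

```python
import numpy as np

def ties_merge(base, finetuned, density=0.2, lam=1.0):
    """Simplified TIES merge for 1-D weight vectors that share one base model."""
    deltas = np.stack([m - base for m in finetuned])   # shape: (n_models, n_params)

    # 1. Trim: per model, keep only the top `density` fraction of deltas by magnitude.
    k = max(1, int(round(density * deltas.shape[1])))
    trimmed = np.zeros_like(deltas)
    for i, d in enumerate(deltas):
        keep = np.argsort(np.abs(d))[-k:]
        trimmed[i, keep] = d[keep]

    # 2. Elect sign: the sign of the summed deltas is the direction with more total mass.
    elected = np.sign(trimmed.sum(axis=0))

    # 3. Disjoint merge: average only the non-zero deltas that agree with the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    count = agree.sum(axis=0)
    summed = (trimmed * agree).sum(axis=0)
    merged_delta = np.divide(summed, count, out=np.zeros_like(summed), where=count > 0)

    return base + lam * merged_delta

# Toy example: the two fine-tunes disagree on the sign of parameter 2.
base = np.zeros(4)
model_a = np.array([1.0, 0.0, 0.5, 0.01])
model_b = np.array([0.0, 1.0, -0.2, 0.01])
print(ties_merge(base, [model_a, model_b], density=0.5))
```

In the toy example the tiny shift on the last parameter is trimmed away, and on parameter 2 only model A's delta is used because its magnitude wins the sign election, so the conflicting delta from model B does not dilute it.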
DARE — redundancy reduction
DARE (Drop And REscale) takes a different approach: before merging, it randomly drops (sets to zero) a large fraction p of each model's delta parameters, typically 80–90 % of them, and rescales the survivors by 1/(1 − p) so that the expected value of each delta stays the same. The intuition: most delta parameters are redundant or interfering, so keeping only a small fraction and rescaling produces a comparable or better result.
In practice, DARE is often combined with TIES (DARE+TIES): DARE sparsifies the deltas of each source model before TIES applies its sign-election logic. In mergekit this combination is available as the `dare_ties` merge method.
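The drop-and-rescale step is short enough to show in full. The sketch below assumes the delta (fine-tuned weights minus base weights) has already been computed; the function name `dare` and the parameter `drop_rate` are this example's own.

```python
import numpy as np

def dare(delta: np.ndarray, drop_rate: float = 0.9, rng=None) -> np.ndarray:
    """Randomly zero out a fraction of the deltas and rescale the survivors."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(delta.shape) >= drop_rate   # True for roughly (1 - drop_rate) of entries
    # Dividing by (1 - drop_rate) keeps the expected value of every delta unchanged.
    return delta * keep_mask / (1.0 - drop_rate)

# Toy check: after dropping about 90 % of the entries, the mean delta is still close to 1.
rng = np.random.default_rng(0)
delta = np.ones(100_000)
sparse = dare(delta, drop_rate=0.9, rng=rng)
print(f"kept {np.count_nonzero(sparse) / sparse.size:.1%}, mean {sparse.mean():.3f}")
```

The surviving entries are ten times larger than before, which is what keeps the overall shift from the base model at roughly the same strength.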
Task Arithmetic and other variants
mergekit and the research community also implement additional methods:
- Task Arithmetic: summing "task vectors" (deltas from the base model) with weighting — a simple foundation from which TIES and DARE derive.
- Passthrough: certain layers are taken directly from one model, others from another — unprincipled, but sometimes surprisingly effective when models have unequally strong sections.
mergekit — the tool that ties it together
For practical use, `mergekit` is the de facto standard. It is configured via YAML files, which makes recipes reproducible and versionable. A minimal SLERP configuration:
    merge_method: slerp
    # SLERP takes exactly two models; mergekit expects base_model to be one of them.
    base_model: ./my-finetune-A
    models:
      - model: ./my-finetune-A
      - model: ./my-finetune-B
    parameters:
      t: 0.5
    dtype: bfloat16

mergekit can handle most merges on CPU with sufficient RAM (for 7B models in BF16, roughly 30–40 GB RAM, no VRAM required). The merge itself completes in minutes.
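The RAM figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below only counts weight storage and brackets the requirement between two simple assumptions: the input models are held in memory while the output is written out as it is produced (lower bound), or inputs and output are all resident at once (upper bound). Actual usage depends on the tool and its loading strategy, so treat the numbers as a rough guide; both function names are invented for this example.

```python
def weights_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of one set of weights in GB (BF16 uses 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def merge_ram_range_gb(n_params: float, n_inputs: int = 2, bytes_per_param: int = 2):
    """Rough (lower, upper) RAM estimate for merging `n_inputs` models."""
    one_model = weights_size_gb(n_params, bytes_per_param)
    lower = n_inputs * one_model          # inputs resident, output streamed to disk
    upper = (n_inputs + 1) * one_model    # inputs and merged output all resident
    return lower, upper

low, high = merge_ram_range_gb(7e9)
print(f"7B, BF16, two inputs: about {low:.0f} to {high:.0f} GB")
```

For two 7B models this brackets the 30–40 GB figure quoted above, and it makes clear why the requirement grows with both model size and the number of source models.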
The relatively new tokensurgeon feature enables cross-tokeniser weight transplantation — opening the door to merging models from different families (e.g. Qwen and Llama), though with significantly less predictable results and a need for thorough evaluation.
For those who want to skip manual parameter tuning: evolutionary merging exists (Mergenetic and similar tools), where the optimal recipe is found automatically via evolutionary algorithms — the merge runs dozens of iterations with different parameter combinations, each evaluated against a small benchmark set. This method is slower (hours rather than minutes) but reduces dependence on expert intuition.
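The principle behind evolutionary merging can be illustrated with a toy search over a single parameter. This is a deliberately simplified sketch, not how Mergenetic or any other tool is implemented: `evolve_t` mutates the best `t` found so far and keeps a candidate only if it scores higher, and `toy_score` is a made-up stand-in for the expensive step of running the merge and evaluating it on a benchmark set.

```python
import random

def evolve_t(score_fn, generations=20, population=8, sigma=0.1, seed=0):
    """Tiny evolutionary search for the interpolation weight t in [0, 1]."""
    rng = random.Random(seed)
    best_t = 0.5
    best_score = score_fn(best_t)
    for _ in range(generations):
        # Mutate the current best recipe and clamp the candidates to the valid range.
        candidates = [
            min(1.0, max(0.0, best_t + rng.gauss(0.0, sigma)))
            for _ in range(population)
        ]
        for t in candidates:
            score = score_fn(t)   # in practice: merge with t, then run the benchmark
            if score > best_score:
                best_t, best_score = t, score
    return best_t, best_score

def toy_score(t: float) -> float:
    """Hypothetical benchmark score that happens to peak at t = 0.3."""
    return 1.0 - (t - 0.3) ** 2

best_t, best_score = evolve_t(toy_score)
print(f"best t = {best_t:.3f}, score = {best_score:.4f}")
```

Each call to the scoring function corresponds to one full merge plus evaluation, which is why the real procedure takes hours rather than minutes.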
When merging makes sense
In practice, merging is justified in several specific situations:
Combining capabilities from multiple fine-tunes. You have a model fine-tuned on customer communications and another on technical documentation. You want one model that handles both. Instead of another training run on mixed data, try a merge first — if the capabilities are not in conflict, the result can be comparable.
Rapid exploration in an early stage. Before investing hours of training into every combination of hyperparameters and data mixes, merging lets you quickly explore the space of possibilities. Several merges from existing checkpoints cost less than several training runs.
A fallback when training resources are unavailable. If you lack the GPU capacity for further training but have several partially specialised models, merging is a legitimate alternative.
Fine-grained style adjustment. SLERP with a t value closer to one side gives you a model that "leans a bit more" toward the characteristics of one fine-tune — which may be all you need to achieve the desired tone.
When merging does not make sense
It is equally important to know when not to attempt merging:
Substantially different training distributions. The more the data distributions and objectives of the source models diverge, the higher the probability of interference. Merging a model trained on legal contracts with one trained on poetry will likely produce nothing useful — both fine-tunes pull weights in very different directions.
Different architectures or tokenisers (without tokensurgeon). Merging assumes an identical architecture and the same tokeniser — otherwise there is no consistent parameter space in which to interpolate.
When you need predictable improvement. Merging is experimental. It does not always work. The result must always be evaluated on your specific benchmarks — without evaluation you have no way of knowing whether you changed the model for the worse. If your project requires guarantees, reach for standard fine-tuning, whose results are replicable and controllable. Measuring the quality of fine-tuning changes is covered in How to Measure Whether Fine-Tuning Helped.
Regulated environments. In healthcare, legal, or financial contexts, merging a production model carries more risk than fine-tuning — because of weaker auditability. You cannot show what data the resulting model was trained on. For regulated sectors, we recommend merging only as a tool for internal experiments, not as a path to a production model.
Risks and limitations
Long-tail degradation. Merging is usually evaluated on popular benchmarks. On edge cases specific to your domain, the resulting model may fail in ways that simple benchmarks do not capture.
Quality variance. The same method applied to different model pairs produces dramatically different results. A recipe that worked for one pair of fine-tunes may not work for another.
Uncontrolled safety properties. Merged models can inherit inappropriate behaviour from one of the source models if that model was not sufficiently aligned. This is particularly important when merging models from different trainers.
Neither SLERP nor TIES is "always safe": the resulting model may not preserve all desirable properties of its sources, so the result must always be evaluated. If you want to avoid the most common pitfalls when experimenting with fine-tuning in general, read 7 Reasons Why Fine-Tuning Fails in Practice.
Merging vs. fine-tuning: which when
A simple decision framework:
- Do you have at least two existing fine-tunes from a shared base model and want to explore a combination? → Try a merge first.
- Do you need a model with specific domain knowledge that none of your fine-tunes have? → Fine-tune on new data.
- Do you need quality guarantees and auditability in a regulated environment? → Fine-tuning with documented data.
- Do you want rapid exploration before investing in training? → A merge is a legitimate initial probe.
- Are you working with a MoE architecture (Llama 4, Qwen3 MoE)? → Merging is considerably more complex and tool support is less mature — verify before investing.
Merging is not a replacement for fine-tuning. It is a complementary tool in the advanced ML engineer's toolbox — valuable precisely where training would be unnecessarily expensive for exploratory questions. The relationship between fine-tuning and merging is similar to that between local LLMs and the cloud: both have their place, and context is everything.
Practical steps for a first attempt
If you want to try merging:
1. Start with two fine-tunes from a shared base model, both with documented quality on your benchmarks.
2. Install `mergekit` and write a YAML configuration for SLERP with `t=0.5`.
3. Run the merge (runs on CPU, requires tens of GB of RAM, no GPU needed).
4. Evaluate the result on the same benchmarks as the source models. Without evaluation you know nothing.
5. If the result looks promising, experiment with different values of `t`, or move to TIES for more models.
6. Only if evaluation confirms quality → use in production.
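The evaluation step in the list above can be turned into a simple gate. The sketch below compares the merged model with the best source model on every benchmark and reports any drop larger than a tolerance. The function name `find_regressions`, the benchmark names and all scores are invented for illustration; use your own benchmarks and thresholds.

```python
def find_regressions(source_scores, merged_scores, tolerance=0.01):
    """Return benchmarks where the merge trails the best source model by more than `tolerance`."""
    regressions = {}
    for bench, merged in merged_scores.items():
        best_source = max(scores[bench] for scores in source_scores.values())
        drop = best_source - merged
        if drop > tolerance:
            regressions[bench] = round(drop, 4)
    return regressions

# Hypothetical scores (fraction of test cases passed) for two fine-tunes and their merge.
source_scores = {
    "finetune_a": {"support": 0.82, "docs": 0.61},
    "finetune_b": {"support": 0.64, "docs": 0.85},
}
merged_scores = {"support": 0.80, "docs": 0.84}

print(find_regressions(source_scores, merged_scores, tolerance=0.03))    # nothing flagged
print(find_regressions(source_scores, merged_scores, tolerance=0.015))   # "support" is flagged
```

Comparing against the best source on each benchmark is a strict choice; a merge that is slightly behind each specialist but covers both domains may still be the better single model, which is why the tolerance is left as a parameter.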
Frequently asked questions
Is merging a replacement for fine-tuning?
No. Merging assumes you already have at least one well-trained fine-tune — it works on top of existing training results, it does not replace them. If no fine-tune exists, merging has nothing to work with.
How much RAM do I need to merge 7B models?
For 7B models in BF16, roughly 30–40 GB RAM on CPU. No VRAM is needed — the merge runs on CPU in a few minutes. For 13B models, expect roughly twice as much.
How does DARE differ from random weight removal (pruning)?
Pruning permanently removes parameters to make a model smaller. DARE removes delta parameters (deviations from the base model) before merging, in order to reduce interference noise — the resulting model has the same parameter count as the source models. The motivations are fundamentally different.
Does merging work for MoE models (Llama 4, Qwen3 MoE)?
Technically, partially — mergekit is adding support, but MoE architectures are considerably more complex: beyond the expert weights, the router parameters also need to be handled. Results are less predictable than with dense models and tool support is still evolving. We recommend verifying the current state of mergekit documentation for the specific architecture before proceeding.
Can a merge solve catastrophic forgetting?
Partially — if you have a checkpoint from before the forgetting and one from after fine-tuning, merging between them can reduce regression in general capabilities. This is a legitimate technique, but not a reliable replacement for replay data or regularisation approaches during fine-tuning itself.
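A minimal sketch of that idea, assuming both checkpoints are available as dictionaries that map tensor names to arrays: the weights are blended linearly, and `alpha` controls how much of the fine-tuned behaviour is kept. The function name `interpolate_checkpoints` and the toy tensors are this example's own; the right `alpha` has to be found by evaluating both general and task-specific benchmarks.

```python
import numpy as np

def interpolate_checkpoints(before: dict, after: dict, alpha: float) -> dict:
    """Blend two checkpoints: alpha = 0 returns `before`, alpha = 1 returns `after`."""
    if before.keys() != after.keys():
        raise ValueError("checkpoints must contain the same tensors")
    return {
        name: (1.0 - alpha) * before[name] + alpha * after[name]
        for name in before
    }

# Toy checkpoints with a single tensor each.
before = {"layer.weight": np.array([1.0, 1.0])}
after = {"layer.weight": np.array([3.0, -1.0])}

blended = interpolate_checkpoints(before, after, alpha=0.25)
print(blended["layer.weight"])   # a quarter of the way from `before` to `after`
```

Sweeping `alpha` over a handful of values and evaluating each blend is usually enough to see whether a useful trade-off exists between retained general ability and the newly learned task.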
*If you are considering working with your own fine-tuned models and are unsure where to start — whether with method selection, data preparation, or merging — we are happy to look at your specific case together. At MP Industrial Solutions we have gone through this process with multiple clients in manufacturing and engineering and we know where the real pitfalls lie.*
