What is model merging?

The catastrophic forgetting problem, why naive ensembles are too expensive, and the surprising geometric insight that makes model merging possible.

Why Model Merging Exists

The Problem You Will Absolutely Run Into

It's 3 PM on a Thursday. Your team has spent the past three weeks fine-tuning Llama-3-8B on your proprietary customer support dataset. The model is excellent - it knows your product's terminology, it handles edge cases gracefully, it declines appropriately when asked about competitors. Your QA team signs off. You deploy.

Then your PM walks over. Three of your biggest clients are asking for code generation support integrated directly into the chat interface. Can you add that?

You know exactly how to do this. You fine-tune a copy of Llama-3-8B on a high-quality coding dataset - CodeAlpaca, filtered Stack Overflow data. After a week of training runs you have a model that writes excellent Python and can debug customer-submitted scripts.

The problem arrives when you try to combine the two capabilities. You take your customer support model and fine-tune it further on the coding data. What you get back is not the sum of the two models. The coding training has partially overwritten your customer support specialization. Product terminology that was previously solid has degraded. The model that used to handle your domain-specific edge cases now forgets them mid-conversation. This is catastrophic forgetting - and it is one of the most persistent, frustrating problems in applied deep learning.

You could run both models in parallel and route queries to the right one. But that doubles your inference infrastructure. For eight enterprise customers that might be fine. For eighty thousand it's a budget conversation you don't want to have. Model merging is the approach that says: what if you could add the coding weights to the customer support weights, directly, without any retraining? One model. Zero additional inference cost. Both capabilities.

Why This Exists - The History of the Problem

Catastrophic Forgetting Since 1989

Neural networks have struggled with catastrophic forgetting since McCloskey and Cohen described it in 1989. When you train a network on task A and then train it on task B, the gradients from task B update the same weights that encoded task A. Depending on how different the tasks are, some or all of task A's performance degrades.

The scale of the problem depends on task similarity. Fine-tuning a code model on a closely related code style rarely causes much forgetting. Fine-tuning a customer support model on math reasoning can devastate the support capabilities.
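The mechanism can be seen in a deliberately tiny sketch. The setup below is hypothetical and not from any paper: a single weight `w` is trained by gradient descent on task A (target `y = 2x`) and then on task B (target `y = -x`). Because both tasks share the same weight, training on B pulls it away from the value task A needed.

```python
# Toy illustration (hypothetical setup): one shared weight, two tasks.

def mse(w, xs, slope):
    """Mean squared error of the model y = w * x against y = slope * x."""
    return sum((w * x - slope * x) ** 2 for x in xs) / len(xs)

def train(w, xs, slope, lr=0.05, steps=100):
    """Plain gradient descent on the MSE above."""
    for _ in range(steps):
        grad = sum(2 * (w * x - slope * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
w = train(0.0, xs, slope=2.0)          # learn task A
loss_a_before = mse(w, xs, slope=2.0)
w = train(w, xs, slope=-1.0)           # then learn task B with the same weight
loss_a_after = mse(w, xs, slope=2.0)

print(f"task A loss after training on A: {loss_a_before:.6f}")
print(f"task A loss after training on B: {loss_a_after:.6f}")
```

Real networks have many weights and the overlap between tasks is partial, so the degradation is usually less total than in this one-parameter case.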

Solutions Before Model Merging

The canonical solution - multi-task learning - trains on all tasks simultaneously. This works, but it requires all training data to be available at the same time. That's often impossible: data from different sources may arrive sequentially, may be under different privacy regimes, or may simply be too large to hold together in a single training pipeline.

Continual learning methods like Elastic Weight Consolidation (EWC, Kirkpatrick et al. 2017) add a regularization term penalizing changes to weights that mattered for previous tasks. These work but add complexity to every training run, require tracking which weights matter, and don't compose easily when you have more than two tasks.
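For reference, the EWC objective when training on task B after task A can be written as follows. Here $\theta^*_A$ denotes the weights after task A, $F_i$ is the diagonal Fisher information estimate for parameter $i$, and $\lambda$ sets how strongly the old-task weights are protected:

$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta^*_{A,i}\right)^2$

The need to estimate and store $F_i$ for every earlier task is the bookkeeping cost referred to above.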

Neither approach answered the practitioner's question: I already have two trained models. Can I combine them without retraining either one?

The Naive Ensemble - And Why It's Too Expensive

Before model merging, the practical answer was ensembling: run both models on every input, combine their outputs.

For classification, you average probabilities or take majority vote. For generation, you use more sophisticated strategies - generating from each model and picking the best with a separate scorer.
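The two classification strategies can be sketched in a few lines. This is a minimal illustration with made-up probabilities, not a production ensembling routine; the function names are hypothetical.

```python
from collections import Counter

def average_probabilities(prob_lists):
    """Average per-class probabilities from several models (soft voting)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

def majority_vote(predicted_labels):
    """Return the most common predicted label (hard voting)."""
    return Counter(predicted_labels).most_common(1)[0][0]

# Hypothetical outputs from three classifiers over classes [0, 1, 2]
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
avg = average_probabilities(probs)
soft_label = max(range(len(avg)), key=avg.__getitem__)
hard_label = majority_vote([max(range(3), key=p.__getitem__) for p in probs])
print(avg, soft_label, hard_label)
```

Either way, every model in the ensemble has to run a forward pass on every input, which is where the cost comes from.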

Ensembling genuinely works. It almost always improves performance over any single model. The problem is cost: $N$ models means $N \times$ memory, $N \times$ compute, $N \times$ latency (unless you run them in parallel, which requires $N \times$ hardware).

For two 8B-parameter models in BF16, the weights alone take approximately 16GB of VRAM each. Running both requires 32GB before you account for KV cache and activations - meaning you've been pushed off an A10G (24GB) onto an A100 or H100. For a company serving millions of queries per day, this difference compounds dramatically into operational cost.
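The arithmetic behind these numbers is simple enough to script. The sketch below counts weight memory only, assuming 2 bytes per BF16 parameter and decimal gigabytes; KV cache, activations, and framework overhead come on top. The helper name is hypothetical.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

BF16_BYTES = 2
one_model = weight_memory_gb(8e9, BF16_BYTES)   # a nominal 8B-parameter model
ensemble = 2 * one_model                         # two models held in memory
print(f"one model: {one_model:.0f} GB, two-model ensemble: {ensemble:.0f} GB")
```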

The Geometric Insight That Changes Everything

Loss Basins and What Fine-Tuning Actually Does

To understand why model merging is possible, you need a mental model of what fine-tuning does in weight space.

Think of the loss landscape as a high-dimensional surface. During pre-training, SGD descends to some low-loss region - a loss basin. This basin is not a single point; it's a broad, relatively flat region where many parameter configurations give similarly low loss on the pre-training distribution.

When you fine-tune the base model on a new task, you're nudging it within this landscape. The learning rate is small. The number of fine-tuning steps is tiny compared to pre-training. The fine-tuned model is still in the same general region - the same basin - just at a slightly different position.

This is the key insight: two fine-tunes of the same base model are likely to be in the same loss basin. And within a loss basin, the landscape is approximately convex - meaning the segment between two low-loss points also has low loss.

If the path between two fine-tuned models stays at low loss throughout, then any point on that path - including the midpoint - is itself a valid low-loss model. The averaged weights are not random noise; they are a specific model that represents a kind of compromise between the two fine-tuned configurations.

Loss Landscape (schematic)

     High
      |         ___
      |        /   \       ___
Loss  |       /     \     /   \
      |      /       \___/     \
      |_____/                   \______
      Low
                \_____________/
                  Same Basin
                  ^   ^   ^
               θ_base θ_A θ_B

Average(θ_A, θ_B) also inside the basin.

This is not a theoretical guarantee. It's an empirical observation that holds reliably when models share the same base checkpoint and are fine-tuned with moderate learning rates.
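The weight interpolation described in this section amounts to a per-parameter weighted sum. A minimal sketch follows, with plain floats standing in for tensors; the function name and parameter names are hypothetical, and the same expression should work for NumPy arrays or torch tensors since it only uses scalar multiplication and addition.

```python
def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter.

    Values may be floats, NumPy arrays, or torch tensors: anything that
    supports scalar multiplication and addition.
    """
    if sd_a.keys() != sd_b.keys():
        raise ValueError("state dicts must have identical parameter names")
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy stand-ins for two fine-tunes of the same base (hypothetical values)
theta_a = {"layer.weight": 1.0, "layer.bias": -0.5}
theta_b = {"layer.weight": 3.0, "layer.bias": 0.5}
midpoint = interpolate_state_dicts(theta_a, theta_b, alpha=0.5)
print(midpoint)
```

Whether the midpoint is actually a good model is the empirical question; the only way to know is to evaluate it.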

Mode Connectivity - The Research Foundation

The formal foundation comes from research on mode connectivity in neural networks. Draxler et al. (2018) and Garipov et al. (2018) showed that two local minima found by SGD are typically connected by a low-loss path - not necessarily a straight line, but a curved path that stays in low-loss territory. For over-parameterized models (which all modern LLMs are), subsequent work showed that even the straight-line path between two fine-tuned models often stays at low loss.

Formally, for model merging to work, we need the loss to be approximately convex on the segment between $\theta_1$ and $\theta_2$:

$\mathcal{L}\left(\alpha \theta_1 + (1-\alpha)\theta_2\right) \lesssim \alpha \mathcal{L}(\theta_1) + (1-\alpha)\mathcal{L}(\theta_2) \quad \text{for all } \alpha \in [0, 1]$

When this inequality holds, any convex combination of two fine-tuned models has loss no worse than the weighted average of their individual losses.
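One way to check this condition empirically is to sample $\alpha$ along the segment and record the largest amount by which the interpolated loss exceeds the weighted endpoint loss. The sketch below does this for a toy quadratic loss over plain Python lists; `loss_barrier` and `bowl` are hypothetical names, and for a real model `loss_fn` would be an evaluation pass on held-out data.

```python
def loss_barrier(loss_fn, theta_1, theta_2, steps=11):
    """Largest gap between the interpolated loss and the weighted endpoint loss.

    A value <= 0 means the inequality above held at every sampled alpha.
    """
    l1, l2 = loss_fn(theta_1), loss_fn(theta_2)
    worst = float("-inf")
    for i in range(steps):
        alpha = i / (steps - 1)
        theta = [alpha * a + (1 - alpha) * b for a, b in zip(theta_1, theta_2)]
        gap = loss_fn(theta) - (alpha * l1 + (1 - alpha) * l2)
        worst = max(worst, gap)
    return worst

# Toy convex loss (hypothetical): a quadratic bowl centered at (1, 1)
def bowl(theta):
    return sum((t - 1.0) ** 2 for t in theta)

barrier = loss_barrier(bowl, [0.0, 2.0], [2.0, 1.5])
print(f"barrier: {barrier:.4f}")
```

A clearly positive barrier on real models is a sign that the two checkpoints do not sit in the same basin and that straight averaging is likely to hurt.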

The Model Soup Paper - The Founding Experiment

In 2022, Mitchell Wortsman and colleagues at the University of Washington published "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time". This paper is the founding document of practical model merging.

Their setup: take a CLIP ViT-B/32 image encoder. Fine-tune it on ImageNet with 72 different hyperparameter configurations (learning rate, weight decay, augmentation, label smoothing). Each produces a slightly different model.

Naive assumption: the best single model wins. You run all 72, pick the one with highest validation accuracy, deploy it.

Wortsman et al. tried something different. They averaged the weights of the best-performing models. Not their outputs - their weights. The parameter vectors themselves.

The averaged "model soup" outperformed the best individual model on both in-distribution accuracy and out-of-distribution robustness (ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet). The soup was also better calibrated - its confidence estimates were more accurate.

This result was genuinely surprising. The community expected the best single training run to be the best deployable artifact. The result showed that diversity in the fine-tuning trajectory, combined with weight averaging, produced a more robust model.

The Greedy Soup Algorithm

Wortsman et al. also described a greedy soup construction procedure that's practical when you have quality variance among your models:

import torch
import copy

def greedy_soup(models_state_dicts, val_accuracies, evaluate_fn):
    """
    Greedy model soup: add models one by one if they improve validation accuracy.

    Args:
        models_state_dicts: list of dicts, sorted by val_accuracy descending
        val_accuracies: corresponding validation accuracies
        evaluate_fn: callable(state_dict) -> float (validation accuracy)

    Returns:
        soup_state_dict: the best greedy soup
        final_accuracy: its validation accuracy
    """
    # Start with the best model
    current_soup = copy.deepcopy(models_state_dicts[0])
    current_count = 1
    best_acc = val_accuracies[0]

    print(f"Starting soup with model 0, accuracy: {best_acc:.4f}")

    for i in range(1, len(models_state_dicts)):
        # Tentatively compute the new average
        candidate = {}
        for key in current_soup:
            candidate[key] = (
                current_soup[key] * current_count + models_state_dicts[i][key]
            ) / (current_count + 1)

        # Evaluate the candidate soup
        candidate_acc = evaluate_fn(candidate)

        if candidate_acc >= best_acc:
            current_soup = candidate
            current_count += 1
            best_acc = candidate_acc
            # Signed format: the soup can score below the added model on its own
            print(f"  Added model {i}: soup accuracy = {candidate_acc:.4f} ({candidate_acc - val_accuracies[i]:+.4f} vs model alone)")
        else:
            print(f"  Rejected model {i}: would drop accuracy to {candidate_acc:.4f}")

    return current_soup, best_acc


def uniform_soup(models_state_dicts):
    """Simple uniform average of all model weights."""
    result = copy.deepcopy(models_state_dicts[0])
    for state_dict in models_state_dicts[1:]:
        for key in result:
            result[key] = result[key] + state_dict[key]
    n = len(models_state_dicts)
    for key in result:
        result[key] = result[key] / n
    return result

The greedy algorithm provides a guarantee: the resulting soup is at least as good as the best individual model on the validation set. Uniform averaging has no such guarantee, because low-quality models drag down the average.

Task Arithmetic - A More Powerful Framework

In late 2022, Gabriel Ilharco and colleagues introduced task arithmetic in "Editing Models with Task Arithmetic". This reframed model merging more powerfully.

The central object is the task vector:

$\tau_A = \theta_A^{\text{fine-tuned}} - \theta_{\text{base}}$

The task vector represents the capability delta - what fine-tuning on task A added to the model. Task vectors can be:

Added: Apply a new capability to the base model
Composed: Add multiple task vectors to combine capabilities
Negated: Subtract a task vector to remove a capability
Scaled: Multiply by $\lambda$ to control capability strength
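Putting these operations together, a merged model under task arithmetic is the base weights plus a scaled sum of task vectors. The index $i$ over tasks and the subscript "merged" are notation introduced here for illustration; a negative $\lambda_i$ corresponds to negation:

$\theta_{\text{merged}} = \theta_{\text{base}} + \sum_i \lambda_i \tau_i$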

from safetensors.torch import load_file, save_file
import torch

def compute_task_vector(base_path: str, finetuned_path: str) -> dict[str, torch.Tensor]:
    """
    Compute task vector: the difference between fine-tuned and base weights.

    This represents the 'capability delta' added by fine-tuning.
    """
    print(f"Loading base model from {base_path}")
    base = load_file(base_path)

    print(f"Loading fine-tuned model from {finetuned_path}")
    finetuned = load_file(finetuned_path)

    task_vector = {}
    for key in base:
        if key in finetuned:
            # Work in float32 for numerical precision
            task_vector[key] = finetuned[key].float() - base[key].float()

    missing_in_finetuned = set(base.keys()) - set(finetuned.keys())
    if missing_in_finetuned:
        print(f"Warning: {len(missing_in_finetuned)} keys in base not found in fine-tuned model")

    return task_vector


def apply_task_arithmetic(
    base_path: str,
    task_vectors: list[dict[str, torch.Tensor]],
    lambdas: list[float],
    output_path: str,
):
    """
    Apply multiple task vectors to the base model with scaling factors.

    Positive lambda: adds capability
    Negative lambda: removes capability
    """
    assert len(task_vectors) == len(lambdas), "Must have one lambda per task vector"

    base = load_file(base_path)
    result = {k: v.float().clone() for k, v in base.items()}

    for task_vec, lam in zip(task_vectors, lambdas):
        action = "Adding" if lam > 0 else "Removing"
        print(f"  {action} capability with lambda={lam:.2f}")
        for key in result:
            if key in task_vec:
                result[key] = result[key] + lam * task_vec[key]

    # Convert back to bfloat16 to match typical LLM dtype
    result_bf16 = {k: v.bfloat16() for k, v in result.items()}
    save_file(result_bf16, output_path)
    print(f"Merged model saved to {output_path}")


# -------------------------------------------------------
# Example: Combine coding + instruction following
# -------------------------------------------------------
# All paths below are hypothetical single-file .safetensors checkpoints.
# Both fine-tunes must derive from the same base model. load_file() reads a
# local file, not a Hugging Face repo ID; sharded checkpoints would need to
# be handled shard by shard.
# tau_code = compute_task_vector("llama3-8b-base.safetensors",
#                                "llama3-8b-code-ft.safetensors")
# tau_chat = compute_task_vector("llama3-8b-base.safetensors",
#                                "llama3-8b-instruct-ft.safetensors")
#
# apply_task_arithmetic(
#     base_path="llama3-8b-base.safetensors",
#     task_vectors=[tau_code, tau_chat],
#     lambdas=[0.6, 0.7],     # tune these on a held-out eval set
#     output_path="llama3-code-instruct-merged.safetensors",
# )

# -------------------------------------------------------
# Example: Negation - remove safety restrictions (for research)
# -------------------------------------------------------
# tau_safe = compute_task_vector("llama3-base", "llama3-safety-finetuned")
# apply_task_arithmetic("llama3-base", [tau_safe], [-0.5], "llama3-less-safe")
# Note: this is a research technique; the resulting model needs careful evaluation

The negation result - using task arithmetic to remove capabilities - was particularly striking. Ilharco et al. showed that negating a task vector sharply reduced performance on the target task (image classification tasks for CLIP, toxic text generation for GPT-2) while largely preserving performance on control tasks. This suggests task vectors are genuinely capturing separable capability representations.

The Hugging Face Community Effect

The model soup and task arithmetic papers were academic experiments on computer vision and small NLP models. The Hugging Face open-source community scaled them to LLMs and turned them into a production practice.

By 2023, the Open LLM Leaderboard was dominated by merged models. Community practitioners discovered that merging a carefully instruction-tuned model with a domain-specialized model consistently outperformed either one alone. Models like "Goliath 120B" - a merge of two Llama-2 70B fine-tunes - demonstrated that merging could work even at very large scale.

The community contribution was less about new algorithms and more about empirical discovery of what combinations work. Through thousands of experiments:

Models based on the same base checkpoint consistently merged better
Scaling factors between 0.5 and 0.8 typically outperformed 0.5/0.5 splits
Merging more than 3-4 models with simple averaging often degraded performance
Domain-adjacent models (code + math, chat + writing) merged more cleanly than orthogonal pairs

This empirical knowledge eventually fed back into academic research, motivating TIES and DARE (covered in subsequent lessons) which addressed the failure modes practitioners kept hitting.

Applications in Production

Capability stacking: Combine a coding model, an instruction model, and a math model into a single deployable artifact. Zero additional inference overhead versus the base model.

Safety + capability balance: Fine-tune for a new capability, then blend with the safety-aligned original to recover alignment properties that fine-tuning may have degraded. This is used by companies that want to customize models without full safety re-evaluation.

Checkpoint averaging for robustness: Average the last N checkpoints from a long training run. This temporal version of model soup reduces variance and often improves out-of-distribution performance.

Reducing alignment tax: Safety fine-tuning sometimes reduces raw capability (the "alignment tax"). Merging the safety-tuned model with the original at a chosen ratio can recover some capability while retaining most of the safety behavior.
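Of the applications above, checkpoint averaging is the simplest to sketch. The helper below is a hypothetical illustration that uniformly averages the last `n` checkpoints, each given as a name-to-value dict; plain floats stand in for tensors, and the same expression should work for NumPy arrays or torch tensors.

```python
def average_last_checkpoints(checkpoints, n):
    """Uniformly average the last n checkpoints (each a name -> value dict)."""
    if not 1 <= n <= len(checkpoints):
        raise ValueError("n must be between 1 and the number of checkpoints")
    recent = checkpoints[-n:]
    return {k: sum(ckpt[k] for ckpt in recent) / n for k in recent[0]}

# Hypothetical training trajectory for a single parameter "w"
trajectory = [{"w": 1.0}, {"w": 2.0}, {"w": 4.0}, {"w": 6.0}]
averaged = average_last_checkpoints(trajectory, n=3)
print(averaged)
```

How many checkpoints to average is a tuning decision; going too far back pulls in weights from earlier, higher-loss stages of training.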

The Limits of Naive Averaging

Before diving into the more sophisticated algorithms, understand why naive averaging fails:

Sign conflicts: When model A increases a weight and model B decreases it, averaging partially cancels both effects. You lose both capabilities instead of gaining both. TIES merging (Lesson 03) addresses this.

Magnitude asymmetry: A model fine-tuned for 10K steps has larger delta weights than one fine-tuned for 1K steps. Simple averaging is dominated by the higher-magnitude model, effectively ignoring the smaller one.

Parameter redundancy: Delta weights often encode the same capability redundantly across many parameters. This redundancy causes interference when multiple models are merged. DARE (Lesson 04) sparsifies delta weights before merging to reduce this.

Layer sensitivity variation: Embedding layers and the LM head are far more sensitive to weight perturbation than middle transformer layers. Uniform merging treats all layers equally, which is suboptimal. Some practitioners use different merge ratios for different layer groups.
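The first two failure modes can be seen with a handful of made-up numbers. The sketch below uses hypothetical task-vector entries: the first part shows a sign conflict cancelling under averaging, the second shows a merged delta pointing almost entirely along the higher-magnitude task vector.

```python
# Sign conflict: hypothetical task-vector entries for the same three parameters
tau_a = [0.8, 0.10, -0.3]    # model A's deltas
tau_b = [-0.8, 0.12, -0.3]   # model B's deltas

averaged = [(a + b) / 2 for a, b in zip(tau_a, tau_b)]
conflicts = [i for i, (a, b) in enumerate(zip(tau_a, tau_b)) if a * b < 0]
print("averaged deltas:", averaged)
print("sign-conflict indices:", conflicts)

# Magnitude asymmetry: a long fine-tune vs. a short one (hypothetical values)
tau_long = [2.0, -1.5, 1.0]
tau_short = [0.1, 0.2, -0.1]
merged = [(x + y) / 2 for x, y in zip(tau_long, tau_short)]

def norm(v):
    """Euclidean norm of a list of floats."""
    return sum(x * x for x in v) ** 0.5

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))

print("cosine(merged, long): ", round(cosine(merged, tau_long), 3))
print("cosine(merged, short):", round(cosine(merged, tau_short), 3))
```

In the first part, parameter 0 ends up with no update at all even though both models changed it substantially. In the second, the merged direction is close to the long fine-tune's task vector and carries little of the short one.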

Architecture of a Merging Pipeline

Common Mistakes

:::danger Do not merge models with different base checkpoints

This is the most common error. If model A derives from Llama-3-8B-v1 and model B derives from Llama-3-8B-v2 - even identical architectures - they are in completely different loss basins. Merging them produces incoherent output. Always verify both models declare the same base in their model card and configuration. Check config.json for the base model name.

:::

:::danger Do not merge models with different tokenizers

Weight merging assumes the token embedding matrices are aligned. If model A uses Llama's tokenizer and model B uses Mistral's tokenizer, their vocabularies differ: the embedding matrices may have different sizes, and even where the sizes match, the same row corresponds to different tokens. There is no sensible way to merge them directly. Tokenizer compatibility is a hard requirement.

:::

:::warning Merging cannot inject new knowledge

Model merging combines existing capabilities but cannot create new ones. If neither source model knows about your proprietary API's endpoints, no combination of their weights will produce a model that does. Merging is for combining capabilities that already exist in fine-tuned models - not a substitute for training on relevant data.

:::

:::warning Evaluate exhaustively, not just on aggregate benchmarks

Merged models can excel on aggregate benchmarks while failing on specific tasks or edge cases. A model that merges coding + instruction following might score well on MMLU and HumanEval but fail silently on multi-turn conversations with code blocks. Always evaluate on your actual deployment distribution.

:::
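A cheap structural check can catch some of these mistakes before any weights are merged. The sketch below is a hypothetical helper, not part of any merging library: it compares two mappings from parameter name to shape and reports mismatches. The parameter names and shapes are made up for illustration.

```python
def check_mergeable(shapes_a, shapes_b):
    """Compare two name -> shape mappings and list structural mismatches.

    An empty result is necessary but not sufficient: matching shapes do not
    prove the two models share a base checkpoint or a tokenizer.
    """
    problems = []
    for name in sorted(set(shapes_a) | set(shapes_b)):
        if name not in shapes_a or name not in shapes_b:
            problems.append(f"{name}: present in only one model")
        elif tuple(shapes_a[name]) != tuple(shapes_b[name]):
            problems.append(f"{name}: {tuple(shapes_a[name])} vs {tuple(shapes_b[name])}")
    return problems

# Hypothetical shape maps; with real models these could be built as
# {k: tuple(v.shape) for k, v in state_dict.items()}
model_a = {"embed.weight": (32000, 4096), "lm_head.weight": (32000, 4096)}
model_b = {"embed.weight": (128256, 4096), "lm_head.weight": (128256, 4096)}
for problem in check_mergeable(model_a, model_b):
    print(problem)
```

Passing this check only rules out the obvious failures; base-checkpoint and tokenizer provenance still has to be confirmed from the model cards.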

Interview Q&A

Q: Why does model merging work at all? What is the geometric argument?

A: Model merging exploits the geometry of the loss landscape. When two models share the same base checkpoint, they are both located within the same broad loss basin - the region of parameter space with low training loss. Within a loss basin, the landscape is approximately convex: the straight-line segment between two low-loss points also passes through relatively low-loss territory. Averaging the weights of two such models produces a parameter configuration on that segment, which therefore also has low loss and good task performance. Two randomly initialized models or models from different pre-training runs occupy different basins; the path between them crosses high-loss regions, so averaging their weights produces a poor model.

Q: What is a task vector and what operations can you perform on it?

A: A task vector is the weight difference between a fine-tuned model and the base model: $\tau = \theta_{\text{fine-tuned}} - \theta_{\text{base}}$. It represents the capability delta that fine-tuning added. You can add a task vector to the base model to apply a capability, add multiple task vectors to compose capabilities, subtract a task vector to remove a capability (negation), and scale by a coefficient $\lambda$ to control the strength of an applied capability. This framework from Ilharco et al. 2022 makes model editing via weight arithmetic intuitive and composable.

Q: What are the three conditions for model merging to succeed?

A: First, both models must share the same base checkpoint - same architecture, same pre-training run. Second, they must use the same tokenizer, since different vocabularies make embedding matrices incompatible. Third, fine-tuning should be conservative enough that both models remain in the original loss basin - very long fine-tuning runs with high learning rates can move models far enough from the base that merging degrades.

Q: What was the key result of Wortsman et al.'s model soup paper?

A: Wortsman et al. fine-tuned CLIP with 72 different hyperparameter configurations and averaged the weights of the best-performing fine-tunes. The resulting "model soup" outperformed the best individual model on both in-distribution ImageNet accuracy and out-of-distribution robustness benchmarks (ImageNet-V2, ImageNet-R, ObjectNet). The result was surprising because it showed that weight averaging across multiple fine-tuning trajectories was better than selecting the single best trajectory - diversity in the fine-tuning process, combined with averaging, produces a more robust model than any single run.

Q: Why do simple weight averages fail with sign conflicts, and how does this motivate TIES merging?

A: When model A fine-tunes a weight positively (increases it) and model B fine-tunes the same weight negatively (decreases it), simple averaging leaves the merged update close to zero, so the weight ends up near its base value - effectively undoing both models' updates. Neither capability benefits. This happens often in practice because different tasks push the same parameters in different directions. TIES merging (Yadav et al. 2023) addresses this by first trimming small-magnitude deltas, then electing a sign for each parameter according to which direction carries more total magnitude across models, and finally averaging only the deltas that agree with the elected sign. Deltas that conflict with the elected sign are excluded from the merge rather than averaged toward zero.

Q: How does catastrophic forgetting motivate model merging as an alternative training strategy?

A: Catastrophic forgetting occurs when fine-tuning on task B overwrites weights that encoded task A, degrading task A performance. The standard fix - multi-task training - requires all data to be available simultaneously, which is often impractical. Model merging avoids the problem entirely by never sequentially fine-tuning: you fine-tune separate copies of the base model independently (no forgetting occurs within either), then merge the resulting weight deltas. Each model learns its task in isolation, and the merge combines both capability deltas without any training signal that could cause interference.

