How weight averaging of fine-tuned models produces better, more robust models than any individual fine-tune - and the task arithmetic framework for composing capabilities.

Linear Interpolation and Model Soup

A Strange Result From a Simple Experiment

The year is 2022. You're a researcher at the University of Washington. You've just finished an exhaustive hyperparameter search: 72 fine-tuned versions of CLIP, each with a different combination of learning rate, weight decay, and augmentation strategy. You run all 72 on your validation set and pick the best one. Done. Ship it.

But then a colleague suggests something odd: what if you averaged the weights of the best models? Not voted on their predictions. Not combined their logits. Just literally averaged the numbers in their parameter tensors.

You try it. The averaged model scores higher than the best individual model. It also generalizes better to distribution shifts that none of the individual models were trained for. You run it again to make sure it's not a fluke. It holds.

This is the model soup result - Wortsman et al. 2022 - and it is simultaneously obvious (in hindsight) and completely counterintuitive (at the time). It launched an entire subfield of research and a community of practitioners who spend their days averaging language model weights and watching benchmark numbers climb.

This lesson covers the full theory and practice of linear interpolation as a merging technique: model soup, task arithmetic, and the important extensions that make these simple ideas powerful in practice.

Why Linear Interpolation Works

The Loss Landscape View

Fine-tuning starts from a pre-trained base model - a specific point in high-dimensional parameter space. Pre-training has placed the model in a broad, low-loss region called a loss basin. When you fine-tune with a small learning rate and limited steps, you move within that basin. You don't leap to a different part of the loss landscape.

Two fine-tuned models derived from the same base are therefore both somewhere inside the same loss basin. The key question is: what does the loss landscape look like between them?

Within a loss basin of an over-parameterized model, the landscape is approximately flat - meaning the loss doesn't spike dramatically as you move between two low-loss configurations. This is the empirical observation that makes linear interpolation work. The midpoint between two fine-tuned models, in parameter space, is itself a low-loss configuration.
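
One way to check this flatness empirically is to evaluate the loss at evenly spaced points on the straight line between two models and compare the result with the endpoints. The sketch below does this in a toy setting. The names `path_losses`, `loss_barrier`, and `toy_loss` are illustrative, plain Python lists stand in for parameter tensors, and the quadratic bowl is only a stand-in for a real validation loss.

```python
from typing import Callable, Sequence

Params = Sequence[float]

def interpolate(a: Params, b: Params, alpha: float) -> list[float]:
    """Return (1 - alpha) * a + alpha * b, element-wise."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]

def path_losses(loss_fn: Callable[[Params], float], a: Params, b: Params,
                n_points: int = 11) -> list[float]:
    """Loss at n_points evenly spaced positions on the line from a to b."""
    alphas = [i / (n_points - 1) for i in range(n_points)]
    return [loss_fn(interpolate(a, b, alpha)) for alpha in alphas]

def loss_barrier(losses: list[float]) -> float:
    """Highest loss on the path minus the worse of the two endpoint losses.

    A value at or below zero means the path never rises above the endpoints.
    """
    return max(losses) - max(losses[0], losses[-1])

# Toy stand-in for a validation loss: a bowl centred at (1.0, -2.0).
def toy_loss(theta: Params) -> float:
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

theta_a = [0.5, -2.5]   # hypothetical fine-tune A
theta_b = [1.5, -1.5]   # hypothetical fine-tune B

losses = path_losses(toy_loss, theta_a, theta_b)
barrier = loss_barrier(losses)
print(f"endpoint losses: {losses[0]:.3f}, {losses[-1]:.3f}")
print(f"midpoint loss:   {losses[5]:.3f}")
print(f"barrier:         {barrier:.3f}")
```

With real models, `toy_loss` would be replaced by a function that loads the interpolated weights and runs a validation pass. A clearly positive barrier is a sign that the two models do not share a basin and that averaging them is risky.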

Why the Average Can Be Better Than Either Source

The averaging result seems paradoxical. How can an average of two models outperform both of them?

The answer lies in variance reduction. Each fine-tuned model captures some signal and some noise - specific adaptations to the random quirks of its training run (mini-batch ordering, data augmentation, hyperparameter choices, and the random initialization of any newly added head). When you average multiple models, the noise tends to cancel (different runs make different random errors) while the signal reinforces (all runs learn the true task structure).

This is exactly the intuition behind ensemble methods in classical machine learning. Weight averaging achieves a similar variance-reduction effect - but without the inference cost of running multiple models.
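
The cancellation argument can be illustrated with a small simulation. This is a sketch under a strong assumption: every run equals the same "true" weights plus independent Gaussian noise. Real fine-tunes share data and a starting point, so their errors are correlated and the gain is smaller. All constants below are made up for illustration.

```python
import math
import random

random.seed(0)

DIM = 1000          # number of "weights" in the toy model
N_MODELS = 8        # number of simulated fine-tuning runs
NOISE_STD = 0.1     # run-specific noise level (made up for illustration)

# The shared "signal" every run is trying to learn.
true_weights = [random.gauss(0.0, 1.0) for _ in range(DIM)]

# Each run recovers the signal plus its own independent noise.
runs = [
    [w + random.gauss(0.0, NOISE_STD) for w in true_weights]
    for _ in range(N_MODELS)
]

def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Uniform average of all runs, element by element.
averaged = [sum(column) / N_MODELS for column in zip(*runs)]

individual_errors = [distance(run, true_weights) for run in runs]
mean_individual_error = sum(individual_errors) / N_MODELS
averaged_error = distance(averaged, true_weights)

print(f"mean error of single runs: {mean_individual_error:.3f}")
print(f"error of averaged weights: {averaged_error:.3f}")
print(f"ideal shrink factor 1/sqrt(N): {1 / math.sqrt(N_MODELS):.3f}")
```

Under the independence assumption the error of the average shrinks roughly like $1/\sqrt{N}$. The averaged error can never exceed the mean individual error, which follows from the triangle inequality.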

Additionally, different hyperparameter configurations explore different corners of the loss basin. A model trained with high weight decay has moved differently than one trained with low weight decay. Averaging them produces a model in the center of the explored region - a more "average" point in the loss basin that doesn't over-commit to any specific configuration's quirks.

Model Soup - The Full Algorithm

Uniform Soup

The simplest version: average all model weights equally.

$\theta_{soup} = \frac{1}{N} \sum_{i=1}^{N} \theta_i$

This works but is sensitive to the quality distribution of included models. A single poorly-performing model can drag down the average.

Greedy Soup

Wortsman et al.'s recommended algorithm for practical use:

1. Sort all fine-tuned models by validation accuracy (descending).
2. Initialize the soup with the best model.
3. For each remaining model (in order), tentatively add it to the soup.
4. Keep it if the new soup's validation accuracy is at least as high as the current soup's; discard it otherwise.
5. Return the final soup.

import torch
from typing import Callable

def uniform_soup(state_dicts: list[dict]) -> dict:
    """Average all model weights uniformly."""
    n = len(state_dicts)
    # Accumulate in float32 so low-precision weights don't lose accuracy while summing
    result = {key: value.float().clone() for key, value in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for key in result:
            result[key] += sd[key].float()
    # Divide, then cast each tensor back to the dtype it had in the first model
    return {key: (value / n).to(state_dicts[0][key].dtype) for key, value in result.items()}


def greedy_soup(
    state_dicts: list[dict],
    val_accuracies: list[float],
    evaluate_fn: Callable[[dict], float],
    verbose: bool = True,
) -> tuple[dict, float]:
    """
    Greedy model soup: add models one by one if they improve validation accuracy.

    Parameters
    ----------
    state_dicts       : Model weights, in the same order as val_accuracies (sorted internally).
    val_accuracies    : Held-out validation accuracy of each model in state_dicts.
    evaluate_fn       : Function that accepts a state_dict and returns accuracy.

    Returns
    -------
    (soup_state_dict, soup_accuracy)
    """
    # Sort by validation accuracy, descending
    paired = sorted(zip(val_accuracies, state_dicts), key=lambda x: -x[0])
    sorted_accs = [p[0] for p in paired]
    sorted_sds = [p[1] for p in paired]

    # Start with best individual model
    current_soup = {k: v.float().clone() for k, v in sorted_sds[0].items()}
    current_count = 1
    best_acc = sorted_accs[0]

    if verbose:
        print(f"Initial soup: model_0 | val_acc = {best_acc:.4f}")

    for i in range(1, len(sorted_sds)):
        # Compute candidate soup (running average)
        candidate = {}
        for key in current_soup:
            candidate[key] = (
                current_soup[key] * current_count + sorted_sds[i][key].float()
            ) / (current_count + 1)

        candidate_acc = evaluate_fn(candidate)

        if candidate_acc >= best_acc:
            current_soup = candidate
            current_count += 1
            best_acc = candidate_acc
            if verbose:
                delta = candidate_acc - sorted_accs[i]
                print(f"  Accepted model_{i} | soup_acc = {candidate_acc:.4f} | delta vs alone = {delta:+.4f}")
        else:
            if verbose:
                print(f"  Rejected model_{i} | would drop soup_acc to {candidate_acc:.4f}")

    # Convert each tensor back to the dtype it had in the best individual model
    final_soup = {k: v.to(sorted_sds[0][k].dtype) for k, v in current_soup.items()}
    return final_soup, best_acc
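
To see how the two strategies behave when one run has gone wrong, here is a toy re-implementation of the selection loop on plain Python lists. It is a sketch, not the authors' code. `greedy_select`, `toy_score`, and the four "models" are made up for illustration, and the score is a stand-in for validation accuracy.

```python
from typing import Callable

Weights = list[float]

def average(models: list[Weights]) -> Weights:
    """Uniform element-wise average of equally sized weight lists."""
    return [sum(col) / len(models) for col in zip(*models)]

def greedy_select(models: list[Weights],
                  score_fn: Callable[[Weights], float]) -> tuple[list[int], float]:
    """Return the indices kept by greedy selection and the final soup score."""
    order = sorted(range(len(models)), key=lambda i: score_fn(models[i]), reverse=True)
    kept = [order[0]]
    best = score_fn(models[order[0]])
    for i in order[1:]:
        candidate = score_fn(average([models[j] for j in kept + [i]]))
        if candidate >= best:        # keep only if the soup does not get worse
            kept.append(i)
            best = candidate
    return kept, best

# Toy "validation score": higher is better, peaks at the target weights.
TARGET = [1.0, 1.0]

def toy_score(w: Weights) -> float:
    return 1.0 / (1.0 + sum((x - t) ** 2 for x, t in zip(w, TARGET)))

models = [
    [0.9, 1.1],    # good run
    [1.2, 0.9],    # good run that errs in a different direction
    [1.1, 1.3],    # mediocre run
    [4.0, -3.0],   # diverged run (made-up outlier)
]

uniform_score = toy_score(average(models))
kept, greedy_score = greedy_select(models, toy_score)
best_single = max(toy_score(m) for m in models)

print(f"best single model: {best_single:.3f}")
print(f"uniform soup:      {uniform_score:.3f}")
print(f"greedy soup:       {greedy_score:.3f} (kept models {kept})")
```

In this toy setup the outlier pulls the uniform soup well below the best single model, while the greedy loop keeps only the two runs whose errors partly cancel.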

When to Use Greedy vs Uniform

Use greedy soup when:

You have models of varying quality (common in hyperparameter search)
Evaluation is cheap relative to the cost of including bad models
You need the soup to be no worse than the best single model on the validation set

Use uniform soup when:

All models were trained with similar, reasonable hyperparameters
You want maximum diversity without the greedy selection overhead
You're averaging checkpoints from a single training run

Linear Interpolation Between Two Models

A simpler form of model soup is a linear interpolation between exactly two models:

$\theta(\alpha) = (1 - \alpha) \cdot \theta_A + \alpha \cdot \theta_B$

Here $\alpha \in [0, 1]$ controls the blend. At $\alpha = 0$ you have model A; at $\alpha = 1$ you have model B; at $\alpha = 0.5$ you have the simple average.

def linear_interpolate(
    state_dict_a: dict,
    state_dict_b: dict,
    alpha: float = 0.5,
) -> dict:
    """
    Linearly interpolate between two models.

    alpha=0.0 -> model_a
    alpha=0.5 -> equal blend
    alpha=1.0 -> model_b
    """
    assert 0.0 <= alpha <= 1.0, "alpha must be in [0, 1]"
    result = {}
    for key in state_dict_a:
        a_val = state_dict_a[key].float()
        b_val = state_dict_b[key].float()
        # Blend in float32, then cast back to model A's dtype for this tensor
        result[key] = ((1 - alpha) * a_val + alpha * b_val).to(state_dict_a[key].dtype)
    return result


def alpha_sweep(
    state_dict_a: dict,
    state_dict_b: dict,
    evaluate_fn: Callable[[dict], float],
    n_steps: int = 11,
) -> list[tuple[float, float]]:
    """
    Evaluate linear interpolations at multiple alpha values.
    Useful for finding the optimal blend ratio.
    """
    results = []
    alphas = [i / (n_steps - 1) for i in range(n_steps)]
    for alpha in alphas:
        blended = linear_interpolate(state_dict_a, state_dict_b, alpha)
        acc = evaluate_fn(blended)
        results.append((alpha, acc))
        print(f"  alpha={alpha:.2f} | accuracy={acc:.4f}")
    return results

# Usage:
# results = alpha_sweep(model_a_weights, model_b_weights, my_eval_fn)
# best_alpha = max(results, key=lambda x: x[1])[0]
# final_model = linear_interpolate(model_a_weights, model_b_weights, best_alpha)

The alpha sweep is a cheap way to find the optimal blend ratio. In practice, the optimal alpha is often not 0.5 - one model's capabilities may dominate at a ratio like 0.3/0.7.

Task Arithmetic - Composing Capabilities

The Task Vector Abstraction

Gabriel Ilharco and colleagues introduced task arithmetic in "Editing Models with Task Arithmetic" (ICLR 2023; arXiv preprint 2022). Rather than interpolating between complete models, task arithmetic decomposes the fine-tuned model into a base + capability delta:

$\tau_A = \theta_A - \theta_{base}$

This task vector $\tau_A$ represents purely what fine-tuning on task A added to the model. Crucially, task vectors are algebraically composable:

Addition - apply task A's capabilities to the base: $\theta_{new} = \theta_{base} + \lambda \cdot \tau_A$

Composition - combine multiple capabilities: $\theta_{new} = \theta_{base} + \lambda_A \cdot \tau_A + \lambda_B \cdot \tau_B$

Negation - remove a capability from the base: $\theta_{new} = \theta_{base} - \lambda \cdot \tau_A$

Analogy - remove capability A, add capability B (style transfer analog): $\theta_{new} = \theta_C - \lambda \cdot \tau_A + \lambda \cdot \tau_B$

from safetensors.torch import load_file, save_file
import torch
from pathlib import Path


class TaskVector:
    """
    Represents the capability delta between a fine-tuned model and its base.

    Based on: Ilharco et al., "Editing Models with Task Arithmetic" (2022)
    """

    def __init__(
        self,
        base_path: str | None = None,
        finetuned_path: str | None = None,
        vector: dict[str, torch.Tensor] | None = None,
    ):
        if vector is not None:
            self.vector = vector
        elif base_path and finetuned_path:
            self.vector = self._compute(base_path, finetuned_path)
        else:
            raise ValueError("Provide either (base_path, finetuned_path) or vector=")

    def _compute(self, base_path: str, finetuned_path: str) -> dict[str, torch.Tensor]:
        base = load_file(base_path)
        finetuned = load_file(finetuned_path)
        vector = {}
        for key in base:
            if key in finetuned:
                vector[key] = finetuned[key].float() - base[key].float()
        return vector

    def __add__(self, other: "TaskVector") -> "TaskVector":
        """Compose two task vectors (combine capabilities)."""
        combined = {}
        all_keys = set(self.vector) | set(other.vector)
        for key in all_keys:
            a = self.vector.get(key)
            b = other.vector.get(key)
            # A key missing on one side is treated as a zero delta on that side
            if a is None:
                combined[key] = b.clone()
            elif b is None:
                combined[key] = a.clone()
            else:
                combined[key] = a + b
        return TaskVector(vector=combined)

    def __neg__(self) -> "TaskVector":
        """Negate a task vector (remove a capability)."""
        return TaskVector(vector={k: -v for k, v in self.vector.items()})

    def __mul__(self, scalar: float) -> "TaskVector":
        """Scale a task vector (control capability strength)."""
        return TaskVector(vector={k: scalar * v for k, v in self.vector.items()})

    def __rmul__(self, scalar: float) -> "TaskVector":
        return self.__mul__(scalar)

    def apply_to(self, base_path: str, output_path: str, dtype=torch.bfloat16):
        """Apply this task vector to the base model and save."""
        base = load_file(base_path)
        result = {}
        for key in base:
            base_val = base[key].float()
            delta = self.vector.get(key, torch.zeros_like(base_val))
            result[key] = (base_val + delta).to(dtype)
        save_file(result, output_path)
        print(f"Saved merged model to {output_path}")


# ============================================================
# Example: Multi-capability composition
# ============================================================
# tau_code = TaskVector("llama3-base", "llama3-code-finetuned")
# tau_math = TaskVector("llama3-base", "llama3-math-finetuned")
# tau_chat = TaskVector("llama3-base", "llama3-chat-finetuned")
#
# # Combine all three with different strengths
# combined = 0.6 * tau_code + 0.5 * tau_math + 0.7 * tau_chat
# combined.apply_to("llama3-base", "llama3-code-math-chat-merged")
#
# # Negation: remove coding capability from an existing model
# tau_code_neg = -0.4 * tau_code
# tau_code_neg.apply_to("llama3-code-finetuned", "llama3-code-reduced")

The Scaling Factor Lambda

The $\lambda$ coefficient controls how strongly a task vector is applied. It's the single most important hyperparameter in task arithmetic.

Too small ( $\lambda < 0.3$ ): the capability is barely applied; performance on the new task remains low
Optimal ( $\lambda \approx 0.5 - 0.8$ ): good performance on the new task, base capabilities largely preserved
Too large ( $\lambda > 1.0$ ): the model over-corrects toward the new task; base performance degrades

The optimal $\lambda$ depends on the task pair, the magnitude of the task vectors, and the degree of interference between them. Always sweep over $\lambda$ values on a held-out evaluation set.

def lambda_sweep_task_arithmetic(
    base_path: str,
    task_vec: TaskVector,
    evaluate_fn: Callable[[dict], float],
    lambdas: list[float] | None = None,
) -> list[tuple[float, float]]:
    """Find optimal lambda for single task vector application."""
    if lambdas is None:
        lambdas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

    base = load_file(base_path)
    results = []

    for lam in lambdas:
        scaled = lam * task_vec
        merged = {}
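        for key in base:
            base_val = base[key].float()
            delta = scaled.vector.get(key)
            # Keys without a delta are copied from the base unchanged
            merged_val = base_val if delta is None else base_val + delta
            merged[key] = merged_val.to(base[key].dtype)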

        acc = evaluate_fn(merged)
        results.append((lam, acc))
        print(f"  lambda={lam:.2f} | accuracy={acc:.4f}")

    best_lam, best_acc = max(results, key=lambda x: x[1])
    print(f"\nBest: lambda={best_lam:.2f} | accuracy={best_acc:.4f}")
    return results
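
The sweep above optimizes a single metric. When base capabilities also matter, one option is to record a second "control" metric for each lambda and choose the best task score among the values that keep the control metric close to the base model's. The sketch below shows that selection rule only. `SweepPoint`, `pick_lambda`, the `max_drop` tolerance, and all numbers are made up for illustration.

```python
from typing import NamedTuple

class SweepPoint(NamedTuple):
    lam: float
    task_acc: float      # accuracy on the newly added capability
    control_acc: float   # accuracy on capabilities that must be preserved

def pick_lambda(points: list[SweepPoint], base_control_acc: float,
                max_drop: float = 0.01) -> SweepPoint:
    """Best task accuracy among points whose control accuracy stays within max_drop of the base."""
    allowed = [p for p in points if base_control_acc - p.control_acc <= max_drop]
    if not allowed:
        raise ValueError("No lambda satisfies the control-accuracy constraint")
    return max(allowed, key=lambda p: p.task_acc)

# Made-up sweep results, for illustration only.
sweep = [
    SweepPoint(0.2, 0.55, 0.800),
    SweepPoint(0.4, 0.68, 0.798),
    SweepPoint(0.6, 0.74, 0.794),
    SweepPoint(0.8, 0.77, 0.781),
    SweepPoint(1.0, 0.78, 0.752),
]

chosen = pick_lambda(sweep, base_control_acc=0.800, max_drop=0.01)
print(f"chosen lambda: {chosen.lam} (task={chosen.task_acc}, control={chosen.control_acc})")
```

With these numbers the unconstrained optimum would be lambda = 1.0, but the constraint selects a smaller value that gives up a little task accuracy to keep the control metric within tolerance.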

Task Vector Negation - Removing Capabilities

One of the most surprising results of task arithmetic is that negation works: subtracting a task vector removes the corresponding capability while largely preserving others.

Ilharco et al. demonstrated this in two settings. For image classification with CLIP, subtracting a task's vector sharply reduced accuracy on that task while leaving accuracy on a control task (ImageNet) largely intact. For text generation, subtracting a task vector obtained by fine-tuning GPT-2 on toxic text reduced the rate of toxic generations with little change in perplexity on a control corpus.

This has practical applications:

Removing toxic behavior patterns from a fine-tuned model
Reducing a model's confidence on certain topic categories
Creating "capability ablations" for research purposes

# Practical negation example
# Scenario: your model has been fine-tuned to be "too helpful" about dangerous topics
# You have a fine-tuned model of that dangerous-topic-only capability
# Negate it to reduce that tendency

# tau_dangerous = TaskVector("llama3-base", "llama3-dangerous-topics")
# tau_safe = -0.5 * tau_dangerous   # subtract the dangerous-topic capability
# tau_safe.apply_to("llama3-instruct", "llama3-instruct-safer")

# Note: This is approximate. Negation reduces a capability but doesn't eliminate it.
# For serious safety applications, use proper safety fine-tuning + RLHF.

Model Soup for Checkpoints - The Temporal Dimension

A less-discussed application of model soup is checkpoint averaging: instead of averaging models from different hyperparameter configurations, you average the last N checkpoints from a single training run.

This is a well-established trick, closely related to Polyak averaging and long used in neural machine translation, that reduces variance in the final model:

from pathlib import Path

def checkpoint_soup(checkpoint_dir: str, last_n: int = 5) -> dict:
    """
    Average the last N checkpoints from a training run.

    Reduces variance and often improves out-of-distribution performance.
    """
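    # Sort by the numeric step in the file name (assumes "checkpoint-<step>.safetensors");
    # a plain lexicographic sort would place checkpoint-1000 before checkpoint-500.
    checkpoints = sorted(
        Path(checkpoint_dir).glob("checkpoint-*.safetensors"),
        key=lambda p: int(p.stem.split("-")[-1]),
    )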
    if len(checkpoints) == 0:
        raise ValueError(f"No safetensors checkpoints found in {checkpoint_dir}")

    # Take the last N
    selected = checkpoints[-last_n:]
    print(f"Averaging {len(selected)} checkpoints:")
    for ckpt in selected:
        print(f"  {ckpt.name}")

    # Load and accumulate
    result = None
    for ckpt_path in selected:
        sd = load_file(str(ckpt_path))
        if result is None:
            result = {k: v.float() for k, v in sd.items()}
        else:
            for key in result:
                result[key] += sd[key].float()

    # Divide by count
    n = len(selected)
    final = {k: (v / n).bfloat16() for k, v in result.items()}
    return final

# Averaging the last few checkpoints can give a small accuracy improvement over the
# single best checkpoint, especially for smaller datasets. Measure it on a held-out set.
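
One detail worth checking when selecting the "last N" checkpoints is how the file names sort. The sketch below assumes names of the form `checkpoint-<step>.safetensors`, which is an assumption about the training setup. `step_of` is an illustrative helper.

```python
from pathlib import Path

def step_of(path: Path) -> int:
    """Extract the trailing integer step from a name like 'checkpoint-1500.safetensors'."""
    return int(path.stem.rsplit("-", 1)[-1])

names = [
    "checkpoint-500.safetensors",
    "checkpoint-1000.safetensors",
    "checkpoint-1500.safetensors",
]
paths = [Path(n) for n in names]

# Default ordering compares the names character by character.
lexicographic = [p.name for p in sorted(paths)]
# Ordering by the parsed step number matches training order.
by_step = [p.name for p in sorted(paths, key=step_of)]

print("lexicographic:", lexicographic)
print("by step:      ", by_step)
```

Under lexicographic ordering the step-500 file sorts last, so a "last N" slice would mix an early checkpoint into the average. Zero-padded step numbers or an explicit numeric sort key avoid this.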

When Linear Interpolation Works and When It Doesn't

Conditions for Success

Failure Modes

Sign conflicts dominate: When task A increases a weight and task B decreases it, the average is near zero - neither capability is well-represented. This is endemic to multi-task merging of divergent domains. Solution: TIES merging.
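
A quick diagnostic before merging is to measure how often two task vectors disagree in sign. The sketch below uses plain Python lists in place of tensors. `sign_conflict_fraction` and the example deltas are made up for illustration, and the rule that zero entries are skipped is a design choice of this sketch.

```python
def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_conflict_fraction(tau_a: dict[str, list[float]],
                           tau_b: dict[str, list[float]]) -> float:
    """Fraction of shared, non-zero entries where the two task vectors disagree in sign."""
    conflicts = 0
    compared = 0
    for key in tau_a.keys() & tau_b.keys():
        for a, b in zip(tau_a[key], tau_b[key]):
            if a == 0 or b == 0:
                continue            # a zero delta cannot conflict
            compared += 1
            conflicts += sign(a) != sign(b)
    return conflicts / compared if compared else 0.0

# Made-up deltas for two tasks over the same two parameter tensors.
tau_code = {"layer0.weight": [0.02, -0.01, 0.03, 0.00], "layer1.weight": [-0.04, 0.01]}
tau_math = {"layer0.weight": [-0.02, -0.02, 0.01, 0.05], "layer1.weight": [0.03, 0.02]}

fraction = sign_conflict_fraction(tau_code, tau_math)
print(f"sign conflicts: {fraction:.0%}")
```

A high conflict fraction suggests that plain averaging will cancel much of both task vectors, which is the situation sign-aware methods are designed for.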

Large delta magnitudes: Models fine-tuned for many steps have large task vectors. When averaged with models that have small vectors, the large-delta model dominates. The other models' contributions are washed out. Solution: normalize delta magnitudes before merging, or use DARE to sparsify.
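
One simple reading of "normalize delta magnitudes" is to rescale every task vector to a common global L2 norm before combining them. The sketch below does that with plain Python lists. `rescale_to_norm`, the choice of the smallest norm as the target, and the example vectors are assumptions made for illustration, not a prescribed recipe.

```python
import math

TaskVec = dict[str, list[float]]

def l2_norm(tau: TaskVec) -> float:
    """Global L2 norm over every entry of every tensor in the task vector."""
    return math.sqrt(sum(x * x for values in tau.values() for x in values))

def rescale_to_norm(tau: TaskVec, target_norm: float) -> TaskVec:
    """Scale a task vector so that its global L2 norm equals target_norm."""
    norm = l2_norm(tau)
    if norm == 0.0:
        return {k: list(v) for k, v in tau.items()}   # nothing to rescale
    factor = target_norm / norm
    return {k: [factor * x for x in v] for k, v in tau.items()}

# Made-up task vectors: tau_long comes from a much longer fine-tuning run.
tau_long = {"w": [3.0, -4.0]}      # norm 5.0
tau_short = {"w": [0.3, 0.4]}      # norm 0.5

# Hypothetical choice: bring both vectors down to the smaller norm.
target = min(l2_norm(tau_long), l2_norm(tau_short))
tau_long_scaled = rescale_to_norm(tau_long, target)
tau_short_scaled = rescale_to_norm(tau_short, target)

print(f"norms before: {l2_norm(tau_long):.2f}, {l2_norm(tau_short):.2f}")
print(f"norms after:  {l2_norm(tau_long_scaled):.2f}, {l2_norm(tau_short_scaled):.2f}")
```

Rescaling changes how strongly each capability is applied, so the merged model still needs a lambda sweep on held-out data afterwards.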

Different architecture variants: Even if models are "both Llama-3-8B," if one uses a sliding window attention modification or a different RoPE scaling, their weights are not directly comparable at the layers that differ.
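
A cheap first check is to compare parameter names and shapes before attempting a merge. The sketch below works on shape tables rather than real tensors. `compatibility_report` and the example entries are made up for illustration. Matching shapes are necessary but not sufficient: differences such as RoPE scaling live in the model config, not in tensor shapes, so the configs need comparing too.

```python
Shapes = dict[str, tuple[int, ...]]

def compatibility_report(shapes_a: Shapes, shapes_b: Shapes) -> dict[str, list[str]]:
    """List parameter names that are missing on one side or differ in shape."""
    keys_a, keys_b = set(shapes_a), set(shapes_b)
    return {
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
        "shape_mismatch": sorted(
            k for k in keys_a & keys_b if shapes_a[k] != shapes_b[k]
        ),
    }

# Hypothetical shape tables; with real checkpoints these could be built as
# {name: tuple(tensor.shape) for name, tensor in state_dict.items()}.
model_a = {
    "embed.weight": (32000, 4096),
    "layers.0.attn.q.weight": (4096, 4096),
}
model_b = {
    "embed.weight": (32064, 4096),           # extended vocabulary
    "layers.0.attn.q.weight": (4096, 4096),
    "layers.0.attn.bias": (4096,),           # extra parameter
}

report = compatibility_report(model_a, model_b)
mergeable = not any(report.values())
print(report)
print("directly mergeable:", mergeable)
```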

Production Engineering Notes

:::tip Use safetensors format for merging
The safetensors format from Hugging Face typically loads large tensors much faster than the pickle-based pytorch_model.bin format. When merging large models (13B+), the I/O time dominates. Convert models to safetensors before merging.
:::

:::note CPU merging is feasible for models up to 13B
You don't need a GPU to merge models - you just need RAM. A 7B model in BF16 requires ~14GB of RAM. Merging two 7B models requires about 42GB peak (base + two models loaded simultaneously, if you load all at once). MergeKit (Lesson 06) implements lazy layer-by-layer loading that keeps peak RAM near 2× model size.
:::
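
The RAM figures follow from parameter count times bytes per parameter. The helper below reproduces that arithmetic. The function names are illustrative. It counts weights only, in decimal gigabytes, and ignores the output buffer and any float32 upcasting during the merge, both of which add to the real peak.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}

def model_ram_gb(n_params: float, dtype: str = "bf16") -> float:
    """Approximate RAM for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

def naive_merge_peak_gb(n_params: float, n_models: int, dtype: str = "bf16",
                        include_base: bool = True) -> float:
    """Peak RAM if every model (and optionally the base) is fully loaded at once."""
    copies = n_models + (1 if include_base else 0)
    return copies * model_ram_gb(n_params, dtype)

one_7b = model_ram_gb(7e9)
peak_two_7b = naive_merge_peak_gb(7e9, n_models=2)

print(f"one 7B model in BF16:        {one_7b:.0f} GB")
print(f"base + two 7B models loaded: {peak_two_7b:.0f} GB")
```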

:::tip Layer-wise alpha variation
Not all layers should use the same interpolation ratio. Embedding and LM head layers (the "boundary" layers) are more sensitive to changes than middle transformer layers. Consider sweeping alpha separately per layer group: embeddings, early/middle/late attention+MLP, LM head.
:::
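
A per-group ratio can be looked up from the parameter name. The sketch below assumes Hugging Face Llama-style names (`model.embed_tokens`, `model.layers.<i>.`, `lm_head`). The group boundaries at thirds of the depth and the alpha values are made-up starting points, not recommendations.

```python
import re

def alpha_for_key(key: str, num_layers: int, group_alphas: dict[str, float]) -> float:
    """Pick an interpolation ratio from the parameter name.

    Assumes Llama-style names; other architectures need other rules.
    """
    if "embed_tokens" in key:
        return group_alphas["embeddings"]
    if key.startswith("lm_head"):
        return group_alphas["lm_head"]
    match = re.search(r"\.layers\.(\d+)\.", key)
    if match is None:
        return group_alphas["default"]       # e.g. the final norm
    position = int(match.group(1)) / num_layers
    if position < 1 / 3:
        return group_alphas["early"]
    if position < 2 / 3:
        return group_alphas["middle"]
    return group_alphas["late"]

# Made-up starting values; each group would be swept on a held-out set.
GROUP_ALPHAS = {
    "embeddings": 0.2, "early": 0.4, "middle": 0.5,
    "late": 0.6, "lm_head": 0.2, "default": 0.5,
}

keys = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.16.mlp.up_proj.weight",
    "model.layers.31.mlp.down_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]
alphas = {k: alpha_for_key(k, num_layers=32, group_alphas=GROUP_ALPHAS) for k in keys}
for name, alpha in alphas.items():
    print(f"{alpha:.1f}  {name}")
```

Inside an interpolation loop, the single `alpha` argument would be replaced by a call to `alpha_for_key(key, ...)` for each tensor.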

Common Mistakes

:::danger Don't use alpha=0.5 as a default without sweeping
The optimal blend ratio for a specific task pair is almost never exactly 0.5. Always sweep alpha over [0.3, 0.4, 0.5, 0.6, 0.7] on a held-out set. The difference between alpha=0.5 and alpha=0.65 can be 2-3 points on a benchmark.
:::

:::warning Don't merge without a held-out evaluation set
Model merging requires iterative experimentation - you need to evaluate every candidate configuration. If you don't have a held-out evaluation set that covers both tasks you're merging, you're flying blind. Your validation set should sample from all task distributions you care about.
:::

:::warning Simple averaging degrades with more than 3-4 models
The more models you average, the more the individual fine-tuning signals get diluted. With four models, each contributes 25% of the average; adding a fifth drops every model's share to 20%. For merging more than 3-4 models, TIES and DARE (Lessons 03-04) are significantly more reliable.
:::

Interview Q&A

Q: Explain linear interpolation of model weights and why the midpoint model can outperform both endpoints.

A: Linear interpolation computes $\theta(\alpha) = (1-\alpha)\theta_A + \alpha\theta_B$ for some blend ratio $\alpha \in [0,1]$ . The midpoint model ( $\alpha=0.5$ ) can outperform both source models because of variance reduction: each fine-tuned model captures true signal plus noise from its specific training trajectory (random mini-batch ordering, hyperparameter quirks, etc.). When you average two models, the noises tend to cancel while the signals reinforce. The resulting model is in a more central, stable part of the loss basin - less over-fit to any specific training path's idiosyncrasies.

Q: What is task arithmetic and what makes it more powerful than simple weight averaging?

A: Task arithmetic, introduced by Ilharco et al. (2022), represents each fine-tuned model as a base model plus a task vector: $\tau = \theta_{finetuned} - \theta_{base}$ . This decomposition enables four operations: addition (apply a capability), composition (combine multiple task vectors), negation (remove a capability), and scaling (control capability strength). Simple weight averaging is a special case: when two models share the same base, averaging them is equivalent to applying each model's task vector with $\lambda=0.5$ . Task arithmetic is more powerful because it enables negation (impossible with positive-only averaging), fine-grained scaling of individual task contributions, and composition of more than two capabilities with independent weights. It does require that every task vector is computed against the same base model.
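
The special-case claim can be verified in one line. Assuming both models were fine-tuned from the same base, so that $\theta_A = \theta_{base} + \tau_A$ and $\theta_B = \theta_{base} + \tau_B$ :

$\frac{1}{2}(\theta_A + \theta_B) = \frac{1}{2}(\theta_{base} + \tau_A) + \frac{1}{2}(\theta_{base} + \tau_B) = \theta_{base} + \frac{1}{2}\tau_A + \frac{1}{2}\tau_B$

The same steps for $N$ models give $\theta_{soup} = \theta_{base} + \frac{1}{N}\sum_{i=1}^{N}\tau_i$ , so a uniform soup is task arithmetic with every $\lambda_i = 1/N$ .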

Q: When would you use greedy soup versus uniform soup?

A: Use greedy soup when you have models of varying quality - for example, from a hyperparameter sweep where some configurations are clearly better than others. Greedy soup guarantees the result is at least as good as the best individual model on the held-out set used for selection, because it only accepts models that do not lower that metric. The guarantee does not extend to other data. Use uniform soup when all models have similar quality and you want maximum diversity (e.g., averaging checkpoints from a single training run with slowly changing learning rate). Uniform soup is simpler and doesn't require evaluation during construction but has no quality guarantee.

Q: What is the lambda scaling factor in task arithmetic and how do you tune it?

A: Lambda ( $\lambda$ ) scales the task vector before applying it to the base model: $\theta_{new} = \theta_{base} + \lambda \cdot \tau$ . A larger $\lambda$ applies the capability more strongly but risks degrading base model performance. The optimal $\lambda$ varies by task pair and model - common sweet spots are 0.4–0.8. Tune it by sweeping over candidate values (e.g., 0.1, 0.2, ..., 1.0) on a held-out evaluation set that measures both the new capability and the capabilities you want to preserve.

Q: Why does task vector negation work, and what are its limitations?

A: Negation works because task vectors approximately encode separable capability representations in weight space. Subtracting a capability's task vector moves the model in the direction that reduces that capability's representation. Ilharco et al. demonstrated this on image classification with CLIP, where subtracting a task's vector sharply reduced accuracy on that task while largely preserving accuracy on a control task (ImageNet), and on text generation, where subtracting a toxicity task vector from GPT-2 reduced toxic generations with little change in perplexity. Limitations: negation is approximate - it reduces a capability but doesn't fully eliminate it, especially for capabilities that are deeply entangled with the model's general language understanding. For complete capability removal, you need more targeted approaches.
