Merging and Model Soup Techniques

Three Models Walk Into a Server Room

It started as a resource problem. A team at a mid-size fintech had three fine-tuned models: one for financial document summarization, one for SQL generation from natural language, and one for regulatory compliance question answering. All three were fine-tuned from the same Mistral-7B base. All three performed excellently at their own task. All three needed to be served 24/7.

Running three separate inference servers was too expensive. The team tried the obvious solution: prompt engineering on a single base model. Accuracy collapsed. The base model had no domain knowledge. Then they tried multi-task fine-tuning - training a single model on all three datasets simultaneously. Three weeks of iteration. The model learned a poor compromise: mediocre at summarization, acceptable at SQL, confused about compliance. Multi-task training is harder than it looks.

Then someone on the team read a paper from Ilharco et al. (2022) about "task vectors" and tried something strange: they took the three fine-tuned model checkpoints, computed the weight differences from the base model, added those differences together with appropriate scaling, and applied the sum back to the base model. The resulting merged model handled all three tasks at 94% of the quality of the specialized models. No additional training. No GPU time. Just arithmetic on weight tensors.

This is model merging - one of the most underrated techniques in applied LLM engineering. It works because neural network fine-tuning lives in a low-dimensional subspace of the parameter space. When two models are fine-tuned from the same base, the paths they take through weight space tend to be nearly orthogonal - they do not interfere much. Adding them together gives you a model that has traveled both paths simultaneously.

That claim deserves skepticism. And it does not always work - when tasks conflict, merging produces garbage. But understanding when it works, when it fails, and which merging algorithm to use for which situation is now a genuine production skill. The open-source community has produced a proliferation of merging methods in 2023-2024, and MergeKit has become the standard library. This lesson maps the whole landscape.

Why This Exists - The Cost of Specialization

The core tension in fine-tuning is that specialization costs something. A model fine-tuned deeply on code becomes worse at creative writing. A model fine-tuned on medical text loses some general reasoning ability. This is not always catastrophic - if you only need the model for one task, a little forgetting is acceptable. But in production, "one task" is rarely the reality.

Before model merging, teams had three options, all painful:

Option 1: Multiple specialized models. Run N inference servers for N tasks. Cost scales linearly. Routing logic required. Memory usage enormous. Works, but expensive.

Option 2: Multi-task training. Train one model on all tasks simultaneously. Requires careful dataset mixing ratios (get it wrong and one task dominates), longer training runs, and the final model is often worse on each individual task than a specialist. The negative transfer problem is real.

Option 3: Prompt engineering. Use one strong base model with task-specific prompts. Works for tasks within the base model's capability range, fails for specialized domains that require knowledge the base model never saw.

Model merging offers a fourth path: train specialists separately (easy, well-understood, cheap to iterate) and then combine them into a single model that retains most of each specialist's capability. No routing infrastructure needed. No multi-task training complexity. One model in production.

The theoretical foundation comes from two insights. First, fine-tuning from a common base tends to produce models in the same "loss basin" of the parameter space - they are nearby in weight space, connected by low-loss regions (Frankle et al., 2020; Entezari et al., 2021). Second, the weight changes caused by fine-tuning tend to be low-rank and task-specific. Combining them additively works better than you would expect from random weight space intuition.

Historical Context - From Soup to Surgery

The story of model merging began quietly in 2022 with two papers that arrived from different directions and converged on the same idea.

Model Soup (Wortsman et al., 2022) was the first to show that averaging the weights of multiple models fine-tuned from the same base checkpoint could improve accuracy and robustness. They fine-tuned CLIP variants with different hyperparameters and found that the simple average of the resulting weights outperformed any individual model on out-of-distribution benchmarks. The intuition: different hyperparameter runs explore different regions of the loss basin, and the average sits in a flatter, more robust region.

Task Vectors (Ilharco et al., 2022) was the conceptual breakthrough. They defined a "task vector" as $\tau = \theta_{ft} - \theta_{base}$ : the difference between fine-tuned weights and base weights. They showed this vector has semantic meaning - you can add it, subtract it, and combine it. Want a model that can do task A but not task B? Compute $\theta_{base} + \tau_A - \tau_B$ . Want a model that does both? $\theta_{base} + \tau_A + \tau_B$ (with appropriate scaling). This was the "aha moment" that made the community realize weight arithmetic was a principled operation, not a hack.

TIES-Merging (Yadav et al., 2023) addressed a critical failure mode: when task vectors conflict. If two fine-tuned models change the same weight in opposite directions (one increases it, one decreases it), simple addition produces poor results. TIES introduced trimming (discard small changes as noise), election (resolve sign conflicts by majority vote), and disjoint merging (only apply each task vector to the parameters it "won" the election for).

DARE (Yu et al., 2023) took a different approach to the conflict problem: randomly drop most of the task vector's weights before merging. The intuition is that fine-tuned models have many redundant weight changes, and randomly zeroing most of them (then rescaling to preserve the expected value) reduces interference without carefully analyzing which changes conflict.

SLERP (popularized for model merging in 2023) adapted the spherical linear interpolation technique from quaternion animation to model weight vectors, enabling smooth interpolation between two fine-tuned models that preserves the "norm" of the weight vector better than simple linear interpolation.

Frankenmerging (community-driven, 2023-2024) emerged from the open-source community: mixing layers from entirely different model checkpoints. Layer 0-16 from model A, layers 17-32 from model B. No theoretical justification - pure empirical experimentation. Some of the highest-rated models on the Open LLM Leaderboard in early 2024 were frankenmerges.

Core Concepts

LoRA Adapter Merging - The Simple Case

Before merging full models, let's understand the simpler case: merging a LoRA adapter back into the base model.

Recall that a LoRA adapter defines a weight update $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ . The modified forward pass is:

$h = (W + \Delta W)x = Wx + BAx$

Merging the adapter permanently changes the base weights:

$W_{merged} = W + \frac{\alpha}{r} BA$

where $\alpha$ is the LoRA scaling factor. After this merge, the model has no adapter - it has a new base model that behaves as if the adapter were always present. This is merge_and_unload() in PEFT.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model and LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
peft_model = PeftModel.from_pretrained(
    base_model,
    "./my-lora-adapter",
)

# Merge adapter weights permanently into base model
merged_model = peft_model.merge_and_unload()

# Save the merged model - now a standalone model with no adapter dependency
merged_model.save_pretrained("./mistral-7b-merged-task")

# Verify - the saved model loads without PEFT
standalone = AutoModelForCausalLM.from_pretrained("./mistral-7b-merged-task")

The merged model is identical in architecture to the base model. It loads without PEFT installed. It runs at the same speed as the base model (the low-rank computation is now baked into the weight matrices).

Combining Multiple LoRA Adapters - add_weighted_adapter

PEFT's add_weighted_adapter method allows combining multiple LoRA adapters without full merging. This is the LoRA-level equivalent of task vector arithmetic.

The combination is:

$\Delta W_{combined} = \sum_i w_i \Delta W_i = \sum_i w_i B_i A_i$

where $w_i$ are user-specified weights. This is linear interpolation in the space of weight updates.

from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load multiple adapters
model = PeftModel.from_pretrained(
    base_model,
    "./coding-lora",
    adapter_name="coding",
)
model.load_adapter("./instruction-lora", adapter_name="instruction")
model.load_adapter("./math-lora", adapter_name="math")

# Combine all three adapters with equal weight
model.add_weighted_adapter(
    adapters=["coding", "instruction", "math"],
    weights=[0.4, 0.4, 0.2],     # coding + instruction emphasis, light math
    adapter_name="combined",
    combination_type="linear",    # simple weighted sum
)

# Switch to the combined adapter
model.set_adapter("combined")

# Or use TIES combination (resolves sign conflicts)
model.add_weighted_adapter(
    adapters=["coding", "instruction", "math"],
    weights=[0.4, 0.4, 0.2],
    adapter_name="combined-ties",
    combination_type="ties",
    density=0.2,                  # keep top 20% of weights by magnitude
)

# Or use DARE combination (random dropping)
model.add_weighted_adapter(
    adapters=["coding", "instruction", "math"],
    weights=[1.0, 1.0, 0.5],
    adapter_name="combined-dare",
    combination_type="dare_linear",
    density=0.25,                 # keep 25% of weights, rescale by 1/density
)

The combination_type parameter is where the different merging algorithms live. "linear" is simple weighted sum - fast but susceptible to sign conflicts. "ties" and "dare_linear" are more robust for adapters trained on different tasks.

Task Vectors - Weight Space Arithmetic

The task vector framework (Ilharco et al., 2022) formalizes what add_weighted_adapter does intuitively. A task vector is:

$\tau_i = \theta_{ft,i} - \theta_{base}$

The merged model is:

$\theta_{merged} = \theta_{base} + \lambda \sum_i \tau_i$

where $\lambda$ is a global scaling factor (typically 0.3-0.7) that controls how much of the fine-tuning is applied.

import torch
from transformers import AutoModelForCausalLM

def load_weights(path: str) -> dict:
    """Load model state dict."""
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    return {k: v.clone() for k, v in model.state_dict().items()}

def task_vector_merge(
    base_path: str,
    finetuned_paths: list,
    weights: list,
    output_path: str,
    scaling_coefficient: float = 0.5,
):
    """
    Merge multiple fine-tuned models using task vector arithmetic.
    All models must have the same architecture and be fine-tuned
    from the same base checkpoint.
    """
    print("Loading base model weights...")
    base_weights = load_weights(base_path)

    print("Computing task vectors...")
    task_vectors = []
    for path in finetuned_paths:
        ft_weights = load_weights(path)
        vector = {k: ft_weights[k] - base_weights[k] for k in base_weights}
        task_vectors.append(vector)

    print("Merging task vectors...")
    merged_vector = {k: torch.zeros_like(v) for k, v in base_weights.items()}
    for tv, w in zip(task_vectors, weights):
        for k in merged_vector:
            merged_vector[k] += w * tv[k]

    print(f"Applying merged vector with scaling coefficient {scaling_coefficient}...")
    merged_weights = {
        k: base_weights[k] + scaling_coefficient * merged_vector[k]
        for k in base_weights
    }

    # Save merged model
    base_model = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float32)
    base_model.load_state_dict(merged_weights)
    base_model.save_pretrained(output_path)
    print(f"Merged model saved to {output_path}")

# Example usage
task_vector_merge(
    base_path="mistralai/Mistral-7B-v0.1",
    finetuned_paths=[
        "./mistral-coding-ft",
        "./mistral-instruction-ft",
    ],
    weights=[0.5, 0.5],
    output_path="./mistral-merged",
    scaling_coefficient=0.5,
)

SLERP - Spherical Linear Interpolation

Linear interpolation between two model weight vectors can produce suboptimal results when the vectors have very different magnitudes. SLERP interpolates along the surface of a sphere, preserving the vector norm throughout the interpolation path.

For two weight tensors $v_0$ and $v_1$ at interpolation parameter $t \in [0, 1]$ :

$\text{SLERP}(v_0, v_1, t) = \frac{\sin((1-t)\Omega)}{\sin(\Omega)} v_0 + \frac{\sin(t\Omega)}{\sin(\Omega)} v_1$

where $\Omega = \arccos\left(\frac{v_0 \cdot v_1}{|v_0| |v_1|}\right)$ is the angle between the vectors.

SLERP is only defined for two models - it cannot directly merge three or more. For multiple models, you need to nest SLERP calls: merge A and B first, then merge the result with C.

import torch
import numpy as np

def slerp(v0: torch.Tensor, v1: torch.Tensor, t: float) -> torch.Tensor:
    """
    Spherical linear interpolation between two tensors.
    t=0 returns v0, t=1 returns v1.
    """
    # Flatten for dot product computation
    v0_flat = v0.float().flatten()
    v1_flat = v1.float().flatten()

    # Compute cosine of angle between vectors
    cos_omega = torch.dot(v0_flat, v1_flat) / (
        torch.norm(v0_flat) * torch.norm(v1_flat) + 1e-8
    )
    cos_omega = torch.clamp(cos_omega, -1.0, 1.0)

    # If vectors are nearly parallel, fall back to linear interpolation
    if torch.abs(cos_omega) > 0.9995:
        return (1 - t) * v0 + t * v1

    omega = torch.acos(cos_omega)
    sin_omega = torch.sin(omega)

    coeff0 = torch.sin((1 - t) * omega) / sin_omega
    coeff1 = torch.sin(t * omega) / sin_omega

    result = coeff0 * v0.float() + coeff1 * v1.float()
    return result.to(v0.dtype)


def slerp_merge_models(
    model_a_path: str,
    model_b_path: str,
    t: float,
    output_path: str,
):
    """
    Merge two fine-tuned models using SLERP at interpolation point t.
    t=0.0 -> model A, t=1.0 -> model B, t=0.5 -> midpoint.
    """
    from transformers import AutoModelForCausalLM

    model_a = AutoModelForCausalLM.from_pretrained(model_a_path, torch_dtype=torch.float32)
    model_b = AutoModelForCausalLM.from_pretrained(model_b_path, torch_dtype=torch.float32)

    state_a = model_a.state_dict()
    state_b = model_b.state_dict()

    merged_state = {}
    for key in state_a:
        if state_a[key].shape != state_b[key].shape:
            raise ValueError(f"Shape mismatch at layer {key}")

        # Apply SLERP per-layer
        merged_state[key] = slerp(state_a[key], state_b[key], t)

    model_a.load_state_dict(merged_state)
    model_a.save_pretrained(output_path)
    print(f"SLERP merge (t={t}) saved to {output_path}")


# Merge a coding model and instruction model at t=0.5 (equal blend)
slerp_merge_models(
    model_a_path="./mistral-coding-ft",
    model_b_path="./mistral-instruction-ft",
    t=0.5,
    output_path="./mistral-slerp-merged",
)

SLERP vs linear interpolation: SLERP tends to produce better results when the two models have diverged significantly from the base (high magnitude task vectors). Linear interpolation "cuts through" the interior of the weight space sphere; SLERP stays on the surface where models of similar capability tend to cluster.

TIES-Merging - Resolving Interference

TIES (Trim, Elect Sign, Disjoint Merge) addresses the fundamental problem with simple task vector addition: when $\tau_A$ and $\tau_B$ change the same parameter in opposite directions, adding them cancels both changes and produces a model worse than either fine-tuned model.

The algorithm has three steps:

Step 1 - Trim: For each task vector, keep only the top- $k\%$ of parameters by absolute magnitude. Small changes are likely noise from the optimization process, not meaningful signal. Setting $k = 20$ means keeping only the 20% largest-magnitude weight changes per task vector.

Step 2 - Elect Sign: For each parameter $i$ , compute the "consensus sign" across all task vectors that have a non-zero value at position $i$ . The consensus sign is the sign with the greater total magnitude:

$\gamma_i = \text{sign}\left(\sum_{t} \tau_{t,i}\right)$

Step 3 - Disjoint Merge: Only include a task vector's contribution to parameter $i$ if that contribution agrees with the consensus sign $\gamma_i$ . Discard contributions that conflict:

$\tau_{merged,i} = \frac{\sum_{t: \text{sign}(\tau_{t,i}) = \gamma_i} \tau_{t,i}}{|\{t: \tau_{t,i} \neq 0\}|}$

The denominator normalizes by the number of non-zero contributors (not just agreeing ones), preventing scale explosion.

import torch

def ties_merge(
    base_weights: dict,
    finetuned_weights_list: list,
    density: float = 0.2,
    scaling_coefficient: float = 0.5,
) -> dict:
    """
    TIES-Merging: Trim, Elect Sign, Disjoint Merge.

    Args:
        base_weights: State dict of the base model
        finetuned_weights_list: List of fine-tuned model state dicts
        density: Fraction of top-magnitude weights to keep per task vector
        scaling_coefficient: Final scaling of the merged task vector
    """
    # Compute task vectors
    task_vectors = []
    for ft_weights in finetuned_weights_list:
        tv = {k: ft_weights[k].float() - base_weights[k].float()
              for k in base_weights}
        task_vectors.append(tv)

    # Step 1: Trim - keep only top density% of weights by magnitude
    trimmed_vectors = []
    for tv in task_vectors:
        trimmed = {}
        for k in tv:
            tensor = tv[k]
            if tensor.numel() == 0:
                trimmed[k] = tensor
                continue
            # Compute threshold for top density% by magnitude
            threshold = torch.quantile(
                tensor.abs().float(),
                1.0 - density
            )
            mask = tensor.abs() >= threshold
            trimmed[k] = tensor * mask.float()
        trimmed_vectors.append(trimmed)

    # Step 2 & 3: Elect sign and disjoint merge
    merged_vector = {}
    for k in base_weights:
        # Stack all task vectors for this parameter
        stacked = torch.stack([tv[k].float() for tv in trimmed_vectors], dim=0)

        # Elect sign: sum all contributions, take sign
        elected_sign = torch.sign(stacked.sum(dim=0))

        # Disjoint merge: count non-zero contributors
        nonzero_mask = (stacked != 0).float()
        num_nonzero = nonzero_mask.sum(dim=0).clamp(min=1)

        # Keep only contributions matching elected sign
        sign_match = (torch.sign(stacked) == elected_sign.unsqueeze(0)).float()
        merged = (stacked * sign_match).sum(dim=0) / num_nonzero

        merged_vector[k] = merged

    # Apply to base weights
    merged_weights = {
        k: base_weights[k].float() + scaling_coefficient * merged_vector[k]
        for k in base_weights
    }
    return merged_weights

DARE - Random Weight Dropping

DARE (Yu et al., 2023) takes a probabilistic approach to reducing interference. The observation: fine-tuned models have many small, redundant weight changes. If you randomly zero out most of the task vector (keeping fraction $p$ ) and rescale by $1/p$ to preserve the expected value, the remaining weights capture the essential information while reducing overlap with other task vectors.

$\tau_{DARE,i} = \begin{cases} \tau_i / p & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$

DARE with $p = 0.1$ keeps 10% of the task vector's weights and scales them up by 10x. This is inspired by dropout theory: a randomly pruned, rescaled network approximates the full network in expectation.

def dare_drop(
    task_vector: dict,
    density: float = 0.1,
    seed: int = 42,
) -> dict:
    """
    DARE: randomly drop weights from a task vector, rescale survivors.

    Args:
        task_vector: Dict of {layer_name: weight_delta_tensor}
        density: Fraction of weights to keep (e.g. 0.1 = keep 10%)
        seed: Random seed for reproducibility
    """
    torch.manual_seed(seed)
    dropped = {}
    for k, tensor in task_vector.items():
        if tensor.numel() == 0:
            dropped[k] = tensor
            continue
        # Bernoulli mask: 1 with probability density, 0 otherwise
        mask = torch.bernoulli(torch.full_like(tensor.float(), density))
        # Scale survivors by 1/density to preserve expected value
        dropped[k] = (tensor.float() * mask) / density
    return dropped


def dare_merge(
    base_weights: dict,
    finetuned_weights_list: list,
    weights: list,
    density: float = 0.1,
    scaling_coefficient: float = 1.0,
) -> dict:
    """
    DARE merge: apply DARE to each task vector, then average.
    """
    task_vectors = [
        {k: fw[k].float() - base_weights[k].float() for k in base_weights}
        for fw in finetuned_weights_list
    ]

    dare_vectors = [
        dare_drop(tv, density=density, seed=i)
        for i, tv in enumerate(task_vectors)
    ]

    # Weighted average of DARE-dropped vectors
    merged_vector = {k: torch.zeros_like(v.float()) for k, v in base_weights.items()}
    for dv, w in zip(dare_vectors, weights):
        for k in merged_vector:
            merged_vector[k] += w * dv[k]

    # Normalize by sum of weights
    weight_sum = sum(weights)
    merged_weights = {
        k: base_weights[k].float() + scaling_coefficient * merged_vector[k] / weight_sum
        for k in base_weights
    }
    return merged_weights

MergeKit - Production Model Merging

MergeKit (Goddard et al., 2024) is the production library for all of the above. It handles: loading large models in pieces (shard-by-shard merging to avoid OOM), YAML-driven configuration, support for SLERP, TIES, DARE, and linear merging, and direct upload to Hugging Face Hub.

# mergekit-config.yaml - TIES merge of coding and instruction models
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: ./mistral-coding-ft
    parameters:
      weight: 0.5
      density: 0.2    # keep top 20% by magnitude
  - model: ./mistral-instruction-ft
    parameters:
      weight: 0.5
      density: 0.2
parameters:
  normalize: true     # normalize merged task vector by number of contributors
  int8_mask: true     # use int8 for intermediate computations (memory saving)
dtype: float16

# Install MergeKit
pip install mergekit

# Run merge from YAML config
mergekit-yaml mergekit-config.yaml ./merged-model \
    --cuda \
    --allow-crimes \
    --out-shard-size 5B \
    --lazy-unpickle

# MergeKit also has a Python API for programmatic merging
from mergekit.config.main import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# SLERP config between two models
slerp_config = MergeConfiguration.model_validate({
    "merge_method": "slerp",
    "base_model": "mistralai/Mistral-7B-v0.1",
    "models": [
        {
            "model": "./mistral-coding-ft",
            "parameters": {"t": [
                {"filter": "self_attn", "value": 0.5},
                {"filter": "mlp", "value": 0.3},
                {"value": 0.5}    # default for all other layers
            ]}
        },
        {"model": "./mistral-instruction-ft"}
    ],
    "dtype": "float16",
})

run_merge(
    slerp_config,
    out_path="./slerp-merged",
    options=MergeOptions(
        cuda=True,
        copy_tokenizer=True,
        lazy_unpickle=True,    # load shards one at a time - saves RAM
        low_cpu_memory=True,
    ),
)

MergeKit's layer-wise parameters are its most powerful feature. You can specify different interpolation strengths for attention layers vs MLP layers vs embedding layers. This is how experienced merge practitioners tune: attention layers often benefit from more conservative blending (lower $t$ or higher density), while MLP layers can absorb more aggressive mixing.

Frankenmerges - Layer Grafting

Frankenmerging cuts models at the layer boundary and stitches them together. A 32-layer model split at layer 16: layers 0-15 from model A, layers 16-31 from model B.

# mergekit-config.yaml - frankenmerge
merge_method: passthrough
slices:
  - sources:
      - model: ./mistral-reasoning-ft
        layer_range: [0, 16]   # use first 16 layers from reasoning model
  - sources:
      - model: ./mistral-creative-ft
        layer_range: [16, 32]  # use last 16 layers from creative model
dtype: float16

# Python equivalent for a custom frankenmerge
import torch
from transformers import AutoModelForCausalLM, AutoConfig

def frankenmerge(
    model_a_path: str,
    model_b_path: str,
    split_layer: int,
    output_path: str,
):
    """
    Create a frankenmerge: layers 0..split_layer-1 from model_a,
    layers split_layer..N from model_b.
    Both models must have identical architectures.
    """
    model_a = AutoModelForCausalLM.from_pretrained(
        model_a_path, torch_dtype=torch.float16
    )
    model_b = AutoModelForCausalLM.from_pretrained(
        model_b_path, torch_dtype=torch.float16
    )

    state_a = model_a.state_dict()
    state_b = model_b.state_dict()

    merged_state = {}
    for key in state_a:
        # Parse layer number from key (e.g. "model.layers.15.self_attn.q_proj.weight")
        parts = key.split(".")
        layer_num = None
        for i, part in enumerate(parts):
            if part == "layers" and i + 1 < len(parts):
                try:
                    layer_num = int(parts[i + 1])
                except ValueError:
                    pass
                break

        if layer_num is not None and layer_num >= split_layer:
            merged_state[key] = state_b[key]
        else:
            merged_state[key] = state_a[key]

    model_a.load_state_dict(merged_state)
    model_a.save_pretrained(output_path)
    print(f"Frankenmerge saved to {output_path}")


frankenmerge(
    model_a_path="./mistral-reasoning-ft",
    model_b_path="./mistral-creative-ft",
    split_layer=16,
    output_path="./mistral-franken",
)

Frankenmerging works because transformer layers learn hierarchical representations. Early layers handle syntax, factual recall, and basic semantics. Later layers handle task-specific reasoning and generation style. Combining the early layers of a model strong in world knowledge with the late layers of a model strong in instruction following can produce a model that combines both strengths - if the two models were trained from the same base and their representations at the split point are compatible.

The risk: if the representations at the split point are too divergent (models trained with very different data distributions), the late layers of model B will receive activations they were never trained to handle, producing incoherent outputs. This is why frankenmerging works better with models fine-tuned from the same base checkpoint.

Architecture Diagram - Merging Methods Compared

When Merging Beats Ensembling

Ensembling runs multiple models and combines their output probabilities. It is the gold standard for accuracy but has high inference cost: $N$ models at inference time costs $N$ x compute.

Merging produces one model. Same inference cost as a single model. The quality comparison:

Scenario	Ensemble	Merge
Tasks are similar (same domain)	+3-5% vs single model	-1-3% vs ensemble
Tasks conflict (different domains)	+5-10% vs single model	-8-15% vs ensemble
Inference budget is fixed	Not feasible (Nx cost)	Feasible (1x cost)
Storage budget is fixed	Nx storage for models	1x storage
Latency requirement is strict	Fails (parallel needed)	Passes (single model)

The conclusion: merging wins when inference cost is the constraint. Ensembling wins when you have the compute budget and tasks are sufficiently different that a single merged model cannot represent both well.

There is a specific scenario where merging strictly dominates: when the individual task vectors are nearly orthogonal (tasks do not conflict). In this case, the merged model achieves near-ensemble quality at 1x inference cost. Ilharco et al. showed this empirically for image classification tasks: merging 8 CLIP classifiers produced a single model that matched ensemble performance on 7 of the 8 tasks.

Practical: Full Merging Pipeline

# Complete pipeline: fine-tune two LoRA adapters, merge them, evaluate
from peft import LoraConfig, get_peft_model, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from trl import SFTTrainer
from datasets import load_dataset
import torch

BASE_MODEL = "mistralai/Mistral-7B-v0.1"

def fine_tune_lora(
    dataset_name: str,
    output_dir: str,
    task_description: str,
):
    """Fine-tune a LoRA adapter on a specific dataset."""
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        load_in_4bit=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    tokenizer.pad_token = tokenizer.eos_token

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    dataset = load_dataset(dataset_name, split="train[:5000]")

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=50,
        save_strategy="epoch",
        report_to="none",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=1024,
    )
    trainer.train()
    model.save_pretrained(output_dir)
    print(f"{task_description} adapter saved to {output_dir}")


# Train two specialized adapters
fine_tune_lora("HuggingFaceH4/CodeAlpaca_20K", "./coding-adapter", "Coding")
fine_tune_lora("HuggingFaceH4/ultrachat_200k", "./instruction-adapter", "Instruction")

# Merge using add_weighted_adapter
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base_model, "./coding-adapter", adapter_name="coding")
merged.load_adapter("./instruction-adapter", adapter_name="instruction")

# Try three merging strategies
strategies = [
    ("linear", {"combination_type": "linear"}),
    ("ties", {"combination_type": "ties", "density": 0.2}),
    ("dare", {"combination_type": "dare_linear", "density": 0.15}),
]

for strategy_name, kwargs in strategies:
    merged.add_weighted_adapter(
        adapters=["coding", "instruction"],
        weights=[0.5, 0.5],
        adapter_name=f"combined_{strategy_name}",
        **kwargs,
    )
    merged.set_adapter(f"combined_{strategy_name}")

    # Quick evaluation
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    test_prompt = "Write a Python function to compute Fibonacci numbers:"
    inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = merged.generate(**inputs, max_new_tokens=100, temperature=0.1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"\n=== Strategy: {strategy_name} ===")
    print(response[:300])

Production Engineering Notes

Memory-Efficient Merging with MergeKit

The naive approach to merging loads all models into GPU memory simultaneously. For 7B models in float16, each model is ~14GB. Merging three 7B models would require 42GB just for loading, plus overhead.

MergeKit's lazy_unpickle and shard-by-shard processing allows merging models that do not fit in memory:

# Merge 70B models on a machine with 32GB RAM
# (no GPU required for merging - it is pure arithmetic)
mergekit-yaml ties-config.yaml ./merged-70b \
    --lazy-unpickle \           # load one shard at a time
    --allow-crimes \            # allow mixing model types
    --out-shard-size 5B \       # output shard size in parameters
    --copy-tokenizer            # copy tokenizer from base model

The --lazy-unpickle flag processes one attention or MLP block at a time, keeping peak memory proportional to one layer's weights rather than one full model.

Evaluating Merged Models

A merged model can degrade silently on tasks that were not tested during merging. Always evaluate against all source tasks:

import json
from lm_eval import evaluator

def evaluate_merged_model(model_path: str, tasks: list) -> dict:
    """Run lm-evaluation-harness on merged model."""
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path},dtype=float16",
        tasks=tasks,
        num_fewshot=0,
        batch_size="auto",
    )
    return results["results"]

# Evaluate all relevant tasks
results = evaluate_merged_model(
    "./mistral-merged",
    tasks=["humaneval", "mbpp", "hellaswag", "truthfulqa_mc1"]
)

for task, metrics in results.items():
    print(f"{task}: {metrics}")

# Check for regressions vs individual fine-tuned models
baseline_results = {
    "humaneval": 0.42,    # coding model baseline
    "hellaswag": 0.81,    # instruction model baseline
}

for task, baseline in baseline_results.items():
    merged_score = results[task].get("acc", results[task].get("pass@1", 0))
    delta = merged_score - baseline
    status = "OK" if delta > -0.03 else "REGRESSION"
    print(f"{task}: {merged_score:.3f} (baseline {baseline:.3f}, delta {delta:+.3f}) [{status}]")

Choosing the Scaling Coefficient

The scaling coefficient $\lambda$ in $\theta_{merged} = \theta_{base} + \lambda \sum_i \tau_i$ is the most important hyperparameter in task vector merging. Too high: the model diverges from base behavior, can hallucinate or lose coherence. Too low: the fine-tuning signal is diluted and the merged model behaves like the base.

Grid search is cheap - each evaluation takes minutes:

def sweep_scaling_coefficient(
    base_path: str,
    task_vectors: list,
    eval_tasks: list,
    coefficients: list = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
):
    """Sweep scaling coefficients and report quality for each."""
    best_coeff = None
    best_avg_score = -1

    for coeff in coefficients:
        # Build merged weights
        merged_weights = apply_task_vectors(base_path, task_vectors, coeff)
        save_temp_model(merged_weights, f"/tmp/merge_coeff_{coeff}")

        # Evaluate
        results = evaluate_merged_model(f"/tmp/merge_coeff_{coeff}", eval_tasks)
        avg_score = sum(v.get("acc", 0) for v in results.values()) / len(results)

        print(f"coeff={coeff:.1f}: avg_score={avg_score:.4f}")
        if avg_score > best_avg_score:
            best_avg_score = avg_score
            best_coeff = coeff

    print(f"\nBest coefficient: {best_coeff} (score={best_avg_score:.4f})")
    return best_coeff

Common Mistakes

:::danger Merging models from different base checkpoints Task vector arithmetic only works when all models share the same base checkpoint. If model A is fine-tuned from Mistral-7B-v0.1 and model B is fine-tuned from Mistral-7B-Instruct, the task vectors live in different coordinate systems. Adding them is meaningless - the resulting model will be incoherent. Always verify the base model SHA hash before merging. :::

:::danger Merging models with conflicting tokenizers If two models use different tokenizers (different vocabularies, different special tokens), the embedding layers will be misaligned. Even if the architecture is otherwise identical, the embedding and lm_head weights correspond to different tokens and cannot be merged. Always confirm tokenizer_config.json and vocab.json are identical between source models before merging. :::

:::warning Using density=1.0 for TIES or DARE Setting density to 1.0 in TIES means no trimming (keep all weight changes). This defeats the purpose of TIES - you are just doing a linear merge. The benefit of TIES comes from pruning the noise (small, likely random weight changes) before computing consensus signs. Use density=0.1 to 0.3 for TIES; higher values reduce the algorithm's effectiveness. :::

:::warning Merging then quantizing vs quantizing then merging Quantize the merged model, not the source models before merging. If you merge already-quantized models, quantization errors accumulate. The task vectors computed from quantized weights include quantization artifacts, which compound when summed. Merge in float16 or bfloat16, then apply post-merge quantization (bitsandbytes, GPTQ, AWQ) to the final merged model. :::

:::danger Skipping regression testing after merging A merged model can silently lose capability on the tasks you care about even when the merge looks successful on the primary metric. Always run a regression suite: if any task degrades by more than 3-5% relative to the best individual fine-tuned model, the merge parameters need adjustment (lower scaling coefficient, higher density for TIES, different combination method). Do not deploy a merged model that has been evaluated on only one task. :::

:::warning Frankenmerges with different training data distributions Frankenmerging works when the two source models have similar hidden state distributions at the split point. If model A was fine-tuned on medical text and model B on code, the layer activations at layer 16 will be in different statistical regimes. The late layers of model B will receive inputs that look nothing like what they were trained on. Frankenmerging is safest when both source models come from the same base and were fine-tuned on related data. :::

Interview Q&A

Q1: What is a task vector, and why does adding task vectors work to combine capabilities?

A task vector is the parameter-space difference between a fine-tuned model and its base model: $\tau = \theta_{ft} - \theta_{base}$ . It represents "everything the fine-tuning added" as a vector in the high-dimensional space of model weights.

Adding task vectors works because of a property called "intrinsic dimensionality" - fine-tuning changes a small fraction of the model's representational capacity in directions that are specific to the task. When two models are fine-tuned from the same base on different tasks, the directions of change tend to be nearly orthogonal in weight space (tasks encode in different parameter subspaces). Adding nearly orthogonal vectors gives a result that has components from both, analogous to how adding perpendicular vectors in 2D gives a diagonal vector that has components in both original directions.

The important caveats: this only works when tasks are sufficiently different (orthogonal in weight space), when both models come from the same base (same coordinate system), and when a scaling coefficient is applied to prevent the sum from overshooting the loss basin. When tasks are similar or conflicting (e.g., two models fine-tuned on different styles of the same task), task vectors interfere and merging degrades performance.

Q2: What specifically does TIES-Merging fix that simple task vector addition does not?

Simple task vector addition fails when two task vectors have opposite signs for the same parameter. Suppose parameter $w_{42}$ is increased by task A's fine-tuning (the value goes up) but decreased by task B's fine-tuning (it goes down). Adding the two task vectors at equal weight causes them to cancel, producing a merged model where $w_{42}$ barely changes from the base. But both task A and task B needed $w_{42}$ to change - just in different directions. The cancellation leaves both tasks underserved.

TIES resolves this with three steps. First, it trims small weight changes (likely noise, not signal), which reduces the chance of spurious interference. Second, it elects a consensus sign for each parameter by computing which direction has greater total magnitude across all task vectors. Third, it discards any task vector's contribution to a parameter where that contribution disagrees with the consensus sign.

The result: for parameter $w_{42}$ where task A says "increase" and task B says "decrease," TIES picks whichever direction has greater total magnitude and applies only that contribution. One task "wins" the parameter rather than both losing to cancellation. On benchmarks, TIES typically improves over linear merging by 2-5% absolute accuracy when merging models trained on diverse tasks.

Q3: You need to serve 100 task-specific variants of a fine-tuned model in production. What merging approach would you consider, and when would you reject it?

For 100 variants, the storage and serving architecture question dominates. There are two patterns:

Pattern 1 - Single merged model: merge the variants that share complementary capabilities into groups, produce a smaller set of merged models (say, 10 grouped models), serve those. This works when the task groups are non-conflicting and each merged model can handle its group adequately. Merging with TIES or DARE, sweep the scaling coefficient per group, regression test each merged model on all constituent tasks. This reduces 100 serving endpoints to 10.

Pattern 2 - LoRA hot-swapping: serve one base model, hot-swap LoRA adapters per request. With IA3 or small LoRA adapters (rank 4-8), adapter storage per task is small (1-64MB), loading is fast, and you get near-full task quality. This works for latency-tolerant workloads (adapter switch adds 50-200ms per context switch).

When to reject merging: if the 100 tasks are highly conflicting (document translation in 50 languages vs coding tasks), merging produces poor results because the task vectors interfere. In that case, the merged model will be mediocre at everything rather than excellent at specific tasks. Evaluation is the deciding factor: if a merged model degrades more than 5% on any constituent task, that merge is not deployable and the tasks must be served separately.

Q4: What is the difference between DARE and TIES, and when would you choose one over the other?

Both DARE and TIES are solutions to the weight interference problem in task vector merging, but they attack it differently.

TIES is deterministic. It analyzes which direction each parameter should change (sign election), eliminates conflict by keeping only the winning direction, and applies a structured pruning based on magnitude. The result is reproducible given the same input models and density hyperparameter.

DARE is stochastic. It randomly samples which parameters to include (keeping fraction $p$ ), scales survivors by $1/p$ , and relies on the law of large numbers: in expectation, enough non-interfering weight contributions survive to carry the task signal, while the probability that two interfering weights both survive is $p^2$ (much lower than $p$ ).

Choose TIES when: tasks are clearly different (different domains, different skills), you want a deterministic merge process, and you want interpretable control over which weights are included (by magnitude threshold).

Choose DARE when: tasks are related but trained with different data (same skill, different styles), you want to randomize away correlated noise that TIES magnitude thresholding would keep, or you want to run multiple DARE merges with different seeds and average them (stochastic ensemble-of-merges).

In practice, TIES is more commonly used for diverse task merging; DARE is more commonly used when blending similar-style models (e.g., two instruction-tuned models with different strengths).

Q5: How would you evaluate whether a merged model is actually better than running the specialized models separately?

The evaluation needs to cover three dimensions.

First, task quality: run each merged model on the benchmark suite for every constituent task. Compare against the individual fine-tuned models. The merged model should be within 3-5% of each specialist. If any task degrades more than 5%, the merge is not acceptable for that task - reconsider the merging parameters or drop that task from the merge.

Second, cross-task coherence: test prompts that combine capabilities (e.g., "explain this code in simple language" for a coding + instruction merge). A good merge handles these gracefully. A bad merge produces confused output that shows the task representations are fighting each other.

Third, cost-benefit analysis: compare inference cost of the merged model vs an ensemble (Nx cost) and vs a routed system (1x cost + routing overhead). The merged model wins if: (1) quality loss vs specialist is acceptable for the use case, (2) inference cost of running multiple models is genuinely prohibitive, and (3) routing complexity (maintaining a classifier, handling cold starts, managing multiple model versions) would be significant.

A merged model that is 3% worse than specialists but runs at 1/3 the cost is often the correct production choice. A merged model that is 15% worse is not - route to specialists instead.

Q6: Walk through the merge_and_unload operation mathematically. What exactly changes in the model weights?

merge_and_unload permanently incorporates the LoRA adapter into the base model weights. For each modified weight matrix $W_{base} \in \mathbb{R}^{d_{out} \times d_{in}}$ with LoRA matrices $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$ and scaling factor $\alpha$ :

During forward pass with adapter: $h = W_{base}x + \frac{\alpha}{r}BAx = (W_{base} + \frac{\alpha}{r}BA)x$

After merge: $W_{merged} = W_{base} + \frac{\alpha}{r}BA$

The LoRA matrices are multiplied together ( $BA$ is a $d_{out} \times d_{in}$ matrix), scaled by $\alpha/r$ , and added element-wise to $W_{base}$ . The LoRA $A$ and $B$ matrices are then deleted.

The resulting $W_{merged}$ is exactly the same rank as $W_{base}$ (full rank) - the rank- $r$ update has been absorbed into the full matrix. The model is now identical in structure to the base model: same weight shapes, no adapter modules, no PEFT wrappers. It loads without PEFT installed and runs at exactly the same speed as the base model because there is no extra computation - the same matrix multiply that was $W_{base}x$ is now $W_{merged}x$ .

The only irreversible consequence: you cannot extract the adapter back. merge_and_unload is a one-way operation. Always keep the separate adapter files if you might need to modify or share the adapter independently.

Three Models Walk Into a Server Room​

Why This Exists - The Cost of Specialization​

Historical Context - From Soup to Surgery​

Core Concepts​

LoRA Adapter Merging - The Simple Case​

Combining Multiple LoRA Adapters - add_weighted_adapter​

Task Vectors - Weight Space Arithmetic​

SLERP - Spherical Linear Interpolation​

TIES-Merging - Resolving Interference​

DARE - Random Weight Dropping​

MergeKit - Production Model Merging​

Frankenmerges - Layer Grafting​

Architecture Diagram - Merging Methods Compared​

When Merging Beats Ensembling​

Practical: Full Merging Pipeline​

Production Engineering Notes​

Memory-Efficient Merging with MergeKit​

Evaluating Merged Models​

Choosing the Scaling Coefficient​

Common Mistakes​

Interview Q&A​