How DARE randomly drops delta weights and rescales the remainder to dramatically reduce interference when merging multiple fine-tuned models.

DARE - Delta Weight Sparsification

A Surprisingly Destructive Experiment

Imagine taking a fine-tuned language model - something you've spent days and significant compute training - and randomly zeroing out 90% of its weight updates. Setting nine out of every ten changed parameters back to what the base model had. You'd expect the model to be ruined.

Le Yu and colleagues at Alibaba Group tried exactly this. They took Llama-2-based models fine-tuned on various tasks and randomly dropped 90% of the delta weights (the difference between the fine-tuned and base weights). Then they measured performance.

The result, published in 2023 as "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch," was astonishing. With 90% of the delta weights zeroed out, performance on the target task dropped by only 1-3%. The fine-tuned capability was almost entirely preserved.

This extreme redundancy in fine-tuned weight updates is the insight that DARE exploits. If most delta weights are redundant, then multiple models merged together have their redundant deltas interfering with each other - and you can safely prune most of them away before merging, dramatically reducing that interference. DARE is the mechanism that makes this pruning statistically unbiased.

Why Most Delta Weights Are Redundant

The Superposition Hypothesis Applied to Fine-Tuning

Modern neural networks are massively over-parameterized. A 7-billion-parameter model fine-tuned on a 50K-example instruction dataset has roughly 140,000× more parameters than training examples. The model has far more capacity than needed to memorize the fine-tuning task.

This over-parameterization means that the fine-tuning signal can be distributed across far fewer parameters than actually changed. Fine-tuning updates many parameters because gradient descent touches all of them - but most of those touches are tiny adjustments that collectively contribute a small, diffuse signal. The actual "capability" learned during fine-tuning is encoded in a relatively concentrated subset of high-magnitude updates.

Empirical Evidence: The Dropping Experiment

Yu et al. verified this by measuring performance versus dropout rate across multiple tasks and model sizes:

Drop Rate	MMLU Accuracy (Llama-2-7B-chat)
0% (no dropping)	48.3%
50%	48.1%
70%	47.9%
90%	47.2%
95%	45.8%
99%	38.4%

The model retains nearly full capability with 90% of delta weights dropped. Only at 95%+ does significant degradation occur. This is a remarkable finding that fundamentally changes how you should think about what fine-tuning actually does.

The DARE Algorithm

DARE is simple in concept: randomly drop delta weights with probability $p$, then rescale the remaining weights to preserve the expected update magnitude.

The Problem With Naive Dropping

If you simply zero out $p$ fraction of delta weights without any correction, you change the expected magnitude of the updates. Before dropping, each parameter is updated by its full delta. After dropping with probability $p$, the expected update per parameter is $(1-p) \cdot \delta$ - you've scaled down the entire task vector by $(1-p)$.

For fine-tuning, this matters: the magnitude of the delta weights is what pushes the model's behavior away from the base. With $p = 0.5$, naive dropping leaves half of the parameters fully updated and half unchanged, so in expectation the task vector has only half its original strength, pulling the model back toward the base rather than preserving the fine-tuned behavior.
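A quick numerical check makes the shrinkage concrete. This is a minimal sketch on a synthetic delta tensor; the tensor size, the constant value 0.01, and the seed are illustrative assumptions, not values from the paper.

```python
import torch

torch.manual_seed(0)

p = 0.9                                   # drop probability
delta = torch.full((1_000_000,), 0.01)    # synthetic task vector: every delta is 0.01

# Naive dropping: keep each entry with probability (1 - p), no rescaling
keep_mask = torch.rand_like(delta) >= p
naive = delta * keep_mask

kept_fraction = keep_mask.float().mean().item()
mean_ratio = (naive.mean() / delta.mean()).item()
print(f"kept fraction: {kept_fraction:.3f}")
print(f"mean update after naive drop / original mean: {mean_ratio:.3f}")
```

Both printed values should land close to $1-p$: the average update shrinks by the same factor as the fraction of entries kept.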

The Rescaling Fix

DARE addresses this with a simple correction: after dropping, multiply the remaining (non-dropped) delta weights by $\frac{1}{1-p}$.

$\hat{\tau}[i] = \begin{cases} \frac{\tau[i]}{1-p} & \text{with probability } (1-p) \text{ (keep)} \\ 0 & \text{with probability } p \text{ (drop)} \end{cases}$

The expected value of $\hat{\tau}[i]$ is:

$\mathbb{E}[\hat{\tau}[i]] = (1-p) \cdot \frac{\tau[i]}{1-p} + p \cdot 0 = \tau[i]$

The rescaling makes the operation unbiased: in expectation, the DARE-processed task vector is identical to the original. The randomness introduces variance, but the expectation is preserved.
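The size of that variance follows from the same two-outcome distribution. As a short worked derivation (same symbols $\tau[i]$ and $p$ as above; the only added assumption is that each entry's keep/drop decision is an independent Bernoulli draw):

$\mathrm{Var}[\hat{\tau}[i]] = \mathbb{E}[\hat{\tau}[i]^2] - \mathbb{E}[\hat{\tau}[i]]^2 = (1-p) \cdot \frac{\tau[i]^2}{(1-p)^2} - \tau[i]^2 = \frac{p}{1-p} \cdot \tau[i]^2$

At $p = 0.9$ the per-parameter variance is $9\,\tau[i]^2$ (a standard deviation of $3\,|\tau[i]|$), and at $p = 0.99$ it grows to $99\,\tau[i]^2$. The estimate stays unbiased for any $p < 1$, but the noise grows quickly as $p$ approaches 1.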

This is the same principle as dropout in neural network training (Srivastava et al. 2014). Dropout randomly zeros activations during training, and the "inverted" form used by most frameworks scales up the surviving ones by $\frac{1}{1-p}$ so that expected activation magnitudes are preserved. DARE applies the same idea to weight deltas rather than activations.
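The parallel can be checked directly against PyTorch's own dropout. The sketch below applies `torch.nn.functional.dropout` in training mode to a synthetic delta tensor (its size and scale are illustrative assumptions) and confirms that every survivor equals the original value divided by $1-p$. It only illustrates the shared mechanism; a DARE implementation does not have to call this function.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

p = 0.9
delta = torch.randn(100_000) * 0.01   # synthetic delta weights (illustrative)

# In training mode, F.dropout zeros entries with probability p and scales
# the survivors by 1 / (1 - p): the same two steps DARE applies to deltas.
dropped = F.dropout(delta, p=p, training=True)

survivors = dropped != 0
kept_frac = survivors.float().mean().item()
max_err = (dropped[survivors] - delta[survivors] / (1 - p)).abs().max().item()
print(f"fraction kept: {kept_frac:.3f}")
print(f"max |survivor - delta/(1-p)|: {max_err:.2e}")
```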

Full DARE Implementation

import torch
from safetensors.torch import load_file, save_file


def dare_drop(
    task_vector: dict[str, torch.Tensor],
    drop_rate: float = 0.9,
    seed: int | None = None,
    rescale: bool = True,
) -> dict[str, torch.Tensor]:
    """
    DARE: Drop And REscale delta weights.

    Randomly zeros out `drop_rate` fraction of delta weights,
    then rescales the remainder by 1/(1-drop_rate) to preserve expectation.

    Parameters
    ----------
    task_vector : dict of parameter name -> delta tensor (fine-tuned - base)
    drop_rate   : probability of zeroing each delta weight (default: 0.9)
    seed        : random seed for reproducibility
    rescale     : whether to rescale surviving weights by 1/(1-drop_rate)

    Returns
    -------
    Sparsified task vector with same keys as input.
    """
    # Validate with an exception rather than assert, which is skipped under `python -O`
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")

    # Seeded CPU generator for reproducible masks (None falls back to the global RNG)
    rng = None
    if seed is not None:
        rng = torch.Generator()
        rng.manual_seed(seed)

    result = {}
    total_params = 0
    kept_params = 0

    for key, delta in task_vector.items():
        # Generate random mask: True = keep, False = drop.
        # torch.bernoulli accepts generator=None, so one call covers both cases.
        keep_prob = torch.full_like(delta, 1 - drop_rate)
        mask = torch.bernoulli(keep_prob, generator=rng).bool()

        # Apply mask
        sparsified = delta * mask.float()

        # Rescale surviving weights to preserve expected value
        if rescale and (1 - drop_rate) > 0:
            sparsified = sparsified / (1 - drop_rate)

        result[key] = sparsified
        total_params += delta.numel()
        kept_params += mask.sum().item()

    actual_drop_rate = 1 - (kept_params / total_params)
    print(f"  DARE: dropped {actual_drop_rate*100:.1f}% of parameters (target: {drop_rate*100:.0f}%)")
    return result


def dare_merge(
    base_path: str,
    finetuned_paths: list[str],
    output_path: str,
    drop_rate: float = 0.9,
    lambda_coeff: float = 1.0,
    merge_method: str = "linear",
    dtype: torch.dtype = torch.bfloat16,
    seed: int | None = 42,
) -> None:
    """
    DARE preprocessing followed by model merging.

    DARE is typically used as a preprocessing step before linear averaging
    or TIES merging, not as a standalone merge method.

    Parameters
    ----------
    drop_rate     : fraction of delta weights to drop (default: 0.9)
    lambda_coeff  : scaling applied to merged task vector
    merge_method  : "linear" (simple average) or "ties" (use with TIES merging)
    """
    print(f"DARE Merge: {len(finetuned_paths)} models")
    print(f"  drop_rate={drop_rate:.0%}, lambda={lambda_coeff}, method={merge_method}")

    # Load base model
    base = {k: v.float() for k, v in load_file(base_path).items()}

    # Compute task vectors
    dare_vectors = []
    for i, ft_path in enumerate(finetuned_paths):
        ft = load_file(ft_path)
        raw_tv = {k: ft[k].float() - base[k] for k in base if k in ft}

        print(f"  Applying DARE to model {i+1}/{len(finetuned_paths)}...")
        # Offset the seed per model; `is not None` keeps seed=0 valid
        model_seed = seed + i if seed is not None else None
        dare_tv = dare_drop(raw_tv, drop_rate=drop_rate, seed=model_seed)
        dare_vectors.append(dare_tv)

    # Merge the DARE-processed task vectors
    if merge_method == "linear":
        # Simple average of all DARE-processed vectors
        merged_vector = {}
        n = len(dare_vectors)
        all_keys = set().union(*[set(tv.keys()) for tv in dare_vectors])
        for key in all_keys:
            total = sum(dv.get(key, torch.zeros_like(base[key])) for dv in dare_vectors)
            merged_vector[key] = total / n

    elif merge_method == "ties":
        # Use TIES after DARE preprocessing (this is the DARE+TIES combination)
        from .ties import ties_elect_sign, ties_disjoint_merge
        # DARE already handles the trimming aspect - no need for TIES trim step
        elected = ties_elect_sign(dare_vectors)
        merged_vector = ties_disjoint_merge(dare_vectors, elected)

    else:
        raise ValueError(f"Unknown merge_method: {merge_method}")

    # Apply to base model
    result = {}
    for key in base:
        delta = merged_vector.get(key, torch.zeros_like(base[key]))
        result[key] = (base[key] + lambda_coeff * delta).to(dtype)

    save_file(result, output_path)
    print(f"DARE merged model saved to {output_path}")

DARE + TIES - The Standard Recipe

DARE and TIES are complementary. DARE addresses the problem of redundant, interfering parameters. TIES addresses the problem of conflicting update directions. Used together - DARE first (to sparsify), then TIES (to resolve remaining conflicts) - they consistently outperform either method alone.

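The combined DARE+TIES pipeline runs in order: compute each model's task vector (fine-tuned minus base), apply DARE (drop and rescale), apply the TIES trim to the survivors, elect a sign per parameter, merge the sign-agreeing entries, and add the merged vector back to the base.

When using DARE before TIES, you can use a gentler TIES trim (a higher keep_ratio), because DARE has already sparsified the vectors. Assuming the trim keeps a fraction of the entries that survived DARE, a keep_ratio of 0.5 after a 90% drop leaves about 5% of the original entries nonzero, which is still sparser than keep_ratio=0.2 applied to the original dense vectors (20%).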
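The keep-ratio comparison above is plain arithmetic, sketched here as a small helper. The function name `effective_keep_fraction` is hypothetical, and it assumes the TIES trim ranks only the entries that survived DARE. A trim that ranks all entries, zeros included, would behave differently.

```python
def effective_keep_fraction(dare_drop_rate: float, ties_keep_ratio: float) -> float:
    """Fraction of original delta entries left nonzero after DARE, then a TIES trim.

    Assumes the trim keeps `ties_keep_ratio` of the DARE survivors, which is an
    assumption about the trim helper rather than something the lesson specifies.
    """
    return (1.0 - dare_drop_rate) * ties_keep_ratio


after_dare_then_ties = effective_keep_fraction(0.9, 0.5)   # DARE 90% drop, then keep 50%
ties_only = effective_keep_fraction(0.0, 0.2)              # no DARE, TIES keep 20%
print(f"DARE then TIES: {after_dare_then_ties:.3f}, TIES only: {ties_only:.3f}")
```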

def dare_ties_merge(
    base_path: str,
    finetuned_paths: list[str],
    output_path: str,
    dare_drop_rate: float = 0.9,
    ties_keep_ratio: float = 0.5,   # Less aggressive after DARE
    lambda_coeff: float = 1.0,
    dtype: torch.dtype = torch.bfloat16,
) -> None:
    """
    DARE + TIES combined merging.

    DARE first (reduce redundancy), then TIES (resolve sign conflicts).
    This is the recommended approach for merging 3+ diverse models.
    """
    base = {k: v.float() for k, v in load_file(base_path).items()}

    # 1. Compute and DARE-process all task vectors
    dare_vecs = []
    for i, path in enumerate(finetuned_paths):
        ft = load_file(path)
        raw = {k: ft[k].float() - base[k] for k in base if k in ft}
        dare_vec = dare_drop(raw, drop_rate=dare_drop_rate, seed=42 + i)
        dare_vecs.append(dare_vec)

    # 2. TIES on the DARE-processed vectors
    # Trim: remove low-magnitude survivors from DARE
    trimmed = []
    from .ties import ties_trim, ties_elect_sign, ties_disjoint_merge
    for dv in dare_vecs:
        trimmed.append(ties_trim(dv, keep_ratio=ties_keep_ratio))

    # Elect signs
    elected = ties_elect_sign(trimmed)

    # Disjoint merge
    merged_vector = ties_disjoint_merge(trimmed, elected)

    # Apply to base
    result = {
        key: (base[key] + lambda_coeff * merged_vector.get(key, torch.zeros_like(base[key]))).to(dtype)
        for key in base
    }
    save_file(result, output_path)
    print(f"DARE+TIES merged model saved to {output_path}")
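The merge functions above import `ties_trim`, `ties_elect_sign`, and `ties_disjoint_merge` from a `ties` module that is not shown in this lesson. The following is a minimal sketch of what those helpers might look like. It is an assumption about their behavior, following the usual TIES steps of trimming by magnitude, electing a sign per parameter, and averaging the agreeing entries. It is not the module's actual code, and it assumes every task vector has the same keys and shapes.

```python
import torch

TaskVector = dict[str, torch.Tensor]


def ties_trim(tv: TaskVector, keep_ratio: float = 0.2) -> TaskVector:
    """Keep the top `keep_ratio` fraction of nonzero entries by magnitude, per tensor."""
    out = {}
    for key, delta in tv.items():
        nonzero = delta[delta != 0].abs()
        k = int(nonzero.numel() * keep_ratio)
        if k == 0:
            out[key] = torch.zeros_like(delta)
            continue
        # The magnitude of the k-th largest nonzero entry is the cut-off
        threshold = torch.topk(nonzero, k).values[-1]
        out[key] = torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))
    return out


def ties_elect_sign(tvs: list[TaskVector]) -> TaskVector:
    """Per parameter, elect the sign carrying the larger total magnitude across models."""
    return {key: torch.sign(sum(tv[key] for tv in tvs)) for key in tvs[0]}


def ties_disjoint_merge(tvs: list[TaskVector], elected: TaskVector) -> TaskVector:
    """Average, per parameter, only the entries whose sign matches the elected sign."""
    merged = {}
    for key, sign in elected.items():
        stacked = torch.stack([tv[key] for tv in tvs])           # (n_models, *shape)
        agree = (torch.sign(stacked) == sign) & (stacked != 0)   # matching, nonzero entries
        total = (stacked * agree).sum(dim=0)
        count = agree.sum(dim=0).clamp(min=1)                    # avoid divide-by-zero
        merged[key] = total / count
    return merged


# Toy example: two single-tensor task vectors
a = {"w": torch.tensor([1.0, -2.0, 0.0, 4.0])}
b = {"w": torch.tensor([-3.0, -1.0, 0.0, 2.0])}
elected = ties_elect_sign([a, b])
merged = ties_disjoint_merge([a, b], elected)
print(merged["w"])
```

In the toy example, the first position has conflicting signs, so only the larger negative entry contributes. The positions where both models agree are averaged.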

Choosing the Drop Rate

The drop rate $p$ is the primary hyperparameter in DARE. The choice depends on several factors:

Drop Rate Guidelines

Model Size   Fine-tuning Steps   Recommended Drop Rate
──────────────────────────────────────────────────────
7B           500-2K              0.85-0.90
7B           2K-10K              0.90-0.95
13B          500-2K              0.85-0.90
70B          Any                 0.90-0.95
LoRA (any)   Any                 0.70-0.85   ← LoRA deltas already low-rank

LoRA-tuned models have a different delta structure: the updates are already low-rank, meaning they're concentrated in specific directions in parameter space. DARE is still useful for LoRA merges but the optimal drop rate is lower because LoRA deltas are already "compressed."
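The difference between "low-rank" and "sparse" is easy to see on a synthetic LoRA update. This sketch builds a delta as the product of two small random factors and reports its rank and its fraction of nonzero entries. The layer sizes, the rank `r`, and the factor scales are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

d_out, d_in, r = 256, 256, 8                      # illustrative sizes; r is the LoRA rank
B = torch.randn(d_out, r, dtype=torch.float64) * 0.01
A = torch.randn(r, d_in, dtype=torch.float64) * 0.01

delta = B @ A                                      # merged LoRA update, shape (d_out, d_in)

rank = torch.linalg.matrix_rank(delta).item()
nonzero_frac = (delta != 0).float().mean().item()
print(f"rank: {rank} of {min(d_out, d_in)}")
print(f"nonzero entries: {nonzero_frac:.1%}")
```

The update has rank `r`, yet essentially every entry is nonzero. An element-wise sparsity measure therefore reports a LoRA delta as dense even though it is heavily compressed.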

The Full Fine-tune vs LoRA Distinction

def detect_delta_sparsity(task_vector: dict) -> float:
    """
    Measure the natural sparsity of a task vector.
    LoRA-derived deltas will appear low-rank, not sparse.
    """
    all_vals = []
    for key, delta in task_vector.items():
        all_vals.append(delta.abs().flatten())

    all_vals = torch.cat(all_vals)

    # Fraction of deltas below 1% of max
    threshold = 0.01 * all_vals.max().item()
    near_zero_frac = (all_vals < threshold).float().mean().item()

    print(f"Natural sparsity: {near_zero_frac*100:.1f}% of parameters are near-zero")
    print(f"Max delta: {all_vals.max().item():.6f}")
    print(f"Mean delta: {all_vals.mean().item():.6f}")
    print(f"Std delta: {all_vals.std().item():.6f}")

    return near_zero_frac

# If natural sparsity > 70%, the model may already be sparse enough
# that DARE provides less benefit (or use a lower drop rate)

DARE vs TIES - When to Use Each

Situation	Recommended Method
2 diverse models, full fine-tune	DARE + linear average
2 models, same domain	Simple linear interpolation
3-5 models, mixed domains	DARE + TIES
6+ models	DARE + TIES (strongly recommended)
LoRA adapters	DARE (lower drop rate) + linear
Very large models (70B+)	DARE is critical (reduces RAM needs too)

The "Super Mario" Paper - What the Name Means

Yu et al. titled their paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch." The Super Mario analogy refers to the video game mechanic where Mario can absorb power-ups - in this case, the base model absorbs capabilities from homologous (same-base) fine-tuned models.

The "free lunch" framing was intentional and controversial: the paper claimed that DARE allows merging with essentially zero degradation to base capabilities. Subsequent work found this to be somewhat optimistic - there is interference, just less than without DARE. But the core insight (most delta weights are redundant; dropping them reduces merge interference) has held up well.

The paper's merging experiments covered:

Merging 3 models: WizardLM (instruction following), WizardMath (math reasoning), llama-2-13b-code-alpaca (code)
Base: Llama-2-13B
With DARE at 90% drop rate + simple averaging: retained 99.6% of each model's individual task performance

Production Engineering Notes

:::tip DARE reduces memory requirements during merging
One underappreciated benefit of DARE: sparse tensors use less memory. If you DARE-process task vectors to 90% sparsity before merging, you can potentially store them in sparse format, reducing the RAM needed for the merge operation. For 70B models where merging requires 200GB+ of RAM, this matters.
:::
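How much a sparse format saves depends on index overhead. As a rough sketch, the code below compares dense storage with PyTorch's sparse COO layout for a synthetic 2-D float32 tensor at about 90% sparsity. The tensor size and the helper names `coo_bytes` and `dense_bytes` are illustrative assumptions, and other sparse layouts would give different numbers.

```python
import torch

torch.manual_seed(0)


def dense_bytes(t: torch.Tensor) -> int:
    """Bytes used by the dense tensor's elements."""
    return t.numel() * t.element_size()


def coo_bytes(t: torch.Tensor) -> int:
    """Bytes used by a sparse COO copy of `t` (int64 indices plus values)."""
    s = t.to_sparse().coalesce()
    return (s.indices().numel() * s.indices().element_size()
            + s.values().numel() * s.values().element_size())


delta = torch.randn(1024, 1024)
keep = torch.rand(1024, 1024) >= 0.9       # keep roughly 10% of entries
sparse_delta = delta * keep

d, c = dense_bytes(sparse_delta), coo_bytes(sparse_delta)
print(f"dense: {d} B, COO: {c} B, ratio: {c / d:.2f}")
```

Each nonzero in a 2-D COO tensor costs two int64 indices (16 bytes) plus its value, so float32 storage roughly halves at 10% density. With 16-bit values the same layout costs 18 bytes per nonzero against 2 bytes per dense entry, so the saving at 10% density is only about 10%.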

:::note DARE is embarrassingly parallelizable
Each model's task vector is DARE-processed independently. You can distribute this processing across multiple machines or GPU workers. The bottleneck is I/O (loading large model checkpoints), not compute.
:::

:::tip Deterministic DARE with fixed seeds
Always fix the random seed when using DARE if you need reproducibility. Different seeds produce different sparsity patterns, which can lead to slightly different merged models. For production, fix the seed and version-control the merging configuration.
:::

Common Mistakes

:::danger Don't skip the rescaling step
Dropping delta weights without rescaling changes the expected magnitude of the merged task vector. Your merged model will behave as if a much weaker version of each fine-tune was applied. The rescaling factor $1/(1-p)$ is critical to preserving the intended capability strength. Never implement "DARE" by just dropping without rescaling.
:::

:::warning Don't use the same random seed for all models
If you apply the same random seed to all models' task vectors, the same parameter positions are dropped across all models. This defeats the purpose - you're not reducing interference by dropping the same positions in each model; you're just uniformly reducing all deltas. Use different seeds (e.g., 42, 43, 44...) for each model.
:::
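The effect of seed choice on overlap can be measured directly. This sketch draws keep-masks for two hypothetical models of the same shape and counts the positions that both models keep. The helper name `keep_mask` and the tensor shape are illustrative assumptions.

```python
import torch


def keep_mask(shape: tuple[int, ...], drop_rate: float, seed: int) -> torch.Tensor:
    """Bernoulli keep-mask drawn from a seeded CPU generator (True = keep)."""
    gen = torch.Generator().manual_seed(seed)
    return torch.rand(shape, generator=gen) >= drop_rate


shape, p = (1000, 1000), 0.9

same = keep_mask(shape, p, seed=42) & keep_mask(shape, p, seed=42)
diff = keep_mask(shape, p, seed=42) & keep_mask(shape, p, seed=43)

same_frac = same.float().mean().item()
diff_frac = diff.float().mean().item()
print(f"kept by both models, same seed:       {same_frac:.3f}")
print(f"kept by both models, different seeds: {diff_frac:.3f}")
```

With a shared seed, every surviving position collides with the other model's survivors, which is about $1-p$ of all positions. With independent seeds the collision rate falls to roughly $(1-p)^2$.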

:::danger Don't apply DARE to the embedding layer without care
The token embedding and LM head matrices behave differently from attention and MLP weights. They have very different delta distributions. Applying the same drop rate to embeddings as to transformer layers can degrade tokenization-sensitive behavior (multilingual models, coding models with special tokens). Consider applying a lower drop rate to embeddings.
:::
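One way to apply a lower rate to embeddings is to choose the drop rate from the parameter name. The helper below is a hypothetical sketch. The substrings match Hugging Face Llama-style checkpoint keys, other architectures name these tensors differently, and the 0.5 embedding rate is a placeholder rather than a recommended value.

```python
def drop_rate_for(key: str, default: float = 0.9, embedding_rate: float = 0.5) -> float:
    """Pick a DARE drop rate from a parameter name.

    The markers match keys such as "model.embed_tokens.weight" and
    "lm_head.weight"; treat them as assumptions about the checkpoint layout.
    """
    embedding_markers = ("embed_tokens", "lm_head")
    if any(marker in key for marker in embedding_markers):
        return embedding_rate
    return default


for name in ("model.embed_tokens.weight",
             "model.layers.0.self_attn.q_proj.weight",
             "lm_head.weight"):
    print(f"{name}: drop_rate={drop_rate_for(name)}")
```

A drop routine that takes a single rate for the whole task vector would need to be called once per group of keys, or extended to look up the rate per key.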

Interview Q&A

Q: What is DARE and why does it work?

A: DARE (Drop And REscale) randomly zeros out a large fraction - typically 90% - of delta weights (fine-tuned minus base weights) and then rescales the remaining weights by $1/(1-p)$ to preserve the expected magnitude. It works because delta weights from fine-tuning are extremely redundant: experiments show that dropping 90% of delta weights causes only 1-3% performance degradation on the fine-tuned task. This redundancy means that when merging multiple models, most of the interference comes from these redundant low-signal parameters. DARE removes most of them before merging, dramatically reducing inter-model interference.

Q: Why is rescaling essential in DARE? What happens without it?

A: Without rescaling, dropping $p$ fraction of delta weights scales down the expected task vector magnitude by $(1-p)$. For $p=0.9$, this means the merged task vector has only 10% of the intended magnitude - the model behaves as if fine-tuning was 10× weaker than it was. The rescaling factor $1/(1-p)$ compensates: since only $(1-p)$ fraction of weights survive, each surviving weight is scaled up by $1/(1-p)$, making the expected value of each parameter identical to the unmodified delta. This is mathematically identical to the inverted dropout trick in neural network training.

Q: How does DARE differ from TIES trimming?

A: Both methods sparsify delta weights, but differently. TIES trimming is deterministic: it removes the lowest-magnitude weights, keeping the top-$k\%$ by absolute value. The selection is based on magnitude. DARE is stochastic: it removes random parameters regardless of magnitude, then rescales. TIES trimming concentrates on removing noise (low-magnitude = likely noise). DARE's random dropping preserves the unbiasedness property - the expected value of the DARE-processed vector equals the original - but doesn't specifically target noise. In practice, DARE+TIES combines both: DARE reduces redundancy through stochastic dropping, then TIES resolves remaining sign conflicts.

Q: What is the optimal drop rate for DARE and how do you determine it?

A: There is no universal optimal drop rate, but 0.90 (dropping 90% of delta weights) is a good starting point for fully fine-tuned 7B models. The optimal rate depends on fine-tuning intensity (more steps → more concentrated signal → higher drop rate tolerated), model size (larger models → more redundancy → higher drop rate), and the number of models being merged (more models → more benefit from higher drop rate). For LoRA-tuned models, use 0.70-0.85 since LoRA updates are already more concentrated. Always verify on a held-out evaluation set, sweeping over [0.80, 0.85, 0.90, 0.95].
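The sweep described in this answer can be organized as a small loop. This is a sketch only. `sweep_drop_rates` is a hypothetical helper, and `merge_and_eval` stands for whatever routine merges at a given drop rate and returns a held-out score, where higher is better. The toy scoring function is there so the example runs and says nothing about real models.

```python
from typing import Callable


def sweep_drop_rates(
    merge_and_eval: Callable[[float], float],
    candidates: tuple[float, ...] = (0.80, 0.85, 0.90, 0.95),
) -> tuple[float, dict[float, float]]:
    """Score each candidate drop rate and return the best one with all scores."""
    scores = {p: merge_and_eval(p) for p in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def toy_score(p: float) -> float:
    """Stand-in for a real evaluation; peaks at p = 0.90 by construction."""
    return -(p - 0.90) ** 2


best, scores = sweep_drop_rates(toy_score)
print(f"best drop rate: {best}")
```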

Q: In what situations would you use DARE alone versus DARE+TIES?

A: DARE alone (followed by simple averaging) is sufficient when merging 2 models whose tasks are relatively similar, or when the main concern is reducing redundancy rather than resolving directional conflicts. DARE+TIES is recommended when merging 3 or more models with diverse capabilities (e.g., code + math + multilingual), when tasks have opposing optimization directions (e.g., safe generation vs creative writing), or when simple averaging produces noticeably degraded results. DARE+TIES consistently outperforms either method alone for 3+ diverse model merges.
