DARE - Delta Weight Sparsification
A Surprisingly Destructive Experiment
Imagine taking a fine-tuned language model - something you've spent days and significant compute training - and randomly zeroing out 90% of its weight updates. Setting nine out of every ten changed parameters back to what the base model had. You'd expect the model to be ruined.
Le Yu and colleagues at Alibaba Group tried exactly this. They took Llama-2-7B models fine-tuned on various tasks and randomly dropped 90% of the delta weights (the difference between fine-tuned and base). Then they measured performance.
The result, published in 2023 as "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch," was astonishing. With 90% of the delta weights zeroed out, performance on the target task dropped by only 1-3%. The fine-tuned capability was almost entirely preserved.
This extreme redundancy in fine-tuned weight updates is the insight that DARE exploits. If most delta weights are redundant, then multiple models merged together have their redundant deltas interfering with each other - and you can safely prune most of them away before merging, dramatically reducing that interference. DARE is the mechanism that makes this pruning statistically unbiased.
Why Most Delta Weights Are Redundant
The Superposition Hypothesis Applied to Fine-Tuning
Modern neural networks are massively over-parameterized. A 7-billion parameter model fine-tuned on a 50K instruction dataset has roughly 140,000× more parameters than training examples. The model has far more capacity than needed to memorize the fine-tuning task.
This over-parameterization means that the fine-tuning signal can be distributed across far fewer parameters than actually changed. Fine-tuning updates many parameters because gradient descent touches all of them - but most of those touches are tiny adjustments that collectively contribute a small, diffuse signal. The actual "capability" learned during fine-tuning is encoded in a relatively concentrated subset of high-magnitude updates.
Empirical Evidence: The Dropping Experiment
Yu et al. verified this by measuring performance versus dropout rate across multiple tasks and model sizes:
| Drop Rate | MMLU Accuracy (Llama-2-7B-chat) |
|---|---|
| 0% (no dropping) | 48.3% |
| 50% | 48.1% |
| 70% | 47.9% |
| 90% | 47.2% |
| 95% | 45.8% |
| 99% | 38.4% |
The model retains nearly full capability with 90% of delta weights dropped. Only at 95%+ does significant degradation occur. This is a remarkable finding that fundamentally changes how you should think about what fine-tuning actually does.
The DARE Algorithm
DARE is simple in concept: randomly drop delta weights with probability $p$, then rescale the remaining weights to preserve the expected update magnitude.
The Problem With Naive Dropping
If you simply zero out a fraction $p$ of delta weights without any correction, you change the expected magnitude of the updates. Before dropping, each parameter is updated by its full delta $\delta$. After dropping with probability $p$, the expected update per parameter is $(1-p)\,\delta$ - you've scaled down the entire task vector by a factor of $(1-p)$.
For fine-tuning, this matters: the magnitude of the delta weights is what pushes the model's behavior. Halving all delta weights produces a model halfway between the base and the fine-tuned model, not a model with 50% of parameters fully updated and 50% unchanged.
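The shrinkage described above is easy to check numerically. The following is a minimal sketch on a toy task vector (a constant delta of 0.002, chosen only for illustration), comparing drop-only against drop-then-rescale; the variable names are not from the paper.

```python
import torch

torch.manual_seed(0)
p = 0.9                                   # drop rate
delta = torch.full((100_000,), 0.002)     # toy task vector with a constant update

keep = torch.rand(delta.shape) >= p       # True = keep, each with probability 1 - p
naive = delta * keep                      # drop only, no correction
rescaled = naive / (1 - p)                # drop, then rescale the survivors

print(f"original mean update: {delta.mean().item():.6f}")
print(f"naive mean update:    {naive.mean().item():.6f}")     # close to (1 - p) * 0.002
print(f"rescaled mean update: {rescaled.mean().item():.6f}")  # close to 0.002
```

The naive version ends up with roughly a tenth of the original average update, while the rescaled version stays close to it.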
The Rescaling Fix
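DARE addresses this with a simple correction: after dropping, multiply the remaining (non-dropped) delta weights by $1/(1-p)$.
Let $m_i \sim \mathrm{Bernoulli}(1-p)$ be the keep mask for parameter $i$ and $\tilde{\delta}_i = m_i\,\delta_i / (1-p)$ the rescaled delta. The expected value of $\tilde{\delta}_i$ is:
$$
\mathbb{E}[\tilde{\delta}_i] = \frac{\mathbb{E}[m_i]}{1-p}\,\delta_i = \frac{1-p}{1-p}\,\delta_i = \delta_i
$$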
The rescaling makes the operation unbiased: in expectation, the DARE-processed task vector is identical to the original. The randomness introduces variance, but the expectation is preserved.
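The variance mentioned above can be written out explicitly. As a sketch, assume each delta $\delta_i$ is kept independently with probability $1-p$ (keep mask $m_i \sim \mathrm{Bernoulli}(1-p)$, notation used here for illustration) and the processed value is $\tilde{\delta}_i = m_i\,\delta_i/(1-p)$:

$$
\mathrm{Var}[\tilde{\delta}_i] = \frac{\delta_i^2}{(1-p)^2}\,\mathrm{Var}[m_i] = \frac{\delta_i^2\,p\,(1-p)}{(1-p)^2} = \frac{p}{1-p}\,\delta_i^2
$$

At $p = 0.9$ the per-parameter variance is $9\,\delta_i^2$; at $p = 0.99$ it is $99\,\delta_i^2$. The noise grows sharply as $p$ approaches 1, which is consistent with the degradation seen at very high drop rates. The rescaling itself only guarantees that the mean is preserved.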
This is exactly the same principle as dropout in neural network training (Srivastava et al. 2014), in the "inverted" form that most frameworks implement: activations are randomly zeroed during training and the remaining ones are scaled up by $1/(1-p)$ to preserve expected activation magnitudes. DARE applies the same idea to weight deltas rather than activations.
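Because the operation matches inverted dropout, PyTorch's functional dropout can stand in for a single-tensor DARE step. The snippet below is a small sketch on a toy delta tensor (values are illustrative) that checks the survivors are the original values scaled by 1/(1-p).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = 0.9
delta = torch.randn(1000) * 0.002          # toy delta tensor

# training=True keeps dropout active: zero with probability p, scale the rest by 1/(1-p).
dare_like = F.dropout(delta, p=p, training=True)

survivors = dare_like != 0
print(f"kept fraction: {survivors.float().mean().item():.3f}")
print(torch.allclose(dare_like[survivors], delta[survivors] / (1 - p)))
```

This shortcut gives no per-call generator for reproducible masks, which is one reason an explicit mask (as in the implementation in this section) may be preferable.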
Full DARE Implementation
import torch
from safetensors.torch import load_file, save_file


def dare_drop(
    task_vector: dict[str, torch.Tensor],
    drop_rate: float = 0.9,
    seed: int | None = None,
    rescale: bool = True,
) -> dict[str, torch.Tensor]:
    """
    DARE: Drop And REscale delta weights.

    Randomly zeros out `drop_rate` fraction of delta weights,
    then rescales the remainder by 1/(1-drop_rate) to preserve expectation.

    Parameters
    ----------
    task_vector : dict of parameter name -> delta tensor (fine-tuned - base)
    drop_rate : probability of zeroing each delta weight (default: 0.9)
    seed : random seed for reproducibility
    rescale : whether to rescale surviving weights by 1/(1-drop_rate)

    Returns
    -------
    Sparsified task vector with same keys as input.
    """
    assert 0.0 <= drop_rate < 1.0, "drop_rate must be in [0, 1)"

    # A dedicated CPU generator makes the masks reproducible; None falls back
    # to the global RNG. This assumes the delta tensors live on the CPU.
    rng = torch.Generator().manual_seed(seed) if seed is not None else None

    result = {}
    total_params = 0
    kept_params = 0
    for key, delta in task_vector.items():
        # Generate random mask: True = keep, False = drop
        keep_prob = torch.full_like(delta, 1 - drop_rate)
        mask = torch.bernoulli(keep_prob, generator=rng).bool()

        # Apply mask
        sparsified = delta * mask.float()

        # Rescale surviving weights to preserve expected value
        if rescale and (1 - drop_rate) > 0:
            sparsified = sparsified / (1 - drop_rate)

        result[key] = sparsified
        total_params += delta.numel()
        kept_params += mask.sum().item()

    actual_drop_rate = 1 - (kept_params / total_params)
    print(f" DARE: dropped {actual_drop_rate*100:.1f}% of parameters (target: {drop_rate*100:.0f}%)")
    return result
def dare_merge(
    base_path: str,
    finetuned_paths: list[str],
    output_path: str,
    drop_rate: float = 0.9,
    lambda_coeff: float = 1.0,
    merge_method: str = "linear",
    dtype: torch.dtype = torch.bfloat16,
    seed: int | None = 42,
) -> None:
    """
    DARE preprocessing followed by model merging.

    DARE is typically used as a preprocessing step before linear averaging
    or TIES merging, not as a standalone merge method.

    Parameters
    ----------
    drop_rate : fraction of delta weights to drop (default: 0.9)
    lambda_coeff : scaling applied to merged task vector
    merge_method : "linear" (simple average) or "ties" (use with TIES merging)
    """
    print(f"DARE Merge: {len(finetuned_paths)} models")
    print(f" drop_rate={drop_rate:.0%}, lambda={lambda_coeff}, method={merge_method}")

    # Load base model
    base = {k: v.float() for k, v in load_file(base_path).items()}

    # Compute task vectors
    dare_vectors = []
    for i, ft_path in enumerate(finetuned_paths):
        ft = load_file(ft_path)
        raw_tv = {k: ft[k].float() - base[k] for k in base if k in ft}
        print(f" Applying DARE to model {i+1}/{len(finetuned_paths)}...")
        # Give each model its own seed; `is not None` keeps seed=0 valid.
        model_seed = seed + i if seed is not None else None
        dare_tv = dare_drop(raw_tv, drop_rate=drop_rate, seed=model_seed)
        dare_vectors.append(dare_tv)

    # Merge the DARE-processed task vectors
    if merge_method == "linear":
        # Simple average of all DARE-processed vectors
        merged_vector = {}
        n = len(dare_vectors)
        all_keys = set().union(*[set(tv.keys()) for tv in dare_vectors])
        for key in all_keys:
            total = sum(dv.get(key, torch.zeros_like(base[key])) for dv in dare_vectors)
            merged_vector[key] = total / n
    elif merge_method == "ties":
        # Use TIES after DARE preprocessing (this is the DARE+TIES combination)
        from .ties import ties_elect_sign, ties_disjoint_merge

        # DARE already handles the trimming aspect - no need for TIES trim step
        elected = ties_elect_sign(dare_vectors)
        merged_vector = ties_disjoint_merge(dare_vectors, elected)
    else:
        raise ValueError(f"Unknown merge_method: {merge_method}")

    # Apply to base model
    result = {}
    for key in base:
        delta = merged_vector.get(key, torch.zeros_like(base[key]))
        result[key] = (base[key] + lambda_coeff * delta).to(dtype)

    save_file(result, output_path)
    print(f"DARE merged model saved to {output_path}")
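To try the linear DARE merge without checkpoint files, the same steps can be run on in-memory tensors. This is a minimal sketch under toy assumptions: `drop_and_rescale` is a hypothetical single-tensor helper, and the "fine-tunes" are just the base plus small random deltas.

```python
import torch

def drop_and_rescale(delta: torch.Tensor, p: float, seed: int) -> torch.Tensor:
    """Hypothetical helper: Bernoulli keep mask, then 1/(1-p) rescale."""
    gen = torch.Generator().manual_seed(seed)
    keep = torch.rand(delta.shape, generator=gen) >= p
    return delta * keep / (1 - p)

torch.manual_seed(0)
base = {"layer.weight": torch.randn(64, 64)}
# Two toy "fine-tunes" of the same base, each with a small dense delta.
finetuned = [
    {k: v + 0.002 * torch.randn_like(v) for k, v in base.items()} for _ in range(2)
]

p, lam = 0.9, 1.0
merged = {}
for key, w in base.items():
    sparse_deltas = [
        drop_and_rescale(ft[key] - w, p, seed=42 + i)   # distinct seed per model
        for i, ft in enumerate(finetuned)
    ]
    # Linear merge: average the sparsified deltas, scale, add back to the base.
    merged[key] = w + lam * torch.stack(sparse_deltas).mean(dim=0)

changed = (merged["layer.weight"] != base["layer.weight"]).float().mean().item()
print(f"fraction of weights that differ from base: {changed:.3f}")
```

With two models and independent masks, a position is touched when at least one model keeps it, so about 1 - p² (roughly 19% at p = 0.9) of the weights move away from the base.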
DARE + TIES - The Standard Recipe
DARE and TIES are complementary. DARE addresses the problem of redundant, interfering parameters. TIES addresses the problem of conflicting update directions. Used together - DARE first (to sparsify), then TIES (to resolve remaining conflicts) - they consistently outperform either method alone.
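The combined DARE+TIES pipeline: compute each model's task vector, apply DARE (drop and rescale), run the TIES steps on the result (trim, elect signs, disjoint merge), then add the scaled merged vector to the base weights.
When using DARE before TIES, the TIES trim can be gentler (a higher keep ratio) because DARE has already sparsified the vectors. Note that the two steps compound: assuming keep_ratio is applied to the entries that survived DARE, keep_ratio=0.5 after a 90% drop retains about 5% of the original entries (0.5 × 0.1), which is still more aggressive overall than keep_ratio=0.2 on the original dense vectors (20%).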
def dare_ties_merge(
    base_path: str,
    finetuned_paths: list[str],
    output_path: str,
    dare_drop_rate: float = 0.9,
    ties_keep_ratio: float = 0.5,  # Less aggressive after DARE
    lambda_coeff: float = 1.0,
    dtype: torch.dtype = torch.bfloat16,
) -> None:
    """
    DARE + TIES combined merging.

    DARE first (reduce redundancy), then TIES (resolve sign conflicts).
    This is the recommended approach for merging 3+ diverse models.
    """
    from .ties import ties_trim, ties_elect_sign, ties_disjoint_merge

    base = {k: v.float() for k, v in load_file(base_path).items()}

    # 1. Compute and DARE-process all task vectors
    dare_vecs = []
    for i, path in enumerate(finetuned_paths):
        ft = load_file(path)
        raw = {k: ft[k].float() - base[k] for k in base if k in ft}
        dare_vec = dare_drop(raw, drop_rate=dare_drop_rate, seed=42 + i)
        dare_vecs.append(dare_vec)

    # 2. TIES on the DARE-processed vectors
    # Trim: remove low-magnitude survivors from DARE
    trimmed = [ties_trim(dv, keep_ratio=ties_keep_ratio) for dv in dare_vecs]

    # Elect signs
    elected = ties_elect_sign(trimmed)

    # Disjoint merge
    merged_vector = ties_disjoint_merge(trimmed, elected)

    # Apply to base
    result = {
        key: (base[key] + lambda_coeff * merged_vector.get(key, torch.zeros_like(base[key]))).to(dtype)
        for key in base
    }
    save_file(result, output_path)
    print(f"DARE+TIES merged model saved to {output_path}")
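The helpers imported from `.ties` are defined elsewhere and are not shown in this section. As a rough sketch of what they might look like, the code below assumes the three standard TIES steps (trim by magnitude, elect one sign per parameter, average only the values that agree with it) and assumes the trim ranks only nonzero entries so that it composes with DARE's zeros. The function bodies are illustrative, not the actual module.

```python
import torch

TaskVector = dict[str, torch.Tensor]

def ties_trim(tv: TaskVector, keep_ratio: float) -> TaskVector:
    """Keep the top `keep_ratio` fraction of nonzero entries by magnitude."""
    out = {}
    for name, delta in tv.items():
        mags = delta.abs()
        nonzero = mags[mags > 0]
        k = int(keep_ratio * nonzero.numel())
        if k == 0:
            out[name] = torch.zeros_like(delta)
            continue
        threshold = torch.topk(nonzero, k).values[-1]   # k-th largest magnitude
        out[name] = torch.where(mags >= threshold, delta, torch.zeros_like(delta))
    return out

def ties_elect_sign(tvs: list[TaskVector]) -> TaskVector:
    """Per parameter, elect the sign carrying the larger total magnitude."""
    return {
        name: torch.sign(torch.stack([tv[name] for tv in tvs]).sum(dim=0))
        for name in tvs[0]
    }

def ties_disjoint_merge(tvs: list[TaskVector], elected: TaskVector) -> TaskVector:
    """Average only the nonzero values whose sign matches the elected sign."""
    merged = {}
    for name, sign in elected.items():
        stacked = torch.stack([tv[name] for tv in tvs])
        agree = (torch.sign(stacked) == sign.unsqueeze(0)) & (stacked != 0)
        count = agree.sum(dim=0).clamp(min=1)           # avoid division by zero
        merged[name] = (stacked * agree).sum(dim=0) / count
    return merged

a = {"w": torch.tensor([1.0, -2.0, 0.0, 3.0])}
b = {"w": torch.tensor([3.0, 1.0, 0.0, -1.0])}
elected = ties_elect_sign([a, b])
merged = ties_disjoint_merge([a, b], elected)
print(merged["w"])   # values: 2, -2, 0, 3
```

In the example, the second and fourth positions have conflicting signs, so only the value matching the elected sign contributes there.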
Choosing the Drop Rate
The drop rate is the primary hyperparameter in DARE. The choice depends on several factors:
Drop Rate Guidelines
| Model Size | Fine-tuning Steps | Recommended Drop Rate |
|---|---|---|
| 7B | 500-2K | 0.85-0.90 |
| 7B | 2K-10K | 0.90-0.95 |
| 13B | 500-2K | 0.85-0.90 |
| 70B | Any | 0.90-0.95 |
| LoRA (any) | Any | 0.70-0.85 (LoRA deltas are already low-rank) |
LoRA-tuned models have a different delta structure: the updates are already low-rank, meaning they're concentrated in specific directions in parameter space. DARE is still useful for LoRA merges but the optimal drop rate is lower because LoRA deltas are already "compressed."
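Since the guidelines give ranges rather than a single value, one way to pick a rate is a small sweep scored on held-out data. The sketch below assumes a caller-supplied `merge_and_score` callback (hypothetical) that builds a merged model at a given drop rate and returns an evaluation score where higher is better. The `fake_score` function is only a stand-in so the example runs and does not represent real results.

```python
from typing import Callable, Iterable

def sweep_drop_rates(
    rates: Iterable[float],
    merge_and_score: Callable[[float], float],
) -> tuple[float, dict[float, float]]:
    """Score each candidate drop rate and return the best one plus all scores."""
    scores = {rate: merge_and_score(rate) for rate in rates}
    best = max(scores, key=scores.get)
    return best, scores

def fake_score(rate: float) -> float:
    # Placeholder scorer for illustration only; replace with a real
    # merge-then-evaluate routine on a held-out set.
    return 1.0 - abs(rate - 0.90)

best, scores = sweep_drop_rates([0.80, 0.85, 0.90, 0.95], fake_score)
print(f"best drop rate: {best}")
```

Keeping the scorer as a callback leaves the choice of benchmark and merge method (linear or TIES) outside the sweep itself.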
The Full Fine-tune vs LoRA Distinction
def detect_delta_sparsity(task_vector: dict[str, torch.Tensor]) -> float:
    """
    Measure the natural sparsity of a task vector.

    LoRA-derived deltas will appear low-rank, not sparse.
    """
    all_vals = []
    for key, delta in task_vector.items():
        all_vals.append(delta.abs().flatten())
    all_vals = torch.cat(all_vals)

    # Fraction of deltas below 1% of max
    threshold = 0.01 * all_vals.max().item()
    near_zero_frac = (all_vals < threshold).float().mean().item()

    print(f"Natural sparsity: {near_zero_frac*100:.1f}% of parameters are near-zero")
    print(f"Max delta: {all_vals.max().item():.6f}")
    print(f"Mean delta: {all_vals.mean().item():.6f}")
    print(f"Std delta: {all_vals.std().item():.6f}")
    return near_zero_frac

# If natural sparsity > 70%, the model may already be sparse enough
# that DARE provides less benefit (or use a lower drop rate)
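The difference between "low-rank" and "sparse" is easy to see on a toy LoRA-style delta. The sketch below builds a rank-8 update as a product of two thin matrices (the names `B` and `A` and the sizes are illustrative) and reports both its rank and its near-zero fraction, using the same 1%-of-max threshold as above.

```python
import torch

torch.manual_seed(0)
d, r = 64, 8
# Toy LoRA-style delta: product of two thin matrices.
B = torch.randn(d, r, dtype=torch.float64)
A = torch.randn(r, d, dtype=torch.float64)
lora_delta = B @ A

rank = torch.linalg.matrix_rank(lora_delta).item()
threshold = 0.01 * lora_delta.abs().max()
near_zero = (lora_delta.abs() < threshold).double().mean().item()

print(f"rank: {rank} of {d}")
print(f"near-zero fraction: {near_zero:.3f}")
```

The matrix has rank 8 out of a possible 64, yet almost none of its entries are near zero, so a magnitude-based sparsity check alone would not flag it as compressed.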
DARE vs TIES - When to Use Each
| Situation | Recommended Method |
|---|---|
| 2 diverse models, full fine-tune | DARE + linear average |
| 2 models, same domain | Simple linear interpolation |
| 3-5 models, mixed domains | DARE + TIES |
| 5+ models | DARE + TIES (mandatory) |
| LoRA adapters | DARE (lower drop rate) + linear |
| Very large models (70B+) | DARE is critical (reduces RAM needs too) |
The "Super Mario" Paper - What the Name Means
Yu et al. titled their paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch." The Super Mario analogy refers to the video game mechanic where Mario can absorb power-ups - in this case, the base model absorbs capabilities from homologous (same-base) fine-tuned models.
The "free lunch" framing was intentional and controversial: the paper claimed that DARE allows merging with essentially zero degradation to base capabilities. Subsequent work found this to be somewhat optimistic - there is interference, just less than without DARE. But the core insight (most delta weights are redundant; dropping them reduces merge interference) has held up well.
The paper showed DARE working across:
- Merging 3 models: WizardLM (instruction), WizardMath (math reasoning), Code Llama (code)
- Base: Llama-2-7B
- With DARE at 90% drop rate + simple averaging: retained 99.6% of each model's individual task performance
Production Engineering Notes
:::tip DARE reduces memory requirements during merging
One underappreciated benefit of DARE: highly sparse task vectors can be stored compactly. A dense tensor that is 90% zeros takes the same memory as before, but once task vectors are DARE-processed to 90% sparsity you can potentially store them in a sparse format, reducing the RAM needed for the merge operation. For 70B models where merging requires 200GB+ of RAM, this matters.
:::
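How much a sparse format saves depends on the index overhead. The following sketch compares a dense float32 tensor against PyTorch's COO layout on a toy 1024×1024 delta at a 90% drop rate. The sizes are illustrative and the byte count covers only values and indices.

```python
import torch

torch.manual_seed(0)
p = 0.9
delta = torch.randn(1024, 1024) * 0.002
keep = torch.rand(delta.shape) >= p
dare_delta = delta * keep / (1 - p)                 # dense tensor, ~90% zeros

sparse_delta = dare_delta.to_sparse().coalesce()    # COO layout

dense_bytes = dare_delta.numel() * dare_delta.element_size()
values, indices = sparse_delta.values(), sparse_delta.indices()
coo_bytes = (
    values.numel() * values.element_size()
    + indices.numel() * indices.element_size()      # int64 row and column indices
)
print(f"dense: {dense_bytes / 1e6:.2f} MB, COO: {coo_bytes / 1e6:.2f} MB")
```

In this toy case the COO copy comes out at around half the dense size rather than a tenth, because each surviving float32 value carries two int64 indices. Compressed layouts or lower-precision indices would narrow that gap.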
:::note DARE is embarrassingly parallelizable
Each model's task vector is DARE-processed independently. You can distribute this processing across multiple machines or GPU workers. The bottleneck is I/O (loading large model checkpoints), not compute.
:::
:::tip Deterministic DARE with fixed seeds
Always fix the random seed when using DARE if you need reproducibility. Different seeds produce different sparsity patterns, which can lead to slightly different merged models. For production, fix the seed and version-control the merging configuration.
:::
Common Mistakes
:::danger Don't skip the rescaling step
Dropping delta weights without rescaling shrinks the expected magnitude of the merged task vector by a factor of $(1-p)$. Your merged model will behave as if a much weaker version of each fine-tune was applied. The rescaling factor $1/(1-p)$ is critical to preserving the intended capability strength. Never implement "DARE" by just dropping without rescaling.
:::
:::warning Don't use the same random seed for all models
If you apply the same random seed to all models' task vectors, the same parameter positions are dropped across all models. This defeats the purpose: every model keeps exactly the same positions, so the survivors still collide everywhere, and for linear averaging the result is the same as applying one DARE mask to the already-averaged task vector. Use different seeds (e.g., 42, 43, 44...) for each model.
:::
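The effect of seed choice on mask overlap can be checked directly. This is a small sketch with a hypothetical `keep_mask` helper that measures how many positions survive in both of two models' masks.

```python
import torch

def keep_mask(shape: tuple[int, ...], p: float, seed: int) -> torch.Tensor:
    """Hypothetical helper: True = keep, each position with probability 1 - p."""
    gen = torch.Generator().manual_seed(seed)
    return torch.rand(shape, generator=gen) >= p

p, shape = 0.9, (100_000,)
same_a, same_b = keep_mask(shape, p, seed=42), keep_mask(shape, p, seed=42)
diff_a, diff_b = keep_mask(shape, p, seed=42), keep_mask(shape, p, seed=43)

# Fraction of positions kept by BOTH models (where their deltas can collide).
overlap_same = (same_a & same_b).float().mean().item()
overlap_diff = (diff_a & diff_b).float().mean().item()
print(f"same seed:       {overlap_same:.3f}")
print(f"different seeds: {overlap_diff:.3f}")
```

With a shared seed the two masks are identical, so every surviving position collides (about 10% of all positions). With distinct seeds the expected overlap falls to about (1-p)², roughly 1%.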
:::danger Don't apply DARE to the embedding layer without care
The token embedding and LM head matrices behave differently from attention and MLP weights. They have very different delta distributions. Applying the same drop rate to embeddings as to transformer layers can degrade tokenization-sensitive behavior (multilingual models, coding models with special tokens). Consider applying a lower drop rate to embeddings.
:::
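One way to apply a lower rate to embeddings is to choose the drop rate per parameter name. The sketch below assumes Hugging Face Llama-style key names (`embed_tokens`, `lm_head`) and an illustrative embedding rate of 0.5. Both are assumptions to adapt to the actual checkpoint, not recommended values.

```python
def drop_rate_for(
    key: str,
    default_rate: float = 0.9,
    embedding_rate: float = 0.5,   # illustrative value, not a tuned setting
) -> float:
    """Pick a drop rate by parameter name (the name patterns are assumptions)."""
    embedding_markers = ("embed_tokens", "lm_head")
    if any(marker in key for marker in embedding_markers):
        return embedding_rate
    return default_rate

for name in (
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "lm_head.weight",
):
    print(f"{name}: drop_rate={drop_rate_for(name)}")
```

A per-key rate like this would be looked up inside the loop over task-vector entries, in place of a single global `drop_rate`.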
Interview Q&A
Q: What is DARE and why does it work?
A: DARE (Drop And REscale) randomly zeros out a large fraction - typically 90% - of delta weights (fine-tuned minus base weights) and then rescales the remaining weights by $1/(1-p)$, where $p$ is the drop rate, to preserve the expected magnitude. It works because delta weights from fine-tuning are extremely redundant: experiments show that dropping 90% of delta weights causes only 1-3% performance degradation on the fine-tuned task. This redundancy means that when merging multiple models, most of the interference comes from these redundant low-signal parameters. DARE removes most of them before merging, dramatically reducing inter-model interference.
Q: Why is rescaling essential in DARE? What happens without it?
A: Without rescaling, dropping a fraction $p$ of delta weights scales down the expected task vector magnitude by a factor of $(1-p)$. For $p = 0.9$, this means the merged task vector has only 10% of the intended magnitude - the model behaves as if fine-tuning was 10× weaker than it was. The rescaling factor $1/(1-p)$ compensates: since only a fraction $(1-p)$ of weights survive, each surviving weight is scaled up by $1/(1-p)$, making the expected value of each parameter identical to the unmodified delta. This is mathematically identical to the inverted dropout trick in neural network training.
Q: How does DARE differ from TIES trimming?
A: Both methods sparsify delta weights, but differently. TIES trimming is deterministic: it removes the lowest-magnitude weights, keeping the top-$k\%$ by absolute value. The selection is based on magnitude. DARE is stochastic: it removes random parameters regardless of magnitude, then rescales. TIES trimming concentrates on removing noise (low-magnitude = likely noise). DARE's random dropping preserves the unbiasedness property - the expected value of the DARE-processed vector equals the original - but doesn't specifically target noise. In practice, DARE+TIES combines both: DARE reduces redundancy through stochastic dropping, then TIES resolves remaining sign conflicts.
Q: What is the optimal drop rate for DARE and how do you determine it?
A: There is no universal optimal drop rate, but 0.90 (dropping 90% of delta weights) is a good starting point for fully fine-tuned 7B models. The optimal rate depends on fine-tuning intensity (more steps → more concentrated signal → higher drop rate tolerated), model size (larger models → more redundancy → higher drop rate), and the number of models being merged (more models → more benefit from higher drop rate). For LoRA-tuned models, use 0.70-0.85 since LoRA updates are already more concentrated. Always verify on a held-out evaluation set, sweeping over [0.80, 0.85, 0.90, 0.95].
Q: In what situations would you use DARE alone versus DARE+TIES?
A: DARE alone (followed by simple averaging) is sufficient when merging 2 models whose tasks are relatively similar, or when the main concern is reducing redundancy rather than resolving directional conflicts. DARE+TIES is recommended when merging 3 or more models with diverse capabilities (e.g., code + math + multilingual), when tasks have opposing optimization directions (e.g., safe generation vs creative writing), or when simple averaging produces noticeably degraded results. DARE+TIES consistently outperforms either method alone for 3+ diverse model merges.
