TIES Merging - Resolving Sign Conflicts
When Simple Averaging Fails Badly
Your team has three fine-tuned models you want to combine: one specialized for technical writing, one for code generation, and one for multilingual translation. Simple task arithmetic feels obvious - compute three task vectors, add them together, done.
You run the evaluation. Technical writing: degraded. Code generation: degraded. Translation: slightly better than the base but worse than the fine-tune. You've managed to produce a model that's worse than any individual fine-tune. What went wrong?
The answer is sign conflicts. When the technical writing model nudges a weight toward +0.3 and the code model nudges the same weight toward -0.2, naive averaging brings that parameter to +0.05. Neither capability gets what it wanted. The two models have actively interfered with each other, and the result is a model that's pulled in incompatible directions on thousands of parameters simultaneously.
TIES merging - introduced by Yadav et al. in 2023 - directly attacks this problem. Instead of averaging conflicting updates, it uses a majority vote on parameter signs and selectively discards updates that create conflict. The result is a merged model where capabilities don't cancel each other out.
The Problem: Sign Conflicts in Detail
Why Conflicting Signs Cancel Capabilities
Consider a single weight parameter shared across three fine-tuned models. Each model has updated this weight from its base value:
- Model 1 (technical writing): $\delta_1 = +0.5$ (this weight is important for technical phrasing)
- Model 2 (code generation): $\delta_2 = +0.4$ (this weight is also useful for code)
- Model 3 (translation): $\delta_3 = -0.6$ (translation needs this weight decreased)
Simple average: $\frac{0.5 + 0.4 - 0.6}{3} = +0.1$
Result: the parameter moves up by only 0.1, which is far too small to provide the +0.45 average benefit that models 1 and 2 need, and is in the wrong direction for model 3. All three models are poorly served.
If we instead identify that the majority wants a positive direction (+0.5 and +0.4 agree, -0.6 disagrees), we could apply the average of the agreeing deltas: $(0.5 + 0.4) / 2 = +0.45$. Models 1 and 2 are well-served; model 3's contribution to this parameter is simply discarded.
This is essentially what TIES does - formalized into a three-step algorithm.
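As a quick check of the arithmetic above, the following sketch compares the plain average with an average restricted to the winning direction. The helper names are made up for this illustration, and the winner is picked by the sign of the summed deltas, which is one reasonable reading of "majority" here.

```python
def simple_average(deltas: list[float]) -> float:
    """Plain mean of all deltas, conflicts included."""
    return sum(deltas) / len(deltas)


def sign_elected_average(deltas: list[float]) -> float:
    """Mean of only the deltas whose sign matches the sign of the total."""
    total = sum(deltas)
    if total == 0:
        return 0.0  # perfect tie: no direction wins
    winning_sign = 1 if total > 0 else -1
    agreeing = [d for d in deltas if d * winning_sign > 0]
    return sum(agreeing) / len(agreeing)


deltas = [0.5, 0.4, -0.6]  # technical writing, code generation, translation
print(round(simple_average(deltas), 2))        # 0.1
print(round(sign_elected_average(deltas), 2))  # 0.45
```

The second number is what models 1 and 2 wanted on average; the price is that model 3 gets nothing at this position.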
Why Small Delta Weights Also Cause Problems
Beyond sign conflicts, simple averaging has a second failure mode: small, noisy delta weights.
When you fine-tune a large model, most weights change very little. The weight updates follow a long-tail distribution - a small fraction of parameters change significantly (these actually encode the new capability) while the vast majority of parameters change by tiny amounts that represent noise in the training process.
When you average multiple models, these noisy small updates from all models combine, producing interference that doesn't represent any meaningful capability. The high-magnitude updates (the real signal) are diluted by the accumulated noise from all the low-magnitude changes.
TIES addresses this by trimming - removing delta weights below a threshold before merging.
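A toy simulation makes the dilution effect concrete. It is a sketch under strong assumptions: each hypothetical model has one large "signal" delta of 1.0 at its own position and independent Gaussian noise (standard deviation 0.01) everywhere else, and trimming uses a fixed magnitude threshold rather than a top-k rule. Real task vectors are more structured than this.

```python
import math
import random

random.seed(0)
N_PARAMS, NOISE_STD = 10_000, 0.01


def make_delta(signal_index: int) -> list[float]:
    """Toy task vector: small noise everywhere, one large update."""
    delta = [random.gauss(0.0, NOISE_STD) for _ in range(N_PARAMS)]
    delta[signal_index] = 1.0
    return delta


def trim(delta: list[float], threshold: float) -> list[float]:
    """Zero out entries whose magnitude is below the threshold."""
    return [d if abs(d) >= threshold else 0.0 for d in delta]


def noise_norm(summed: list[float], signal_indices: set[int]) -> float:
    """L2 norm of everything outside the signal positions."""
    return math.sqrt(
        sum(v * v for i, v in enumerate(summed) if i not in signal_indices)
    )


signal_indices = {0, 1, 2}
deltas = [make_delta(i) for i in sorted(signal_indices)]

# Sum the three task vectors position by position, with and without trimming
raw_sum = [sum(vals) for vals in zip(*deltas)]
trimmed_sum = [sum(vals) for vals in zip(*(trim(d, 0.5) for d in deltas))]

print(f"noise norm, raw sum:     {noise_norm(raw_sum, signal_indices):.3f}")
print(f"noise norm, trimmed sum: {noise_norm(trimmed_sum, signal_indices):.3f}")
```

In this setup the accumulated noise in the raw sum has a norm of roughly 1.7, larger than any single model's signal of 1.0, even though every individual noise entry is tiny. After trimming, the noise outside the signal positions is exactly zero.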
The TIES Algorithm
TIES stands for TrIm, Elect Sign & Merge. Each step targets a specific failure mode of naive averaging.
Step 1: Trim (TrIm)
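For each model $i$, keep only the top-$k\%$ of delta weights by absolute magnitude. Set everything else to zero.
$$\hat{\delta}_{i,j} = \begin{cases} \delta_{i,j} & \text{if } |\delta_{i,j}| \ge q_i \\ 0 & \text{otherwise} \end{cases}$$
Where $\delta_{i,j}$ is model $i$'s delta at parameter position $j$ and $q_i$ is the $(100-k)$-th percentile of $|\delta_i|$.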
The keep fraction is typically set to 20%: keep the top 20% of parameters by magnitude, zero out the bottom 80%. This aggressively prunes the noisy low-magnitude updates while preserving the high-magnitude updates that actually encode capability.
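To see the trim rule on concrete numbers, here is a minimal pure-Python sketch on a ten-element toy vector. The function name `trim_top_k` is illustrative, and ties at the cutoff are broken by position, which is a simplification.

```python
def trim_top_k(delta: list[float], keep_ratio: float = 0.2) -> list[float]:
    """Keep the largest-magnitude entries (with their signs), zero out the rest."""
    n_keep = max(1, int(keep_ratio * len(delta)))
    # Indices sorted by descending magnitude; the first n_keep survive
    by_magnitude = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    survivors = set(by_magnitude[:n_keep])
    return [d if i in survivors else 0.0 for i, d in enumerate(delta)]


delta = [0.01, -0.80, 0.02, 0.00, -0.03, 0.65, 0.01, -0.02, 0.04, 0.01]
print(trim_top_k(delta))
# [0.0, -0.8, 0.0, 0.0, 0.0, 0.65, 0.0, 0.0, 0.0, 0.0]
```

Only the two large updates survive, and each keeps its original sign, which is what the sign election in the next step relies on.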
Step 2: Elect Sign (Elect)
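For each parameter position $j$, determine the "winning" sign by looking at which direction most models' trimmed deltas point:
$$\gamma_j = \operatorname{sign}\left(\sum_{i=1}^{n} \hat{\delta}_{i,j}\right)$$
where $\hat{\delta}_{i,j}$ is model $i$'s trimmed delta at position $j$ and $n$ is the number of models.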
This is a soft majority vote weighted by magnitude: the sum of all trimmed deltas tends to be positive if most large-magnitude updates are positive, even if some models push negative.
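The difference between a soft, magnitude-weighted vote and a one-model-one-vote count shows up when a single strong update faces several weak ones. The sketch below contrasts the two on one parameter; `count_vote` is included only for comparison and is not part of TIES as described here.

```python
def count_vote(deltas: list[float]) -> int:
    """Sign chosen by counting models on each side (one model, one vote)."""
    pos = sum(1 for d in deltas if d > 0)
    neg = sum(1 for d in deltas if d < 0)
    return (pos > neg) - (pos < neg)


def magnitude_vote(deltas: list[float]) -> int:
    """Sign of the summed deltas, the magnitude-weighted rule described above."""
    total = sum(deltas)
    return (total > 0) - (total < 0)


trimmed = [0.1, 0.1, -0.9]  # two weak positive updates, one strong negative
print(count_vote(trimmed))      # 1
print(magnitude_vote(trimmed))  # -1
```

Two of three models vote positive, yet the elected sign is negative because the single negative update outweighs both positive ones combined.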
Step 3: Disjoint Merge
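For each parameter $j$, only include updates from models whose trimmed delta $\hat{\delta}_{i,j}$ agrees with the elected sign $\gamma_j$:
$$A_j = \{\, i : \operatorname{sign}(\hat{\delta}_{i,j}) = \gamma_j \,\}$$
Average only over models in the agreement set:
$$\delta^{\text{merged}}_j = \frac{1}{|A_j|} \sum_{i \in A_j} \hat{\delta}_{i,j}$$
Finally, apply the merged task vector to the base model with a scaling coefficient $\lambda$:
$$\theta_{\text{merged}} = \theta_{\text{base}} + \lambda \, \delta^{\text{merged}}$$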
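The merge and the final scaling can be traced by hand on a toy case. The sketch below assumes three models, four parameter positions, and deltas that have already been trimmed; the function names and all numbers are illustrative.

```python
def disjoint_merge(trimmed: list[list[float]]) -> list[float]:
    """Per position: elect a sign, then average the nonzero deltas that agree."""
    merged = []
    for column in zip(*trimmed):  # one column per parameter position
        total = sum(column)
        gamma = (total > 0) - (total < 0)
        agreeing = [d for d in column if d != 0 and d * gamma > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def apply_task_vector(base: list[float], merged: list[float], lam: float) -> list[float]:
    """theta_merged = theta_base + lambda * merged delta."""
    return [b + lam * m for b, m in zip(base, merged)]


trimmed = [
    [0.5, 0.0, -0.2, 0.0],   # model 1
    [0.4, 0.3, 0.0, 0.0],    # model 2
    [-0.6, 0.0, -0.4, 0.0],  # model 3
]
merged = disjoint_merge(trimmed)
print([round(m, 2) for m in merged])
# [0.45, 0.3, -0.3, 0.0]

base = [1.0, 1.0, 1.0, 1.0]
print([round(w, 2) for w in apply_task_vector(base, merged, lam=0.8)])
# [1.36, 1.24, 0.76, 1.0]
```

Position 1 is worth a second look: only model 2 has a nonzero delta there, and it is kept at its full value of 0.3 instead of being divided by three. Averaging over the agreement set alone is what prevents trimmed-away zeros from diluting the surviving updates.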
Full TIES Implementation
```python
import torch
from safetensors.torch import load_file, save_file


def compute_task_vectors(
    base_path: str,
    finetuned_paths: list[str],
) -> list[dict[str, torch.Tensor]]:
    """Compute task vectors for a list of fine-tuned models."""
    base = {k: v.float() for k, v in load_file(base_path).items()}
    task_vectors = []
    for path in finetuned_paths:
        ft = load_file(path)
        tv = {k: ft[k].float() - base[k] for k in base if k in ft}
        task_vectors.append(tv)
    return task_vectors


def ties_trim(
    task_vector: dict[str, torch.Tensor],
    keep_ratio: float = 0.2,
) -> dict[str, torch.Tensor]:
    """
    Step 1: Trim - zero out all but the top keep_ratio fraction of delta weights.

    The threshold is computed separately for each tensor.

    Parameters
    ----------
    keep_ratio : float
        Fraction of parameters to keep (by absolute magnitude). Default: top 20%.
    """
    trimmed = {}
    for key, delta in task_vector.items():
        flat = delta.abs().flatten()
        # Number of entries to keep, clamped to [1, numel]
        n_keep = min(flat.numel(), max(1, int(keep_ratio * flat.numel())))
        # Threshold: the n_keep-th largest magnitude, i.e. the
        # (numel - n_keep + 1)-th smallest (kthvalue is 1-indexed)
        threshold = flat.kthvalue(flat.numel() - n_keep + 1).values.item()
        mask = delta.abs() >= threshold
        trimmed[key] = delta * mask.float()
    return trimmed


def ties_elect_sign(
    trimmed_vectors: list[dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    """
    Step 2: Elect - compute the majority-vote sign for each parameter.

    The elected sign is the sign of the sum of all trimmed deltas,
    which captures both direction and magnitude.
    """
    elected = {}
    all_keys = set()
    for tv in trimmed_vectors:
        all_keys |= set(tv.keys())
    for key in all_keys:
        # Sum of trimmed deltas (weighted by magnitude = soft majority vote)
        total = sum(tv.get(key, torch.zeros(1)) for tv in trimmed_vectors)
        elected[key] = torch.sign(total)
    return elected


def ties_disjoint_merge(
    trimmed_vectors: list[dict[str, torch.Tensor]],
    elected_signs: dict[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
    """
    Step 3: Disjoint merge - average only sign-aligned updates per parameter.

    For each parameter, only models whose trimmed delta agrees with the
    elected sign contribute to the final merged vector.
    """
    merged = {}
    all_keys = set()
    for tv in trimmed_vectors:
        all_keys |= set(tv.keys())
    for key in all_keys:
        gamma = elected_signs[key]  # shape: same as parameter
        total = torch.zeros_like(gamma)
        count = torch.zeros_like(gamma)
        for tv in trimmed_vectors:
            delta = tv.get(key, None)
            if delta is None:
                continue
            # Agreement: trimmed delta is nonzero AND has the elected sign
            agree_mask = (delta != 0) & (torch.sign(delta) == gamma)
            total += delta * agree_mask.float()
            count += agree_mask.float()
        # Avoid division by zero (parameters where no model agreed)
        merged[key] = torch.where(count > 0, total / count, torch.zeros_like(total))
    return merged


def ties_merge(
    base_path: str,
    finetuned_paths: list[str],
    output_path: str,
    keep_ratio: float = 0.2,
    lambda_coeff: float = 1.0,
    dtype: torch.dtype = torch.bfloat16,
) -> None:
    """
    Full TIES merging pipeline.

    Parameters
    ----------
    base_path : Path to base model safetensors file.
    finetuned_paths : Paths to fine-tuned model safetensors files.
    output_path : Where to save the merged model.
    keep_ratio : Fraction of delta weights to keep in trimming step (default: 0.2).
    lambda_coeff : Scaling factor applied to the merged task vector (default: 1.0).
    dtype : Output dtype (default: bfloat16).
    """
    print(f"TIES Merging: {len(finetuned_paths)} models")
    print(f"  keep_ratio={keep_ratio}, lambda={lambda_coeff}")

    # Load base model
    base = {k: v.float() for k, v in load_file(base_path).items()}

    # Step 0: Compute raw task vectors
    print("Computing task vectors...")
    task_vectors = compute_task_vectors(base_path, finetuned_paths)

    # Step 1: Trim
    print(f"Trimming (keeping top {keep_ratio*100:.0f}% by magnitude)...")
    trimmed = [ties_trim(tv, keep_ratio) for tv in task_vectors]

    # Step 2: Elect signs
    print("Electing signs (majority vote)...")
    elected = ties_elect_sign(trimmed)

    # Step 3: Disjoint merge
    print("Disjoint merging (averaging sign-aligned updates)...")
    merged_vector = ties_disjoint_merge(trimmed, elected)

    # Apply merged vector to base model
    result = {}
    for key in base:
        delta = merged_vector.get(key, torch.zeros_like(base[key]))
        result[key] = (base[key] + lambda_coeff * delta).to(dtype)

    save_file(result, output_path)
    print(f"TIES merged model saved to {output_path}")


# Usage example:
# ties_merge(
#     base_path="models/llama3-8b-base.safetensors",
#     finetuned_paths=[
#         "models/llama3-technical-writing.safetensors",
#         "models/llama3-code-generation.safetensors",
#         "models/llama3-multilingual.safetensors",
#     ],
#     output_path="models/llama3-ties-merged.safetensors",
#     keep_ratio=0.2,
#     lambda_coeff=0.8,
# )
```
Why the Trim Step Works - Mathematical Intuition
The trim step is counterintuitive: you're zeroing out 80% of the fine-tuned updates. Why doesn't this destroy the model?
The intuition resembles the lottery ticket hypothesis, carried over to fine-tuning. When a model fine-tunes, most parameters change by tiny amounts that represent optimization noise - the gradient of the loss happened to point slightly in their direction, but they're not genuinely important for the new task. A small subset of parameters change by large amounts because they genuinely encode the new capability.
Experiments from Yadav et al. support this: after trimming 80% of delta weights, performance on the fine-tuned task degrades by only 1-3% on average. The fine-tuned capability is mostly encoded in the top 20% of parameters by magnitude.
This concentration of signal in a few parameters is also why sign conflicts are so damaging: the parameters that conflict in sign tend to be exactly the high-magnitude parameters that encode the most capability. Simple averaging cancels the most important parameters.
Information Content Analysis
You can verify this empirically by plotting the delta weight magnitude distribution:
```python
import matplotlib.pyplot as plt
import numpy as np


def analyze_delta_distribution(task_vector: dict, n_bins: int = 50) -> None:
    """Analyze and visualize the delta weight magnitude distribution."""
    all_deltas = np.concatenate(
        [delta.abs().flatten().float().cpu().numpy() for delta in task_vector.values()]
    )

    # Cumulative share of squared magnitude ("delta energy") held by the
    # largest entries, sorted in descending order
    sorted_deltas = np.sort(all_deltas)[::-1]
    energy = sorted_deltas.astype(np.float64) ** 2
    cumulative_energy = np.cumsum(energy) / np.sum(energy)

    # Report the share held by the top pct% of parameters
    for pct in [1, 5, 10, 20, 50]:
        cutoff = max(1, int(len(sorted_deltas) * pct / 100))
        energy_share = cumulative_energy[cutoff - 1]
        print(f"Top {pct:2d}% of parameters account for {energy_share*100:.1f}% of total delta energy")

    # Histogram of magnitudes; the log-scaled count axis exposes the heavy tail
    plt.hist(all_deltas, bins=n_bins, log=True)
    plt.xlabel("|delta|")
    plt.ylabel("count (log scale)")
    plt.title("Delta weight magnitude distribution")
    plt.show()


# Illustrative output for a 7B model fine-tuned for 1000 steps (exact values vary by run):
# Top  1% of parameters account for 43.2% of total delta energy
# Top  5% of parameters account for 71.8% of total delta energy
# Top 10% of parameters account for 84.1% of total delta energy
# Top 20% of parameters account for 93.7% of total delta energy
# Top 50% of parameters account for 99.1% of total delta energy
```
With a heavy-tailed distribution like this, the top 20% of parameters by magnitude contain over 90% of the "delta energy" (the squared L2 norm of the update), used here as a rough proxy for how much of the fine-tuning each parameter carries. Trimming the bottom 80% discards less than 10% of that energy.
TIES vs Simple Task Arithmetic - Benchmark Comparison
Yadav et al. (2023) evaluated TIES against simple task arithmetic on multiple task combinations and model families, including T5-base, T5-large, and CLIP vision transformers. Key findings:
| Method | 2-Task Merge | 4-Task Merge | 8-Task Merge |
|---|---|---|---|
| Simple averaging | 73.2% | 68.4% | 61.1% |
| Task arithmetic | 74.1% | 70.2% | 63.8% |
| TIES (keep=20%) | 76.8% | 73.9% | 68.5% |
The advantage of TIES grows with the number of models merged. For 2-model merges, simple averaging is often competitive. For 4+ model merges, TIES consistently outperforms.
The intuition: more models means more sign conflicts. TIES's sign-election mechanism becomes more valuable as the number of conflicting updates increases.
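A back-of-the-envelope calculation shows how quickly conflicts pile up. It rests on a loudly simplified assumption: at each parameter position, every model's delta is nonzero independently with probability `keep_ratio`, and positive or negative with equal probability. Real task vectors are correlated, so only the trend is meaningful, not the exact numbers. The function name is hypothetical.

```python
def conflict_probability(n_models: int, keep_ratio: float = 1.0) -> float:
    """
    Probability that one position sees both a positive and a negative
    nonzero delta, ASSUMING independent models with random signs.
    """
    p_side = keep_ratio / 2  # chance a given model pushes in one fixed direction
    no_positive = (1 - p_side) ** n_models
    no_negative = no_positive  # symmetric by assumption
    neither = (1 - keep_ratio) ** n_models  # every model is zero here
    # Inclusion-exclusion: P(at least one positive AND at least one negative)
    return 1 - no_positive - no_negative + neither


for n in (2, 4, 8):
    dense = conflict_probability(n)
    trimmed = conflict_probability(n, keep_ratio=0.2)
    print(f"{n} models: dense={dense:.3f}, trimmed(20%)={trimmed:.3f}")
# 2 models: dense=0.500, trimmed(20%)=0.020
# 4 models: dense=0.875, trimmed(20%)=0.097
# 8 models: dense=0.992, trimmed(20%)=0.307
```

Under these assumptions, conflicts grow steeply with the number of models in both columns, and trimming to 20% removes most of them before the sign election even starts.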
Hyperparameter Sensitivity
The two key hyperparameters in TIES are keep_ratio and lambda_coeff:
keep_ratio (Trim Fraction)
| keep_ratio | Effect |
|---|---|
| 0.05–0.10 | Very aggressive trimming; only the most critical parameters; may lose some nuance |
| 0.15–0.25 | Recommended range; good balance of noise removal and signal preservation |
| 0.30–0.50 | Conservative trimming; more signal but more noise included |
| 1.0 | No trimming; equivalent to skipping Step 1 |
The optimal keep_ratio depends on:
- Training duration (more training = more signal concentrated in fewer parameters → lower keep_ratio is fine)
- Learning rate (higher learning rate = larger but noisier deltas → more aggressive trimming helps)
- Number of models being merged (more models → lower keep_ratio reduces interference)
lambda_coeff
TIES's disjoint merge already handles sign conflicts, so lambda_coeff is more stable than in simple task arithmetic. Values between 0.5 and 1.0 typically work well. Higher values preserve more of the task-specific capabilities; lower values stay closer to the base model's distribution.
```python
import os
import tempfile
from typing import Callable, Sequence

from safetensors.torch import load_file


def ties_hyperparameter_sweep(
    base_path: str,
    finetuned_paths: list[str],
    evaluate_fn: Callable[[dict], float],
    keep_ratios: Sequence[float] = (0.1, 0.2, 0.3),
    lambda_coeffs: Sequence[float] = (0.5, 0.7, 1.0),
) -> dict:
    """Grid search over TIES hyperparameters (uses ties_merge defined above)."""
    best_result = {"score": float("-inf"), "keep_ratio": None, "lambda": None}
    for kr in keep_ratios:
        for lam in lambda_coeffs:
            # Reserve a temporary path for this merged checkpoint
            with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
                tmp_path = f.name
            try:
                ties_merge(base_path, finetuned_paths, tmp_path, kr, lam)
                score = evaluate_fn(load_file(tmp_path))
                print(f"  keep_ratio={kr:.2f}, lambda={lam:.2f} -> score={score:.4f}")
                if score > best_result["score"]:
                    best_result = {"score": score, "keep_ratio": kr, "lambda": lam}
            finally:
                # Always clean up the temporary checkpoint, even on failure
                os.unlink(tmp_path)
    print(f"\nBest: {best_result}")
    return best_result
```
When TIES Wins vs When Simple Averaging Is Enough
TIES clearly wins when:
- Merging 3+ models (sign conflicts compound with more models)
- Models are trained on very different tasks (code + translation + math)
- Fine-tuning was extensive (larger deltas mean larger conflicts)
Simple averaging may be competitive when:
- Merging exactly 2 models
- Models are trained on related tasks (writing + summarization)
- Fine-tuning was light (LoRA adapters, small learning rate)
Neither method works when:
- Models have different base checkpoints
- Models use different tokenizers
- Fine-tuning has moved models far outside the original loss basin
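Some of these incompatibilities can be caught before any merging happens. The sketch below is a hypothetical pre-merge check that compares tensor names and shapes only. It would flag a vocabulary resize caused by a different tokenizer, but it cannot tell apart two different base checkpoints that share an architecture, and it says nothing about loss basins.

```python
Shapes = dict[str, tuple[int, ...]]


def check_mergeable(base: Shapes, finetuned: Shapes) -> list[str]:
    """Return human-readable problems; an empty list means no structural mismatch."""
    problems = []
    for name in sorted(base.keys() - finetuned.keys()):
        problems.append(f"missing in fine-tune: {name}")
    for name in sorted(finetuned.keys() - base.keys()):
        problems.append(f"not in base: {name}")
    for name in sorted(base.keys() & finetuned.keys()):
        if base[name] != finetuned[name]:
            problems.append(
                f"shape mismatch for {name}: {base[name]} vs {finetuned[name]}"
            )
    return problems


# Toy shape tables; the tensor names are made up for this example
base = {"embed.weight": (32000, 64), "layer0.weight": (64, 64)}
resized = {"embed.weight": (32100, 64), "layer0.weight": (64, 64)}  # extended vocabulary
print(check_mergeable(base, resized))
# ['shape mismatch for embed.weight: (32000, 64) vs (32100, 64)']
```

With real checkpoints, the shape tables could be built from loaded tensors, for example `{k: tuple(v.shape) for k, v in load_file(path).items()}`.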
Common Mistakes
:::danger Don't set keep_ratio too high expecting better performance
Counter-intuitively, keeping more parameters (high keep_ratio) often produces worse TIES merges. The low-magnitude parameters you're keeping are largely noise. A keep_ratio of 0.2 (top 20%) typically beats 0.5. Start low and increase only if you see consistent quality degradation.
:::
:::warning The elected sign may be wrong for parameters where models are evenly split
If the positive and negative updates for a parameter carry similar total magnitude, the magnitude-weighted sum could go either way, and a small change in any one delta can flip it. The disjoint merge then excludes the losing side's contribution entirely, potentially losing useful signal. An odd number of models does not rule this out: the vote is weighted by magnitude rather than counted per model, and trimming leaves many positions with only a few nonzero voters.
:::
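How fragile the election is near a tie can be seen on a single parameter with two models. `ties_single_param` is a hypothetical helper written for this note; it follows the elect-then-average rule described earlier and is not taken from any library.

```python
def ties_single_param(trimmed_deltas: list[float]) -> float:
    """Elect a sign by summed magnitude, then average the agreeing deltas."""
    total = sum(trimmed_deltas)
    gamma = (total > 0) - (total < 0)
    agreeing = [d for d in trimmed_deltas if d * gamma > 0]
    return sum(agreeing) / len(agreeing) if agreeing else 0.0


print(ties_single_param([0.50, -0.49]))  # 0.5
print(ties_single_param([0.49, -0.50]))  # -0.5
print(ties_single_param([0.50, -0.50]))  # 0.0
```

A shift of 0.01 in the inputs moves the merged value from +0.5 to -0.5, and an exact tie drops the parameter's update altogether.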
:::tip Apply TIES per layer group, not globally
The keep_ratio that works for attention layers may differ from the optimum for MLP layers and embedding layers. Production implementations like MergeKit (where the equivalent setting is called density) allow it to be configured per layer type. Experiment with a lower keep_ratio for the LM head (most sensitive) and a higher one for middle transformer layers.
:::
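One way to wire this up in a hand-rolled pipeline is a name-based lookup. The sketch below is only an illustration: the glob patterns assume Llama-style parameter names, the ratios are placeholder values rather than recommendations, and `keep_ratio_for` is a hypothetical helper, not a MergeKit API.

```python
import fnmatch

# Hypothetical rules: the first matching glob pattern wins.
KEEP_RATIO_RULES = (
    ("lm_head.*", 0.10),
    ("*.self_attn.*", 0.20),
    ("*.mlp.*", 0.25),
)


def keep_ratio_for(
    param_name: str,
    rules: tuple[tuple[str, float], ...] = KEEP_RATIO_RULES,
    default: float = 0.2,
) -> float:
    """Look up the keep_ratio for one parameter by its name."""
    for pattern, ratio in rules:
        # fnmatchcase keeps matching case-sensitive on every platform
        if fnmatch.fnmatchcase(param_name, pattern):
            return ratio
    return default


print(keep_ratio_for("lm_head.weight"))                       # 0.1
print(keep_ratio_for("model.layers.3.mlp.down_proj.weight"))  # 0.25
print(keep_ratio_for("model.embed_tokens.weight"))            # 0.2
```

Inside a trimming loop such as `ties_trim`, a lookup like this could replace the single `keep_ratio` argument, with the key of each tensor passed as `param_name`.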
Interview Q&A
Q: What is the sign conflict problem and why does simple task arithmetic fail to address it?
A: Sign conflicts occur when different fine-tuned models push the same weight parameter in opposite directions. If model A increases a weight by 0.5 and model B decreases it by 0.4, simple averaging gives +0.05 - essentially zero. Neither model's intended update is applied, and both capabilities are degraded. Simple task arithmetic doesn't address this because it sums task vectors without any filtering: conflicting updates cancel each other rather than being handled intelligently.
Q: Walk through the three steps of TIES merging.
A: Step 1, Trim: for each model's task vector, zero out all delta weights except the top-$k\%$ by absolute magnitude (typically top 20%). This removes noisy low-magnitude updates that mostly contribute interference. Step 2, Elect: for each parameter position, compute the sign of the sum of all trimmed deltas. This is a magnitude-weighted majority vote - the direction most supported by large-magnitude updates wins. Step 3, Disjoint merge: for each parameter, average only those models' trimmed deltas that agree with the elected sign. Models whose update conflicts with the elected direction are excluded from contributing to that parameter. The result is applied to the base model with a scaling coefficient.
Q: Why does trimming 80% of delta weights barely hurt performance?
A: Delta weight distributions from fine-tuning are heavily right-skewed: a small fraction of parameters change by large amounts (these genuinely encode the new capability) while the vast majority change by tiny amounts representing optimization noise. Empirically, the top 20% of parameters by absolute magnitude contain 90%+ of the total "delta energy" - the squared L2 norm of the update. The bottom 80% contribute little to actual task performance but a lot to interference when merging. Yadav et al. confirmed this: trimming 80% of delta weights typically degrades single-task performance by only 1-3%, while dramatically reducing interference in merges.
Q: How does TIES scale as the number of models increases, compared to simple averaging?
A: TIES becomes increasingly advantageous as the number of merged models grows. With 2 models, sign conflicts affect perhaps 15-25% of parameters; with 8 models, nearly every parameter will have at least one conflicting update. Simple averaging compounds these conflicts, producing increasingly degraded results with more models. TIES handles additional models gracefully because the majority vote mechanism becomes more informative with more voters: the winning sign gets stronger majority support, and the disjoint merge more cleanly separates agreeing from disagreeing models. In the benchmark comparison above, the gap between TIES and simple averaging grows from about 3.6 points at 2-task merges to about 7.4 points at 8-task merges.
Q: What is the keep_ratio hyperparameter in TIES and how do you choose it?
A: keep_ratio determines what fraction of delta weights survive the trimming step (the rest are zeroed out). A value of 0.2 means "keep the top 20% by absolute magnitude." Lower values are more aggressive (remove more noise but risk losing signal), higher values are more conservative (preserve more signal but keep more noise). The optimal value depends on fine-tuning intensity: models trained for many steps with high learning rates develop more concentrated, high-magnitude deltas, so lower keep_ratio (0.1-0.15) works better. Models fine-tuned with LoRA or for fewer steps have less concentrated deltas, benefiting from higher keep_ratio (0.2-0.3). Always tune keep_ratio on a held-out evaluation set rather than using the default blindly.
