MergeKit - The Practical Toolkit
The Tool That Made Merging Accessible
Before MergeKit, running TIES or DARE required implementing the algorithms from scratch, handling the loading and saving of large model checkpoints, managing memory carefully to avoid OOM errors, and writing the HuggingFace configuration files correctly. Each merge was a small engineering project.
MergeKit, developed by Charles Goddard at Arcee AI and released as open-source in late 2023, turned model merging into a YAML configuration exercise. You write a configuration file describing which models to merge and how, then run a single command. MergeKit handles everything else: downloading models from HuggingFace Hub, loading them layer by layer to manage memory, applying the correct algorithm, and saving the result in a format compatible with the transformers library.
The open-source community adopted MergeKit almost immediately. Within a few months of release, it was the standard tool for model merging experiments on HuggingFace Hub. Most of the top-ranked models on the Open LLM Leaderboard that are merges were produced with MergeKit.
This lesson covers MergeKit end-to-end: installation, YAML configuration for all supported methods, memory management, and the complete workflow from local merge to published HuggingFace model.
Installation and Setup
# Install from PyPI
pip install mergekit
# Or from source (for latest features)
git clone https://github.com/arcee-ai/mergekit
cd mergekit
pip install -e .
# For GPU-accelerated merging (optional but recommended for large models),
# install a CUDA-enabled PyTorch build and pass --cuda to mergekit-yaml
# Verify installation
mergekit-yaml --help
MergeKit's dependencies are minimal: torch, transformers, safetensors, pydantic, and huggingface_hub. No special compute is required - all major merge methods work on CPU.
Understanding the Configuration Format
Every MergeKit merge is specified by a YAML file with this general structure:
merge_method: <method>         # linear | ties | dare_linear | dare_ties | slerp | passthrough
base_model: <hub_id_or_path>   # required for most methods
models:
  - model: <hub_id_or_path>
    parameters:
      weight: <float>          # contribution weight (for linear/ties)
      density: <float>         # keep ratio (for dare methods, = 1 - drop_rate)
      # ... other per-model params
parameters:
  # Global parameters that apply to all models
  normalize: <bool>            # normalize weights so they sum to 1
  int8_mask: <bool>            # use 8-bit masks to reduce memory
  # Method-specific parameters vary
dtype: bfloat16                # output dtype
The base_model is the reference model from which task vectors are computed. The models list contains the fine-tuned models being merged (not the base model itself - the base is separate).
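The conventions above can be checked mechanically before a config is written to YAML. The following is a minimal sketch: `check_merge_config` and `TASK_VECTOR_METHODS` are hypothetical names introduced for illustration and are not part of MergeKit. It encodes only the rules stated in this section (for task-vector methods the base is listed separately from the models being merged, and `density` is a keep ratio between 0 and 1).

```python
# Hypothetical helper, not part of MergeKit: a pre-flight check on a config dict.
TASK_VECTOR_METHODS = {"ties", "dare_linear", "dare_ties"}  # assumption: methods that subtract the base


def check_merge_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    method = cfg.get("merge_method")
    base = cfg.get("base_model")
    models = cfg.get("models", [])
    if not method:
        problems.append("merge_method is missing")
    if not models:
        problems.append("models list is empty")
    for i, entry in enumerate(models):
        name = entry.get("model")
        # For task-vector methods the base model should not appear among the fine-tunes.
        if method in TASK_VECTOR_METHODS and base is not None and name == base:
            problems.append(f"models[{i}] repeats base_model ({name})")
        density = entry.get("parameters", {}).get("density")
        if isinstance(density, (int, float)) and not 0.0 < density <= 1.0:
            problems.append(f"models[{i}] density {density} is outside (0, 1]")
    return problems


example = {
    "merge_method": "ties",
    "base_model": "org/base",
    "models": [
        {"model": "org/finetune-a", "parameters": {"weight": 1.0, "density": 0.2}},
        {"model": "org/finetune-b", "parameters": {"weight": 1.0, "density": 0.2}},
    ],
    "dtype": "bfloat16",
}
print(check_merge_config(example))
```

The model names in `example` are placeholders. A check like this catches typos early, but it says nothing about whether the listed models are actually compatible.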
Linear Merging
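The simplest merge method: a weighted average of the models' parameters. MergeKit's linear method averages the weights directly rather than computing task vectors, so base_model plays no role in the arithmetic here.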
# linear_merge.yaml
# Weighted average of two instruction-tuned models
merge_method: linear
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 0.6  # contribute 60% to the merge
  - model: NousResearch/Hermes-3-Llama-3.1-8B
    parameters:
      weight: 0.4  # contribute 40%
parameters:
  normalize: false  # don't force weights to sum to 1
dtype: bfloat16
mergekit-yaml linear_merge.yaml ./output-model \
--copy-tokenizer \
--allow-crimes # required for some model combinations
The --copy-tokenizer flag copies the tokenizer files from the base model to the output directory, so the merged model is immediately loadable with AutoTokenizer.from_pretrained.
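The averaging step itself is small enough to sketch on toy vectors. This is an illustration only: `linear_merge` is a hypothetical function operating on plain Python lists, whereas MergeKit works tensor by tensor on real checkpoints. The `normalize` argument mirrors the config option of the same name by dividing the weights by their sum.

```python
def linear_merge(tensors, weights, normalize=True):
    """Weighted average of same-length parameter vectors (plain lists of floats)."""
    if len(tensors) != len(weights):
        raise ValueError("need exactly one weight per tensor")
    # With normalize=True the weights are rescaled so that they sum to 1.
    scale = 1.0 / sum(weights) if normalize else 1.0
    merged = [0.0] * len(tensors[0])
    for tensor, weight in zip(tensors, weights):
        for i, value in enumerate(tensor):
            merged[i] += weight * scale * value
    return merged


model_a = [1.0, 2.0, 3.0]
model_b = [3.0, 0.0, -1.0]
merged = linear_merge([model_a, model_b], weights=[0.6, 0.4])
print(merged)
```

With weights 0.6 and 0.4 the `normalize` setting makes no difference, because the weights already sum to 1. It matters once the weights are given as, say, 2 and 2.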
TIES Merging
# ties_merge.yaml
# TIES merge: resolves sign conflicts across multiple models
# Model IDs are illustrative; every model listed must be a fine-tune of base_model
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardLM-7B-V1.0
    parameters:
      weight: 1.0   # equal weight for all models
      density: 0.2  # keep top 20% of delta weights (= TIES keep_ratio)
  - model: WizardLM/WizardMath-7B-V1.0
    parameters:
      weight: 1.0
      density: 0.2
  - model: WizardLM/WizardCoder-Python-7B-V1.0
    parameters:
      weight: 1.0
      density: 0.2
parameters:
  normalize: true  # normalize weights so they sum to 1
dtype: bfloat16
In MergeKit, density corresponds to TIES's keep_ratio: density=0.2 means keep the top 20% of delta weights by magnitude.
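To make the role of density concrete, here is a toy sketch of the three TIES steps (trim by magnitude, elect a sign per parameter, average the agreeing values) on plain Python lists. `trim` and `ties_merge` are hypothetical helpers written for this illustration. They use equal weights and are not MergeKit's implementation, which also handles per-model weights and normalization.

```python
def trim(delta, density):
    """Keep the `density` fraction of entries with the largest magnitude; zero the rest."""
    k = max(1, round(len(delta) * density))
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(delta)]


def ties_merge(base, finetuned, density):
    """Toy TIES with equal weights: trim, elect sign, average agreeing deltas, add to base."""
    deltas = [trim([f - b for f, b in zip(model, base)], density) for model in finetuned]
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Elect the sign that carries the larger total magnitude across models.
        elected = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [v for v in column if v != 0.0 and v * elected > 0]
        merged.append(b + (sum(agreeing) / len(agreeing) if agreeing else 0.0))
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.1, -0.5, 0.0]
model_b = [-0.2, 0.9, 0.6, 0.05]
result = ties_merge(base, [model_a, model_b], density=0.5)
print(result)
```

In the third position the two models disagree in sign (-0.5 versus 0.6). The positive direction has more total magnitude, so only 0.6 survives instead of the two values cancelling to 0.05.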
DARE + Linear
# dare_linear.yaml
# DARE preprocessing + linear average
# Good for 2-3 models with some domain overlap
merge_method: dare_linear
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 0.5
      density: 0.1  # = 1 - drop_rate; density=0.1 means drop 90% of deltas
  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    parameters:
      weight: 0.5
      density: 0.1
dtype: bfloat16
The density parameter in DARE methods means "fraction of delta weights to keep" - the complement of the drop rate. density=0.1 → drop 90% → drop_rate=0.9.
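The drop-and-rescale idea behind DARE can be sketched in a few lines. This is a toy illustration on a Python list, with `dare` as a hypothetical function name. Each delta is kept with probability `density` and the survivors are divided by `density`, so the expected value of every entry is unchanged.

```python
import random


def dare(delta, density, rng):
    """Keep each entry with probability `density`; rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
delta = [0.02] * 10_000           # a flat toy task vector
sparse = dare(delta, density=0.1, rng=rng)

kept = sum(1 for v in sparse if v != 0.0)
mean_before = sum(delta) / len(delta)
mean_after = sum(sparse) / len(sparse)
print(f"kept {kept} of {len(delta)} entries")
print(f"mean before: {mean_before:.4f}, mean after: {mean_after:.4f}")
```

Roughly 10% of the entries survive, each scaled up tenfold, so the mean stays close to the original. That rescaling is what lets DARE drop most of a task vector without shrinking its overall effect.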
DARE + TIES (The Gold Standard for 3+ Models)
# dare_ties_merge.yaml
# DARE preprocessing + TIES sign resolution
# Best choice for merging 3+ diverse models
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 1.0
      density: 0.15  # keep a random 15% of deltas, rescaled by 1/density
  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    parameters:
      weight: 1.0
      density: 0.15
  - model: your-org/llama-3-8b-math  # placeholder: a math fine-tune of the same base
    parameters:
      weight: 1.0
      density: 0.15
parameters:
  normalize: true
dtype: bfloat16
SLERP Merging
# slerp_merge.yaml
# Spherical interpolation between exactly two models
merge_method: slerp
base_model: meta-llama/Meta-Llama-3-8B-Instruct  # must be one of the two models; the t = 0 endpoint
models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
  - model: NousResearch/Meta-Llama-3-8B-Instruct
parameters:
  t: 0.5  # global blend: 0.0 returns base_model, 1.0 returns the other model
dtype: bfloat16
For gradient SLERP (different blend per layer):
parameters:
  t:
    # Override t for specific layers by key pattern
    - filter: "model.embed_tokens"
      value: 0.3
    - filter: "model.layers.0"
      value: 0.4
    - filter: "model.layers.31"
      value: 0.6
    - filter: "lm_head"
      value: 0.3
    - value: 0.5  # default for everything else
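The interpolation that t controls can be written out directly. The sketch below is a plain-Python version of spherical linear interpolation on two non-zero vectors. `slerp` here is an illustrative function, not MergeKit's code, and the fallback to linear interpolation for nearly parallel vectors is a common convention assumed for this example.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two non-zero vectors of equal length."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the vectors, clamped against rounding error.
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        # Nearly parallel vectors: the arc degenerates, so interpolate linearly.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


start = [1.0, 0.0]
end = [0.0, 1.0]
for t in (0.0, 0.5, 1.0):
    print(t, slerp(t, start, end))
```

For these two orthogonal unit vectors the midpoint stays on the unit circle. A plain linear average would give (0.5, 0.5), which is shorter than either input.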
Passthrough - Layer-Level Frankenmodels
The passthrough method enables building frankenmodels: assembling a model from layers of different source models without any weight averaging.
# frankenmodel.yaml
# Take the first 16 layers from model A and last 16 layers from model B
# (Creates a 32-layer "frankenmodel" from two 32-layer models)
merge_method: passthrough
slices:
  - sources:
      - model: meta-llama/Meta-Llama-3-8B-Instruct
        layer_range: [0, 16]   # First half from instruct model
  - sources:
      - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
        layer_range: [16, 32]  # Second half from code model
dtype: bfloat16
This is explored further in Lesson 07 (Frankenmodels).
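As a quick mental model of how the slices stack, the sketch below assembles layer labels from half-open ranges, matching the example where [0, 16] means layers 0 through 15. `assemble_layers` and the model names are hypothetical; real passthrough merges copy weight tensors, not strings.

```python
def assemble_layers(slices):
    """Concatenate layer labels from (model_name, (start, end)) slices; end is exclusive."""
    stacked = []
    for model_name, (start, end) in slices:
        stacked.extend(f"{model_name}.layer{idx}" for idx in range(start, end))
    return stacked


frankenmodel = assemble_layers([
    ("model-a", (0, 16)),    # first half from model A
    ("model-b", (16, 32)),   # second half from model B
])
print(len(frankenmodel))
print(frankenmodel[15], "->", frankenmodel[16])
```

Overlapping or repeated ranges make the list longer than either source, which is how passthrough merges produce models with more layers than their inputs.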
Full Worked Example: Coding + Instruction Following Merge
Let's walk through a complete merge workflow: merging Llama-3-8B-Instruct with a code-specialized model using DARE+TIES.
Step 1: Write the Configuration
# llama3-code-instruct-merge.yaml
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 1.0
      density: 0.2
  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    # Note: Different base model! This may cause issues.
    # For production, use models that share the same base.
    parameters:
      weight: 0.7  # Lower weight for the non-base-aligned model
      density: 0.2
parameters:
  normalize: true
dtype: bfloat16
tokenizer_source: base  # Use base model's tokenizer
Step 2: Run the Merge
# Merge to local directory
mergekit-yaml llama3-code-instruct-merge.yaml ./merged-llama3-code-instruct \
  --copy-tokenizer \
  --trust-remote-code \
  --lazy-unpickle  # Use lazy loading to reduce peak RAM
# Options:
# --cuda            # Use GPU for faster merging (optional)
# --low-cpu-memory  # Keep results and intermediates on the GPU; pair with --cuda when VRAM exceeds RAM
# --allow-crimes    # Bypass safety checks (e.g., mismatched configs)
# --verbose         # Detailed progress output
Step 3: Verify the Merged Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load merged model
model_path = "./merged-llama3-code-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
def test_model(prompt: str, max_new_tokens: int = 200) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            do_sample=True,
        )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
# Test instruction following
print("=== Instruction Following Test ===")
print(test_model("What is the difference between a list and a tuple in Python?"))
# Test code generation
print("\n=== Code Generation Test ===")
print(test_model("Write a Python function that finds the nth Fibonacci number using memoization."))
# Test that base capabilities are preserved
print("\n=== General Knowledge Test ===")
print(test_model("Explain the concept of attention in transformer models."))
Step 4: Evaluate on Benchmarks
# Quick benchmark using lm-evaluation-harness
# pip install lm-eval
import subprocess
model_path = "./merged-llama3-code-instruct"  # output directory from Step 2
benchmarks = [
    "mmlu",       # General knowledge
    "humaneval",  # Code generation
    "gsm8k",      # Math reasoning
    "hellaswag",  # Common sense
]
for benchmark in benchmarks:
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path},dtype=bfloat16",
        "--tasks", benchmark,
        "--device", "cuda:0",
        "--batch_size", "4",
        "--output_path", f"./eval-results/{benchmark}",
    ]
    print(f"Running {benchmark}...")
    subprocess.run(cmd, check=True)
Step 5: Upload to HuggingFace Hub
from huggingface_hub import HfApi
import json, os
api = HfApi()
# Create repository
repo_id = "your-username/llama3-8b-code-instruct-dare-ties"
api.create_repo(repo_id=repo_id, repo_type="model", private=False, exist_ok=True)
# Write model card
model_card = """---
language:
- en
license: llama3
base_model:
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-8B-Instruct
- deepseek-ai/deepseek-coder-7b-instruct-v1.5
tags:
- merge
- dare
- ties
- code
- instruct
---
# Llama-3-8B Code+Instruct DARE-TIES Merge
A merge of Llama-3-8B-Instruct and DeepSeek-Coder-7B-Instruct using DARE+TIES.
## Merge Configuration
```yaml
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 1.0
      density: 0.2
  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    parameters:
      weight: 0.7
      density: 0.2
parameters:
  normalize: true
dtype: bfloat16
```
## Evaluation Results
| Benchmark | Score |
|---|---|
| MMLU (5-shot) | TBD |
| HumanEval (pass@1) | TBD |
| GSM8K (8-shot) | TBD |

Merged using MergeKit.
"""
model_path = "./merged-llama3-code-instruct"  # output directory from Step 2
with open(f"{model_path}/README.md", "w") as f:
    f.write(model_card)
# Upload the entire model directory
api.upload_folder(
    folder_path=model_path,
    repo_id=repo_id,
    repo_type="model",
)
print(f"Model uploaded to: https://huggingface.co/{repo_id}")
Memory Management - CPU Merging at Scale
MergeKit's most important engineering feature is lazy layer-by-layer loading. Instead of loading all models into RAM simultaneously, it processes one layer (or one tensor) at a time:
```mermaid
flowchart TD
    A["Layer 0: embed_tokens<br/>Load from all models"]:::blue
    B["Merge Layer 0<br/>Apply algorithm"]:::green
    C["Save Layer 0<br/>Free RAM"]:::teal
    D["Layer 1: layers.0.attn<br/>Load from all models"]:::blue
    E["Merge Layer 1<br/>Apply algorithm"]:::green
    F["Save Layer 1<br/>Free RAM"]:::teal
    G["...continue for<br/>all 32 layers..."]:::purple
    H["Final: lm_head<br/>Merge and save"]:::orange
    A --> B --> C --> D --> E --> F --> G --> H
    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
    classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
```
RAM Requirements
| Model Size | Models Merged | Peak RAM (MergeKit lazy) | Peak RAM (naive full load) |
|---|---|---|---|
| 7B (BF16) | 2 | ~24 GB | ~42 GB |
| 7B (BF16) | 3 | ~28 GB | ~56 GB |
| 13B (BF16) | 2 | ~40 GB | ~78 GB |
| 70B (BF16) | 2 | ~160 GB | ~420 GB |
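A naive full load keeps all N source models plus the output in memory at once, roughly (N+1) × single_model_size, where N is the number of models; for two 7B models in BF16 (about 14 GB each) that is 3 × 14 = 42 GB. Lazy loading avoids this by holding only the tensors currently being merged, plus the output shard being written. Treat the lazy column as a rough estimate: the actual peak depends on shard size, operating-system file caching, and the flags used. For large models, this difference enables merging on commodity hardware that would otherwise run out of memory.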
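The naive full-load figure is simple arithmetic and can be reproduced in a few lines. This sketch assumes 2 bytes per parameter for BF16 and counts N source models plus one output copy; `naive_peak_gb` is a hypothetical helper, and it deliberately does not try to model the lazy path, which depends on implementation details.

```python
def naive_peak_gb(params_billions, n_models, bytes_per_param=2):
    """Peak RAM in GB if all N sources plus the output are resident at once."""
    # 1e9 parameters x bytes per parameter / 1e9 bytes per GB
    model_gb = params_billions * bytes_per_param
    return (n_models + 1) * model_gb


for size, n in [(7, 2), (7, 3), (13, 2), (70, 2)]:
    print(f"{size}B, {n} models: ~{naive_peak_gb(size, n)} GB naive peak")
```

Passing `bytes_per_param=4` gives the corresponding figure for FP32 checkpoints, which doubles every number.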
Enabling Lazy Loading
# MergeKit already loads tensors lazily; --lazy-unpickle extends this to pickle-format checkpoints:
mergekit-yaml config.yaml ./output \
  --lazy-unpickle
# For extremely constrained environments (e.g., 32GB RAM for 7B models),
# also write smaller output shards so fewer merged tensors are buffered before each write:
mergekit-yaml config.yaml ./output \
  --lazy-unpickle \
  --out-shard-size 1B
Advanced Configuration Patterns
Per-Layer Configuration with Filter
# Different parameters for different layer types
merge_method: ties
base_model: meta-llama/Meta-Llama-3-8B
models:
  - model: model-a
    parameters:
      weight: 1.0
      density:
        - filter: "embed_tokens"
          value: 0.5   # More conservative for embeddings
        - filter: "lm_head"
          value: 0.5   # More conservative for output layer
        - value: 0.15  # Aggressive for middle layers
  - model: model-b
    parameters:
      weight: 1.0
      density:
        - filter: "embed_tokens"
          value: 0.5
        - filter: "lm_head"
          value: 0.5
        - value: 0.15
parameters:
  normalize: true
dtype: bfloat16
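How a filter list resolves to a value for a given tensor can be sketched as a first-match lookup. `resolve_parameter` below is a hypothetical helper that assumes substring matching against the tensor name and a final entry without a filter acting as the default; it illustrates the rule and is not MergeKit's own resolver.

```python
def resolve_parameter(tensor_name, setting):
    """Return a scalar unchanged; for a list of {filter, value} rules, return the first match."""
    if not isinstance(setting, list):
        return setting
    for rule in setting:
        pattern = rule.get("filter")
        # A rule without a filter is the catch-all default.
        if pattern is None or pattern in tensor_name:
            return rule["value"]
    raise KeyError(f"no rule matches {tensor_name!r} and no default was given")


density = [
    {"filter": "embed_tokens", "value": 0.5},
    {"filter": "lm_head", "value": 0.5},
    {"value": 0.15},
]

for name in (
    "model.embed_tokens.weight",
    "model.layers.7.mlp.up_proj.weight",
    "lm_head.weight",
):
    print(name, "->", resolve_parameter(name, density))
```

Because the first match wins, the default entry belongs at the end of the list; placed first, it would shadow every filter after it.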
Tokenizer Handling
When merging models with different tokenizers (which usually means you shouldn't be merging them), MergeKit provides tokenizer_source to specify which tokenizer to use. Set exactly one value:
# Only use this if you're sure the weight spaces are compatible
# despite different tokenizers
tokenizer_source: "base"                   # Use base model's tokenizer (recommended)
# tokenizer_source: "union"                # Merge tokenizers (experimental)
# tokenizer_source: "model:path/to/model"  # Use a specific model's tokenizer
Multi-Layer Slicing for Frankenmodels
merge_method: passthrough
slices:
  # layer_range indexes transformer blocks only and is half-open: [start, end)
  # Embeddings, final norm, and LM head are not addressed through layer_range;
  # MergeKit carries them over without a slice entry of their own
  # Layers 0-15 from model A
  - sources:
      - model: model-a
        layer_range: [0, 16]
  # Layers 16-31 from model B
  - sources:
      - model: model-b
        layer_range: [16, 32]
dtype: bfloat16
Python API for Programmatic Use
For integration into automated workflows, MergeKit provides a Python API:
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge
import yaml
# Define configuration programmatically
config_dict = {
    "merge_method": "dare_ties",
    "base_model": "meta-llama/Meta-Llama-3-8B",
    "models": [
        {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "parameters": {"weight": 1.0, "density": 0.2}
        },
        {
            "model": "deepseek-ai/deepseek-coder-7b-instruct-v1.5",
            "parameters": {"weight": 0.7, "density": 0.2}
        }
    ],
    "parameters": {"normalize": True},
    "dtype": "bfloat16"
}
# Or load it from YAML instead:
# with open("my_merge_config.yaml") as f:
#     config_dict = yaml.safe_load(f)
# Parse and validate configuration
config = MergeConfiguration.model_validate(config_dict)
# Run merge
options = MergeOptions(
    cuda=False,            # CPU merge
    copy_tokenizer=True,
    lazy_unpickle=True,
    low_cpu_memory=False,  # only enable together with cuda=True, when VRAM exceeds RAM
)
run_merge(config, out_path="./merged-model", options=options)
print("Merge complete!")
Automated Hyperparameter Search
import copy
import itertools
import json
import subprocess
from pathlib import Path
import yaml
def run_mergekit_sweep(
    base_config: dict,
    sweep_params: dict,
    eval_command: str,
    output_dir: str = "./sweep-results",
):
    """
    Run a grid search over MergeKit hyperparameters.
    sweep_params: {param_path: [values]}
    Example: {"models.0.parameters.density": [0.1, 0.2, 0.3],
              "models.1.parameters.density": [0.1, 0.2]}
    eval_command: shell command containing "{model_dir}" that prints one float score.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results = []
    param_names = list(sweep_params.keys())
    param_values = list(sweep_params.values())
    for combo in itertools.product(*param_values):
        # Deep-copy so nested edits never leak into base_config or other combos
        config = copy.deepcopy(base_config)
        label_parts = []
        for param_path, value in zip(param_names, combo):
            # Walk the dotted path; numeric segments index into lists
            keys = [int(k) if k.isdigit() else k for k in param_path.split(".")]
            obj = config
            for key in keys[:-1]:
                obj = obj[key]
            obj[keys[-1]] = value
            # Use the full path so two "density" parameters stay distinguishable
            label_parts.append(f"{param_path}={value}")
        label = "_".join(label_parts)
        model_dir = f"{output_dir}/{label}"
        # Write config file
        config_path = f"{output_dir}/{label}_config.yaml"
        with open(config_path, "w") as f:
            yaml.dump(config, f)
        # Run merge
        print(f"\nRunning: {label}")
        subprocess.run(
            ["mergekit-yaml", config_path, model_dir, "--lazy-unpickle"],
            check=True
        )
        # Evaluate: the command must print a single number on stdout
        score = float(
            subprocess.check_output(
                eval_command.replace("{model_dir}", model_dir),
                shell=True
            ).decode().strip()
        )
        result = {"label": label, "score": score, "params": dict(zip(param_names, combo))}
        results.append(result)
        print(f" Score: {score:.4f}")
        # Save running results, best first
        with open(f"{output_dir}/results.json", "w") as f:
            json.dump(sorted(results, key=lambda x: -x["score"]), f, indent=2)
    best = max(results, key=lambda x: x["score"])
    print(f"\nBest configuration: {best['label']} | score={best['score']:.4f}")
    return results
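The dotted-path update at the heart of the sweep is easy to get subtly wrong, so it is worth isolating and testing without running any merges. The sketch below uses a hypothetical helper, `set_by_path`, which returns a modified deep copy and leaves the input untouched; the config contents are placeholders.

```python
import copy


def set_by_path(config, path, value):
    """Return a deep copy of `config` with the dotted `path` set to `value`."""
    updated = copy.deepcopy(config)
    # Numeric segments index into lists; everything else is a dict key.
    keys = [int(k) if k.isdigit() else k for k in path.split(".")]
    target = updated
    for key in keys[:-1]:
        target = target[key]
    target[keys[-1]] = value
    return updated


base_config = {
    "merge_method": "dare_ties",
    "models": [
        {"model": "org/finetune-a", "parameters": {"weight": 1.0, "density": 0.2}},
        {"model": "org/finetune-b", "parameters": {"weight": 0.7, "density": 0.2}},
    ],
}

variant = set_by_path(base_config, "models.1.parameters.density", 0.3)
print(variant["models"][1]["parameters"]["density"])
print(base_config["models"][1]["parameters"]["density"])
```

A shallow `dict.copy()` would share the nested `models` list between the copy and the original, so edits made for one combination would silently carry over into the base configuration.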
Common Mistakes
:::danger Don't forget --copy-tokenizer
Without --copy-tokenizer, the merged model directory won't contain tokenizer files. The model will fail to load with AutoTokenizer.from_pretrained. Always include --copy-tokenizer.
:::
:::warning density in MergeKit is the complement of drop_rate
In DARE/DARE+TIES configurations, density means "fraction of delta weights to KEEP": it's 1 - drop_rate. density=0.2 keeps a random 20% of the deltas and drops 80%; density=0.1 drops 90%. (Only TIES keeps the top fraction by magnitude; DARE drops at random and rescales the survivors.) Research papers usually report drop_rate, so convert before copying values. Don't confuse them.
:::
:::danger Models must share the same base for TIES/DARE methods
MergeKit won't always error out if you provide models with different bases - it will often complete the merge, but the result will be poor. The --allow-crimes flag bypasses safety checks that are there for good reason. Only use it when you know what you're doing.
:::
:::tip Use --lazy-unpickle for pickle-format checkpoints
MergeKit already reads safetensors checkpoints tensor by tensor. Checkpoints stored in PyTorch's pickle format (.bin) may otherwise be deserialized in full before processing, so for models of 7B and above in that format, use --lazy-unpickle to keep loading lazy. It can be slightly slower but helps prevent OOM errors.
:::
Interview Q&A
Q: What does MergeKit's density parameter mean and how does it relate to DARE and TIES?
A: In MergeKit, density specifies the fraction of delta weights to keep. It equals 1 - drop_rate. For DARE methods, density=0.1 means drop 90% of delta weights (high sparsification). For TIES methods, density=0.2 means keep the top 20% by magnitude (equivalent to TIES keep_ratio=0.2). The parameter controls aggressiveness of sparsification - lower density means more aggressive pruning of delta weights before merging.
Q: Why can MergeKit merge models on CPU, and what are the memory implications?
A: MergeKit uses lazy layer-by-layer loading: it processes one weight tensor at a time, loading it from all source models, merging it, writing the result, and freeing the memory before processing the next tensor. In principle, peak RAM therefore scales with the tensors in flight rather than with the total size of all models; in practice, buffering and caching add overhead. For a 7B model merged from 2 sources, peak RAM is roughly 24GB instead of the 42GB a naive full load would need. This makes merging feasible on a single workstation with no GPU.
Q: How would you configure a MergeKit merge to use different blend ratios for different layer types?
A: MergeKit supports filter-based per-layer parameter configuration. Within the parameters block of a model entry, instead of a scalar value, you provide a list of filter-value pairs. Each filter is a substring matched against parameter names (e.g., "embed_tokens", "lm_head", "attn", "mlp"). The first matching filter's value is used for that parameter. For example, you can set density=0.5 for embedding layers (more conservative) and density=0.15 for middle transformer layers (more aggressive). A catch-all default value handles unmatched layers.
Q: What is the passthrough merge method in MergeKit and when would you use it?
A: Passthrough is a special method that copies layers directly from source models without any weight averaging - it enables building frankenmodels where different layers come from different source models. You specify slices with layer_range for each source model. This is useful for creating models that are larger than either source (by duplicating or inserting layers) or for experimenting with which layers contribute which capabilities. It's distinct from the other methods because it doesn't average weights - it selects complete layers.
Q: How would you run a hyperparameter search with MergeKit to find the optimal density and weight values?
A: The systematic approach is to write a Python script that generates multiple YAML configurations (varying density and weight values), runs mergekit-yaml for each via subprocess, evaluates each merged model on a held-out benchmark, and saves the results. For a 7B model, each merge takes 10-30 minutes on CPU (or 2-5 minutes on GPU). A typical sweep over density=[0.1, 0.2, 0.3] × weights=[0.5/0.5, 0.6/0.4, 0.7/0.3] requires 9 runs. MergeKit also provides a Python API (via MergeConfiguration and run_merge) that can be integrated directly into evaluation pipelines.
