How to use arcee-ai/mergekit to merge language models with YAML configuration, CPU-compatible layer-by-layer processing, and automated HuggingFace Hub upload.

MergeKit - The Practical Toolkit

The Tool That Made Merging Accessible

Before MergeKit, running TIES or DARE required implementing the algorithms from scratch, handling the loading and saving of large model checkpoints, managing memory carefully to avoid OOM errors, and writing the HuggingFace configuration files correctly. Each merge was a small engineering project.

MergeKit, developed by Charles Goddard at Arcee AI and released as open-source in late 2023, turned model merging into a YAML configuration exercise. You write a configuration file describing which models to merge and how, then run a single command. MergeKit handles everything else: downloading models from HuggingFace Hub, loading them layer by layer to manage memory, applying the correct algorithm, and saving the result in a format compatible with the transformers library.

The open-source community adopted MergeKit almost immediately. Within a few months of release, it was the standard tool for model merging experiments on HuggingFace Hub. Most of the top-ranked models on the Open LLM Leaderboard that are merges were produced with MergeKit.

This lesson covers MergeKit end-to-end: installation, YAML configuration for all supported methods, memory management, and the complete workflow from local merge to published HuggingFace model.

Installation and Setup

# Install from PyPI
pip install mergekit

# Or from source (for latest features)
git clone https://github.com/arcee-ai/mergekit
cd mergekit
pip install -e .

# For CUDA-accelerated merging (optional but recommended for large models),
# install a CUDA-enabled PyTorch build and pass --cuda to mergekit-yaml

# Verify installation
mergekit-yaml --help

MergeKit's dependencies are minimal: torch, transformers, safetensors, pydantic, and huggingface_hub. No special compute is required - all major merge methods work on CPU.

Understanding the Configuration Format

Every MergeKit merge is specified by a YAML file with this general structure:

merge_method: <method>      # linear | ties | dare_linear | dare_ties | slerp | passthrough
base_model: <hub_id_or_path>   # required for most methods

models:
  - model: <hub_id_or_path>
    parameters:
      weight: <float>         # contribution weight (for linear/ties)
      density: <float>        # keep ratio (for dare methods, = 1 - drop_rate)
      # ... other per-model params

parameters:
  # Global parameters that apply to all models
  normalize: <bool>           # normalize weights so they sum to 1
  int8_mask: <bool>           # use 8-bit masks to reduce memory
  # Method-specific parameters vary

dtype: bfloat16               # output dtype

The base_model is the reference model from which task vectors are computed. The models list contains the fine-tuned models being merged (not the base model itself - the base is separate).
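To make the role of base_model concrete, here is a minimal sketch of task-vector arithmetic on toy lists of floats. The helper names `task_vector` and `apply_task_vectors` are illustrative and are not part of MergeKit, which works on full tensors; the sketch only shows the idea of subtracting the base, weighting the deltas, and adding them back.

```python
def task_vector(finetuned, base):
    """Delta between a fine-tuned tensor and the base tensor (toy lists of floats)."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_task_vectors(base, vectors, weights):
    """Add weighted task vectors back onto the base parameters."""
    merged = list(base)
    for vec, w in zip(vectors, weights):
        merged = [m + w * d for m, d in zip(merged, vec)]
    return merged


base = [1.0, 2.0]
model_a = [1.5, 2.0]   # hypothetical fine-tune that moved the first parameter
model_b = [1.0, 1.0]   # hypothetical fine-tune that moved the second parameter

vectors = [task_vector(model_a, base), task_vector(model_b, base)]
merged = apply_task_vectors(base, vectors, weights=[0.5, 0.5])
print(merged)
```

Each fine-tune contributes only what it changed relative to the base, which is why the base is listed separately from the models being merged.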

Linear Merging

The simplest merge method: weighted average of task vectors.

# linear_merge.yaml
# Weighted average of two instruction-tuned models
merge_method: linear
base_model: meta-llama/Meta-Llama-3-8B

models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 0.6     # contribute 60% to the merge

  - model: NousResearch/Hermes-3-Llama-3.1-8B
    parameters:
      weight: 0.4     # contribute 40%

parameters:
  normalize: false    # don't force weights to sum to 1

dtype: bfloat16

mergekit-yaml linear_merge.yaml ./output-model \
  --copy-tokenizer \
  --allow-crimes   # required for some model combinations

The --copy-tokenizer flag copies the tokenizer files from the base model to the output directory, so the merged model is immediately loadable with AutoTokenizer.from_pretrained.
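The effect of the normalize option on a weighted average can be sketched on toy data. `linear_merge` below is a hypothetical helper written for illustration, not MergeKit's implementation; it assumes normalize simply rescales the weights so they sum to 1 before averaging.

```python
def linear_merge(tensors, weights, normalize=True):
    """Weighted average of same-shaped toy tensors (lists of floats)."""
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    size = len(tensors[0])
    return [sum(w * t[i] for w, t in zip(weights, tensors)) for i in range(size)]


model_a = [1.0, 2.0]
model_b = [3.0, 6.0]

# With normalize=True the weights 3:1 become 0.75 and 0.25
print(linear_merge([model_a, model_b], [3.0, 1.0], normalize=True))

# With normalize=False the raw weights are applied, scaling the result up
print(linear_merge([model_a, model_b], [3.0, 1.0], normalize=False))
```

When the weights already sum to 1, as with 0.6 and 0.4 in the configuration above, normalization changes nothing, and averaging parameters directly gives the same result as averaging task vectors around a shared base.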

TIES Merging

# ties_merge.yaml
# TIES merge: resolves sign conflicts across multiple models
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1

models:
  - model: WizardLM/WizardLM-7B-V1.0
    parameters:
      weight: 1.0     # equal weight for all models
      density: 0.2    # keep top 20% of delta weights (= TIES keep_ratio)

  - model: WizardMath/WizardMath-7B-V1.0
    parameters:
      weight: 1.0
      density: 0.2

  - model: WizardLM/WizardCoder-Python-7B-V1.0
    parameters:
      weight: 1.0
      density: 0.2

parameters:
  normalize: true        # normalize weights so they sum to 1

dtype: bfloat16

In MergeKit, density corresponds to TIES's keep_ratio: density=0.2 means keep the top 20% of delta weights by magnitude.
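The three TIES steps (trim by magnitude, elect a sign per parameter, average only the agreeing deltas) can be sketched on toy lists. This is a simplified illustration under stated assumptions: per-model weights are omitted, tensors are flat lists, and the helper names `trim` and `ties_merge` are hypothetical rather than MergeKit functions.

```python
def trim(delta, density):
    """Keep the largest-magnitude fraction `density` of a delta, zeroing the rest."""
    k = max(1, round(len(delta) * density))
    threshold = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    return [d if abs(d) >= threshold else 0.0 for d in delta]


def ties_merge(base, finetuned_models, density):
    """Trim each task vector, elect a sign per parameter, average agreeing deltas."""
    deltas = [
        trim([f - b for f, b in zip(model, base)], density)
        for model in finetuned_models
    ]
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Elected sign: the direction with the larger total magnitude
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + update)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [4.0, 0.1, -2.0, 0.2]
model_b = [-3.0, 0.1, -4.0, 0.2]   # disagrees with model_a on the first parameter

print(ties_merge(base, [model_a, model_b], density=0.5))
```

On the first parameter the two models pull in opposite directions; the positive side has more total magnitude, so only model_a's delta survives instead of the two cancelling out.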

DARE + Linear

# dare_linear.yaml
# DARE preprocessing + linear average
# Good for 2-3 models with some domain overlap
merge_method: dare_linear
base_model: meta-llama/Meta-Llama-3-8B

models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 0.5
      density: 0.1    # = 1 - drop_rate; density=0.1 means drop 90% of deltas

  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    parameters:
      weight: 0.5
      density: 0.1

dtype: bfloat16

The density parameter in DARE methods means "fraction of delta weights to keep" - the complement of the drop rate. density=0.1 → drop 90% → drop_rate=0.9.
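A minimal sketch of the drop-and-rescale step, assuming the standard DARE formulation in which each delta is kept with probability density and the survivors are divided by density so the expected value is unchanged. `dare_sparsify` is an illustrative helper, not a MergeKit function.

```python
import random


def dare_sparsify(delta, density, rng):
    """Keep each delta with probability `density`, rescaling survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


rng = random.Random(0)            # fixed seed so the run is repeatable
delta = [1.0] * 10_000            # toy task vector
sparse = dare_sparsify(delta, density=0.1, rng=rng)

kept = sum(1 for d in sparse if d != 0.0)
print(f"kept fraction: {kept / len(delta):.3f}")
print(f"mean before: {sum(delta) / len(delta):.3f}, after: {sum(sparse) / len(sparse):.3f}")
```

Roughly 10% of the entries survive, each scaled up by about 10x, so the mean of the task vector stays close to its original value even though 90% of it was dropped.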

DARE + TIES (The Gold Standard for 3+ Models)

# dare_ties_merge.yaml
# DARE preprocessing + TIES sign resolution
# Best choice for merging 3+ diverse models
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B

models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 1.0
      density: 0.15   # Keep a random 15% of delta weights (DARE drops at random, then rescales)

  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    parameters:
      weight: 1.0
      density: 0.15

  - model: <math_finetune_hub_id_or_path>   # placeholder: a math fine-tune of the same base
    parameters:
      weight: 1.0
      density: 0.15

parameters:
  normalize: true

dtype: bfloat16

SLERP Merging

# slerp_merge.yaml
# Spherical interpolation between exactly two models
merge_method: slerp
base_model: meta-llama/Meta-Llama-3-8B-Instruct   # must be one of the two models

models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
  - model: NousResearch/Meta-Llama-3-8B-Instruct

parameters:
  t: 0.5     # Global blend: 0.0 returns base_model, 1.0 returns the other model

dtype: bfloat16

For gradient SLERP (different blend per layer):

parameters:
  t:
    # Override t for specific layers by key pattern
    - filter: "model.embed_tokens"
      value: 0.3
    - filter: "model.layers.0"
      value: 0.4
    - filter: "model.layers.31"
      value: 0.6
    - filter: "lm_head"
      value: 0.3
    - value: 0.5   # default for everything else
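The interpolation that t controls can be sketched in a few lines. This is the textbook SLERP formula applied to toy vectors, with a fallback to linear interpolation when the vectors are nearly parallel; it is an illustration under those assumptions and not MergeKit's own code.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical interpolation between two vectors; t=0 gives v0, t=1 gives v1."""
    lerp = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 < eps or norm1 < eps:
        return lerp  # angle undefined for a zero vector
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        return lerp  # nearly parallel: SLERP degenerates to LERP
    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


v0, v1 = [1.0, 0.0], [0.0, 1.0]
mid = slerp(0.5, v0, v1)
print("slerp midpoint:", mid, "norm:", math.hypot(*mid))
print("lerp midpoint norm:", math.hypot(0.5, 0.5))
```

The SLERP midpoint keeps unit length, while the straight-line midpoint shrinks to about 0.71. Preserving magnitude along the arc is the usual argument for SLERP over a plain average when blending two models.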

Passthrough - Layer-Level Frankenmodels

The passthrough method enables building frankenmodels: assembling a model from layers of different source models without any weight averaging.

# frankenmodel.yaml
# Take the first 16 layers from model A and last 16 layers from model B
# (Creates a 32-layer "frankenmodel" from two 32-layer models)
merge_method: passthrough

slices:
  - sources:
      - model: meta-llama/Meta-Llama-3-8B-Instruct
        layer_range: [0, 16]    # First half from instruct model

  - sources:
      - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
        layer_range: [16, 32]   # Second half from code model

dtype: bfloat16

This is explored further in Lesson 07 (Frankenmodels).
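To see what a slices list produces, the sketch below expands it into the final layer stack. It assumes layer_range is half-open (start included, end excluded), which is consistent with the example above where [0, 16] and [16, 32] together yield 32 layers. `assemble_layers` is a hypothetical helper that only tracks layer indices; it does not touch weights.

```python
def assemble_layers(slices):
    """Expand passthrough slices into an ordered list of (model, source_layer) pairs."""
    stack = []
    for piece in slices:
        source = piece["sources"][0]          # passthrough uses one source per slice
        start, end = source["layer_range"]    # assumed half-open: [start, end)
        stack.extend((source["model"], i) for i in range(start, end))
    return stack


slices = [
    {"sources": [{"model": "model-a", "layer_range": [0, 16]}]},
    {"sources": [{"model": "model-b", "layer_range": [16, 32]}]},
]

stack = assemble_layers(slices)
print("total layers:", len(stack))
print("boundary:", stack[15], "->", stack[16])
```

Overlapping or repeated ranges would simply lengthen the stack, which is how passthrough merges end up deeper than either source model.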

Full Worked Example: Coding + Instruction Following Merge

Let's walk through a complete merge workflow: merging Llama-3-8B-Instruct with a code-specialized model using DARE+TIES.

Step 1: Write the Configuration

# llama3-code-instruct-merge.yaml
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B

models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 1.0
      density: 0.2

  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    # Note: Different base model! This may cause issues.
    # For production, use models that share the same base.
    parameters:
      weight: 0.7    # Lower weight for the non-base-aligned model
      density: 0.2

parameters:
  normalize: true

dtype: bfloat16
tokenizer_source: base   # Use base model's tokenizer

Step 2: Run the Merge

# Merge to local directory
mergekit-yaml llama3-code-instruct-merge.yaml ./merged-llama3-code-instruct \
  --copy-tokenizer \
  --trust-remote-code \
  --lazy-unpickle          # Use lazy loading to reduce peak RAM

# Options:
# --cuda                 # Use GPU for faster merging (optional)
# --low-cpu-memory       # Keep results and intermediates on the GPU (use with --cuda when VRAM > RAM)
# --allow-crimes         # Bypass safety checks (e.g., mismatched configs)
# --verbose              # Detailed progress output

Step 3: Verify the Merged Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load merged model
model_path = "./merged-llama3-code-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def test_model(prompt: str, max_new_tokens: int = 200) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            do_sample=True,
        )
    return tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Test instruction following
print("=== Instruction Following Test ===")
print(test_model("What is the difference between a list and a tuple in Python?"))

# Test code generation
print("\n=== Code Generation Test ===")
print(test_model("Write a Python function that finds the nth Fibonacci number using memoization."))

# Test that base capabilities are preserved
print("\n=== General Knowledge Test ===")
print(test_model("Explain the concept of attention in transformer models."))

Step 4: Evaluate on Benchmarks

# Quick benchmark using lm-evaluation-harness
# pip install lm-eval

import subprocess

model_path = "./merged-llama3-code-instruct"  # output directory from Step 2

benchmarks = [
    "mmlu",           # General knowledge
    "humaneval",      # Code generation
    "gsm8k",          # Math reasoning
    "hellaswag",      # Common sense
]

for benchmark in benchmarks:
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path},dtype=bfloat16",
        "--tasks", benchmark,
        "--device", "cuda:0",
        "--batch_size", "4",
        "--output_path", f"./eval-results/{benchmark}",
    ]
    print(f"Running {benchmark}...")
    subprocess.run(cmd, check=True)

Step 5: Upload to HuggingFace Hub

from huggingface_hub import HfApi

api = HfApi()

# Local merge output from Step 2
model_path = "./merged-llama3-code-instruct"

# Create repository
repo_id = "your-username/llama3-8b-code-instruct-dare-ties"
api.create_repo(repo_id=repo_id, repo_type="model", private=False, exist_ok=True)

# Write model card
model_card = """---
language:
  - en
license: llama3
base_model:
  - meta-llama/Meta-Llama-3-8B
  - meta-llama/Meta-Llama-3-8B-Instruct
  - deepseek-ai/deepseek-coder-7b-instruct-v1.5
tags:
  - merge
  - dare
  - ties
  - code
  - instruct
---

# Llama-3-8B Code+Instruct DARE-TIES Merge

A merge of Llama-3-8B-Instruct and DeepSeek-Coder-7B-Instruct using DARE+TIES.

## Merge Configuration

```yaml
merge_method: dare_ties
base_model: meta-llama/Meta-Llama-3-8B

models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
    parameters:
      weight: 1.0
      density: 0.2

  - model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
    parameters:
      weight: 0.7
      density: 0.2

parameters:
  normalize: true
dtype: bfloat16
```

## Evaluation Results

| Benchmark | Score |
|---|---|
| MMLU (5-shot) | TBD |
| HumanEval (pass@1) | TBD |
| GSM8K (8-shot) | TBD |

Merged using MergeKit.
"""

with open(f"{model_path}/README.md", "w") as f:
    f.write(model_card)

# Upload the entire model directory
api.upload_folder(
    folder_path=model_path,
    repo_id=repo_id,
    repo_type="model",
)
print(f"Model uploaded to: https://huggingface.co/{repo_id}")

Memory Management - CPU Merging at Scale

MergeKit's most important engineering feature is lazy layer-by-layer loading. Instead of loading all models into RAM simultaneously, it processes one layer (or one tensor) at a time:

```mermaid
flowchart TD
    A["Layer 0: embed_tokens<br/>Load from all models"]:::blue
    B["Merge Layer 0<br/>Apply algorithm"]:::green
    C["Save Layer 0<br/>Free RAM"]:::teal
    D["Layer 1: layers.0.attn<br/>Load from all models"]:::blue
    E["Merge Layer 1<br/>Apply algorithm"]:::green
    F["Save Layer 1<br/>Free RAM"]:::teal
    G["...continue for<br/>all 32 layers..."]:::purple
    H["Final: lm_head<br/>Merge and save"]:::orange

    A --> B --> C --> D --> E --> F --> G --> H

    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
    classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
```

RAM Requirements

Model Size	Models Merged	Peak RAM (MergeKit lazy)	Peak RAM (naive full load)
7B (BF16)	2	~24 GB	~42 GB
7B (BF16)	3	~28 GB	~56 GB
13B (BF16)	2	~40 GB	~78 GB
70B (BF16)	2	~160 GB	~420 GB

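The naive column is (N+1) × single_model_size, where N is the number of fine-tuned models and the extra copy is the base model: two 7B models in BF16 (about 14 GB each) plus the base come to roughly 42 GB. Lazy loading avoids holding every checkpoint in memory at once, so peak RAM drops well below that. Treat the lazy-loading figures as rough estimates; actual usage depends on checkpoint format, shard size, and merge method. For large models, this difference enables merging on commodity hardware that would otherwise run out of memory.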
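The naive full-load arithmetic is easy to reproduce. The sketch below assumes 2 bytes per parameter for BF16 and counts 1 GB as 10^9 bytes; `naive_peak_ram_gb` is an illustrative helper, and it covers only the naive case, since lazy-loading usage depends on runtime details that a formula like this does not capture.

```python
def naive_peak_ram_gb(params_billions, num_finetuned, bytes_per_param=2):
    """RAM needed to hold the base plus every fine-tuned model at once, in GB."""
    single_model_gb = params_billions * bytes_per_param   # 1e9 params x bytes / 1e9
    return (num_finetuned + 1) * single_model_gb          # +1 for the base model


for params, n in [(7, 2), (7, 3), (13, 2)]:
    print(f"{params}B x {n} fine-tunes + base: ~{naive_peak_ram_gb(params, n)} GB")
```

These values match the naive column for the 7B and 13B rows of the table.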

Enabling Lazy Loading

# Default MergeKit is already lazy, but you can be more explicit:
mergekit-yaml config.yaml ./output \
  --lazy-unpickle \
  --low-cpu-memory \
  --allow-crimes

# For extremely constrained environments (e.g., 32GB RAM for 7B models):
mergekit-yaml config.yaml ./output \
  --lazy-unpickle \
  --low-cpu-memory \
  --write-model-card    # Automatically write a model card

Advanced Configuration Patterns

Per-Layer Configuration with Filter

# different parameters for different layer types
merge_method: ties
base_model: meta-llama/Meta-Llama-3-8B

models:
  - model: model-a
    parameters:
      weight: 1.0
      density:
        - filter: "embed_tokens"
          value: 0.5      # More conservative for embeddings
        - filter: "lm_head"
          value: 0.5      # More conservative for output layer
        - value: 0.15     # Aggressive for middle layers

  - model: model-b
    parameters:
      weight: 1.0
      density:
        - filter: "embed_tokens"
          value: 0.5
        - filter: "lm_head"
          value: 0.5
        - value: 0.15

parameters:
  normalize: true
dtype: bfloat16
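How a filter list resolves to a value for a given tensor can be sketched as a first-match lookup. This assumes, as described in this lesson, that each filter is matched as a substring of the parameter name and that an entry without a filter acts as the default; `resolve_setting` is a hypothetical helper, not MergeKit's API.

```python
def resolve_setting(setting, tensor_name):
    """Return the value that applies to `tensor_name` (first matching rule wins)."""
    if not isinstance(setting, list):
        return setting                      # plain scalar applies everywhere
    for rule in setting:
        pattern = rule.get("filter")
        if pattern is None or pattern in tensor_name:
            return rule["value"]
    raise KeyError(f"no rule matches {tensor_name!r}")


density = [
    {"filter": "embed_tokens", "value": 0.5},
    {"filter": "lm_head", "value": 0.5},
    {"value": 0.15},                        # default for everything else
]

for name in [
    "model.embed_tokens.weight",
    "model.layers.7.mlp.up_proj.weight",
    "lm_head.weight",
]:
    print(name, "->", resolve_setting(density, name))
```

Because the first match wins, the catch-all entry belongs at the end of the list; placed first, it would shadow every filter after it.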

Tokenizer Handling

When merging models with different tokenizers (which usually means you shouldn't be merging them), MergeKit provides tokenizer_source to specify which tokenizer to use:

# Only use this if you're sure the weight spaces are compatible
# despite the different tokenizers
tokenizer_source: "union"  # Merge tokenizers (experimental)
# OR
tokenizer_source: "base"   # Use base model's tokenizer (recommended)
# OR
tokenizer_source: "model_a_path"  # Use specific model's tokenizer

Multi-Layer Slicing for Frankenmodels

merge_method: passthrough

# layer_range indexes transformer blocks only; embeddings, the final norm,
# and lm_head are not addressed through layer_range
slices:
  # Layers 0-15 from model A
  - sources:
      - model: model-a
        layer_range: [0, 16]

  # Layers 16-31 from model B
  - sources:
      - model: model-b
        layer_range: [16, 32]

Python API for Programmatic Use

For integration into automated workflows, MergeKit provides a Python API:

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge
import yaml

# Define configuration programmatically
config_dict = {
    "merge_method": "dare_ties",
    "base_model": "meta-llama/Meta-Llama-3-8B",
    "models": [
        {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "parameters": {"weight": 1.0, "density": 0.2}
        },
        {
            "model": "deepseek-ai/deepseek-coder-7b-instruct-v1.5",
            "parameters": {"weight": 0.7, "density": 0.2}
        }
    ],
    "parameters": {"normalize": True},
    "dtype": "bfloat16"
}

# Or load the same structure from a YAML file instead:
# with open("my_merge_config.yaml") as f:
#     config_dict = yaml.safe_load(f)

# Parse and validate configuration
config = MergeConfiguration.model_validate(config_dict)

# Run merge
options = MergeOptions(
    cuda=False,           # CPU merge
    copy_tokenizer=True,
    lazy_unpickle=True,
    low_cpu_memory=False,  # True keeps intermediates on the GPU, so it only helps with cuda=True
)

run_merge(config, out_path="./merged-model", options=options)
print("Merge complete!")

Automated Hyperparameter Search

import copy
import itertools
import json
from pathlib import Path
import subprocess

def run_mergekit_sweep(
    base_config: dict,
    sweep_params: dict,
    eval_command: str,
    output_dir: str = "./sweep-results",
):
    """
    Run a grid search over MergeKit hyperparameters.

    sweep_params: {param_path: [values]}
    Example: {"models.0.parameters.density": [0.1, 0.2, 0.3],
               "models.1.parameters.density": [0.1, 0.2]}
    eval_command: shell command containing "{model_dir}" that prints a single score
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results = []

    param_names = list(sweep_params.keys())
    param_values = list(sweep_params.values())

    for combo in itertools.product(*param_values):
        # Build config for this combo (deep copy so nested edits don't mutate base_config)
        config = copy.deepcopy(base_config)
        label_parts = []

        for param_path, value in zip(param_names, combo):
            # Set nested parameter
            keys = param_path.split(".")
            obj = config
            for key in keys[:-1]:
                if key.isdigit():
                    obj = obj[int(key)]
                else:
                    obj = obj[key]
            last_key = keys[-1]
            if last_key.isdigit():
                obj[int(last_key)] = value
            else:
                obj[last_key] = value
            label_parts.append(f"{keys[-1]}={value}")

        label = "_".join(label_parts)
        model_dir = f"{output_dir}/{label}"

        # Write config file
        config_path = f"{output_dir}/{label}_config.yaml"
        with open(config_path, "w") as f:
            import yaml
            yaml.dump(config, f)

        # Run merge
        print(f"\nRunning: {label}")
        subprocess.run(
            ["mergekit-yaml", config_path, model_dir, "--lazy-unpickle"],
            check=True
        )

        # Evaluate
        score = float(
            subprocess.check_output(
                eval_command.replace("{model_dir}", model_dir),
                shell=True
            ).decode().strip()
        )

        result = {"label": label, "score": score, "params": dict(zip(param_names, combo))}
        results.append(result)
        print(f"  Score: {score:.4f}")

        # Save running results
        with open(f"{output_dir}/results.json", "w") as f:
            json.dump(sorted(results, key=lambda x: -x["score"]), f, indent=2)

    best = max(results, key=lambda x: x["score"])
    print(f"\nBest configuration: {best['label']} | score={best['score']:.4f}")
    return results

Common Mistakes

:::danger Don't forget --copy-tokenizer
Without --copy-tokenizer, the merged model directory won't contain tokenizer files. The model will fail to load with AutoTokenizer.from_pretrained. Always include --copy-tokenizer.
:::

:::warning density in MergeKit is the complement of drop_rate
In DARE/DARE+TIES configurations, density means "fraction of delta weights to KEEP" - it's 1 - drop_rate. density=0.2 keeps a random 20% and drops 80%; density=0.1 drops 90%. Research papers usually quote the drop rate instead, so convert before copying values. (In plain TIES, density=0.2 keeps the top 20% by magnitude rather than a random subset.)
:::

:::danger Models must share the same base for TIES/DARE methods
MergeKit won't always error out if you provide models with different bases - it will often complete the merge, but the result will be poor. The --allow-crimes flag bypasses safety checks that are there for good reason. Only use it when you know what you're doing.
:::

:::tip Use --lazy-unpickle by default for any model above 7B
Without it, MergeKit may pull whole checkpoints into RAM before processing, particularly for pickle-format (.bin) weights. For models above 7B, pass --lazy-unpickle to keep loading layer by layer. It's slightly slower but helps prevent OOM errors.
:::

Interview Q&A

Q: What does MergeKit's density parameter mean and how does it relate to DARE and TIES?

A: In MergeKit, density specifies the fraction of delta weights to keep. It equals 1 - drop_rate. For DARE methods, density=0.1 means drop 90% of delta weights (high sparsification). For TIES methods, density=0.2 means keep the top 20% by magnitude (equivalent to TIES keep_ratio=0.2). The parameter controls aggressiveness of sparsification - lower density means more aggressive pruning of delta weights before merging.

Q: Why can MergeKit merge models on CPU, and what are the memory implications?

A: MergeKit uses lazy layer-by-layer loading: it processes one weight tensor at a time, loading it from all source models, merging it, writing the result, and freeing the memory before processing the next tensor. Peak RAM therefore depends on the tensors in flight rather than on the combined size of every checkpoint. For a 7B model merged from 2 sources, the estimate given earlier is roughly 24 GB of peak RAM instead of 42 GB. Because the merge itself is element-wise arithmetic on tensors, it runs on a CPU-only workstation; a GPU only speeds it up.

Q: How would you configure a MergeKit merge to use different blend ratios for different layer types?

A: MergeKit supports filter-based per-layer parameter configuration. Within the parameters block of a model entry, instead of a scalar value, you provide a list of filter-value pairs. Each filter is a substring matched against parameter names (e.g., "embed_tokens", "lm_head", "attn", "mlp"). The first matching filter's value is used for that parameter. For example, you can set density=0.5 for embedding layers (more conservative) and density=0.15 for middle transformer layers (more aggressive). A catch-all default value handles unmatched layers.

Q: What is the passthrough merge method in MergeKit and when would you use it?

A: Passthrough is a special method that copies layers directly from source models without any weight averaging - it enables building frankenmodels where different layers come from different source models. You specify slices with layer_range for each source model. This is useful for creating models that are larger than either source (by duplicating or inserting layers) or for experimenting with which layers contribute which capabilities. It's distinct from the other methods because it doesn't average weights - it selects complete layers.

Q: How would you run a hyperparameter search with MergeKit to find the optimal density and weight values?

A: The systematic approach is to write a Python script that generates multiple YAML configurations (varying density and weight values), runs mergekit-yaml for each via subprocess, evaluates each merged model on a held-out benchmark, and saves the results. For a 7B model, each merge takes 10-30 minutes on CPU (or 2-5 minutes on GPU). A typical sweep over density=[0.1, 0.2, 0.3] × weights=[0.5/0.5, 0.6/0.4, 0.7/0.3] requires 9 runs. MergeKit also provides a Python API (via MergeConfiguration and run_merge) that can be integrated directly into evaluation pipelines.
