How spherical linear interpolation provides smoother, geometrically correct blending between two model weight configurations than simple linear averaging.

SLERP - Spherical Linear Interpolation

Why Straight Lines Aren't Always the Best Path

Imagine you're standing at one end of a curved mountain ridge, and your destination is at the other end. There are two ways to get there: walk in a straight line through the mountain (which means tunneling through solid rock), or walk along the ridge itself (which curves around the surface).

The straight-line path is shorter in Euclidean distance, but it passes through the interior of the mountain - which doesn't actually exist for you. The ridge path follows the actual surface that's accessible.

This metaphor describes the difference between LERP (Linear intERPolation) and SLERP (Spherical Linear intERPolation) when applied to model weight merging. Both methods interpolate between two weight configurations, but they travel different paths through parameter space. LERP travels the straight-line chord. SLERP travels along the arc of the sphere defined by the two weight vectors.

For model merging, the sphere matters because trained weight vectors sit at particular norms (lengths) that correspond to well-functioning models. The straight-line path dips through the interior of the sphere, into regions of lower norm where the blended weights are shrunk relative to both sources and the result has a combined-but-diluted character. SLERP stays on the surface, preserving norms throughout the interpolation.

The Mathematics of SLERP

LERP Recap

Standard linear interpolation between vectors $\mathbf{v}_1$ and $\mathbf{v}_2$:

$\text{LERP}(\mathbf{v}_1, \mathbf{v}_2, t) = (1-t) \cdot \mathbf{v}_1 + t \cdot \mathbf{v}_2$

At $t=0$ you get $\mathbf{v}_1$; at $t=1$ you get $\mathbf{v}_2$; at $t=0.5$ you get the midpoint. When the two endpoints are distinct and share the same norm, the midpoint has lower norm than either endpoint (it's inside the sphere defined by the endpoints' common norm).
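To make the norm shrinkage concrete, here is a minimal, dependency-free sketch with two orthogonal unit vectors standing in for weight tensors. The helper names `lerp` and `norm` are illustrative and not taken from any library.

```python
import math

def lerp(v1, v2, t):
    """Linear interpolation between two equal-length vectors (lists of floats)."""
    return [(1 - t) * a + t * b for a, b in zip(v1, v2)]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

v1 = [1.0, 0.0]   # unit vector along x
v2 = [0.0, 1.0]   # unit vector along y, 90 degrees away from v1

mid = lerp(v1, v2, 0.5)
print(mid)         # [0.5, 0.5]
print(norm(mid))   # about 0.7071, shorter than either endpoint (norm 1)
```

Both endpoints have norm 1, but the midpoint has lost roughly 29% of that length. This is the "tunnel through the mountain" in numbers.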

SLERP Formula

Spherical linear interpolation travels along the great circle arc between the two vectors:

$\text{SLERP}(\mathbf{v}_1, \mathbf{v}_2, t) = \frac{\sin((1-t)\Omega)}{\sin(\Omega)} \mathbf{v}_1 + \frac{\sin(t\Omega)}{\sin(\Omega)} \mathbf{v}_2$

Where $\Omega$ is the angle between $\mathbf{v}_1$ and $\mathbf{v}_2$:

$\Omega = \arccos\left(\frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\|\mathbf{v}_1\| \cdot \|\mathbf{v}_2\|}\right)$

SLERP preserves the norm of the interpolated vector throughout the path. At any $t$, the interpolated vector has the same magnitude as the original vectors (when they have equal norm).
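The norm-preservation claim can be checked directly from the formula. The short derivation below assumes unit-norm inputs, so that $\mathbf{v}_1 \cdot \mathbf{v}_2 = \cos\Omega$, and uses the shorthand $a = (1-t)\Omega$, $b = t\Omega$ (introduced here for brevity, so $a + b = \Omega$).

```latex
\begin{aligned}
\|\text{SLERP}(\mathbf{v}_1, \mathbf{v}_2, t)\|^2
  &= \frac{\sin^2 a + \sin^2 b + 2 \sin a \sin b \cos\Omega}{\sin^2\Omega} \\
  &= \frac{\sin^2 a + \sin^2 b - 2 \sin^2 a \sin^2 b + 2 \sin a \cos a \sin b \cos b}{\sin^2\Omega}
     \qquad (\cos\Omega = \cos a \cos b - \sin a \sin b) \\
  &= \frac{\sin^2 a \cos^2 b + \cos^2 a \sin^2 b + 2 \sin a \cos a \sin b \cos b}{\sin^2\Omega} \\
  &= \frac{(\sin a \cos b + \cos a \sin b)^2}{\sin^2\Omega}
   = \frac{\sin^2(a + b)}{\sin^2\Omega} = 1
\end{aligned}
```

For equal-norm inputs of length $r$ the same argument gives a constant norm of $r$. When the two norms differ, the raw formula no longer keeps the norm constant.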

Why the Angle Matters

When the two model weight vectors are nearly parallel (small $\Omega$), LERP and SLERP give very similar results - the chord and arc are almost identical for small angles. When the vectors are closer to orthogonal (large $\Omega$), SLERP departs significantly from LERP, traveling a longer arc but maintaining norms.

For typical pairs of fine-tuned models from the same base, $\Omega$ is relatively small - the models haven't diverged far from each other. But it's not negligible, and for tasks that are quite different from each other, the angle can be significant.
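How much the angle matters can be quantified: for two equal-norm vectors separated by $\Omega$, the LERP midpoint keeps a fraction $\cos(\Omega/2)$ of the original norm. The sketch below checks that closed form against an explicit 2D construction; `lerp_midpoint_norm_ratio` is an illustrative helper, not a library function.

```python
import math

def lerp_midpoint_norm_ratio(angle_deg: float) -> float:
    """Norm of the LERP midpoint of two unit vectors `angle_deg` apart.

    The two vectors are built explicitly in 2D and averaged, so the
    result can be compared against the closed form cos(angle / 2).
    """
    omega = math.radians(angle_deg)
    v1 = (1.0, 0.0)
    v2 = (math.cos(omega), math.sin(omega))
    mid = ((v1[0] + v2[0]) / 2, (v1[1] + v2[1]) / 2)
    return math.hypot(mid[0], mid[1])

for angle in (1, 5, 15, 45, 90):
    ratio = lerp_midpoint_norm_ratio(angle)
    closed_form = math.cos(math.radians(angle) / 2)
    print(f"{angle:>3} deg: midpoint norm = {ratio:.4f} (cos(omega/2) = {closed_form:.4f})")
```

At 5° the LERP midpoint loses about 0.1% of the norm, at 15° about 0.9%, and at 90° about 29%. This is why the two methods are nearly interchangeable for closely related fine-tunes and diverge as the angle grows.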

SLERP Implementation

import torch
from safetensors.torch import load_file, save_file


def slerp_tensors(
    v1: torch.Tensor,
    v2: torch.Tensor,
    t: float,
    eps: float = 1e-8,
) -> torch.Tensor:
    """
    Spherical linear interpolation between two tensors.

    Parameters
    ----------
    v1, v2 : tensors to interpolate (must be same shape)
    t      : interpolation parameter in [0, 1]
              0 -> v1, 1 -> v2, 0.5 -> equal blend
    eps    : small value to avoid numerical issues with near-parallel vectors

    Returns
    -------
    Interpolated tensor.
    """
    assert v1.shape == v2.shape, "Tensors must have the same shape"
    assert 0.0 <= t <= 1.0, "t must be in [0, 1]"

    # Flatten to 1D for computation
    original_shape = v1.shape
    v1_flat = v1.float().flatten()
    v2_flat = v2.float().flatten()

    # Compute norms
    v1_norm = torch.norm(v1_flat)
    v2_norm = torch.norm(v2_flat)

    # Handle near-zero vectors (fall back to LERP)
    if v1_norm < eps or v2_norm < eps:
        return ((1 - t) * v1_flat + t * v2_flat).reshape(original_shape)

    # Normalize to unit sphere
    v1_unit = v1_flat / v1_norm
    v2_unit = v2_flat / v2_norm

    # Compute the angle between the vectors
    dot = torch.clamp(torch.dot(v1_unit, v2_unit), -1.0, 1.0)
    omega = torch.acos(dot)

    # Handle degenerate angles: sin(omega) ≈ 0 when the vectors are parallel
    # (omega ≈ 0) or antiparallel (omega ≈ pi). In float32, sin(pi) evaluates
    # to roughly 1e-7 rather than 0, so compare against a looser tolerance
    # than eps. SLERP is numerically unstable here, so fall back to LERP.
    sin_omega = torch.sin(omega)
    if sin_omega.abs() < 1e-6:
        return ((1 - t) * v1_flat + t * v2_flat).reshape(original_shape)

    # SLERP formula
    coeff1 = torch.sin((1 - t) * omega) / sin_omega
    coeff2 = torch.sin(t * omega) / sin_omega

    # Interpolate the direction on the unit sphere, then restore the scale.
    # The norm is interpolated linearly between the two inputs, so for
    # equal-norm vectors the original norm is preserved.
    avg_norm = (1 - t) * v1_norm + t * v2_norm  # linearly interpolated norm
    result_unit = coeff1 * v1_unit + coeff2 * v2_unit
    result = result_unit * avg_norm

    return result.reshape(original_shape)


def slerp_models(
    model_a_path: str,
    model_b_path: str,
    output_path: str,
    t: float = 0.5,
    dtype: torch.dtype = torch.bfloat16,
    verbose: bool = True,
) -> None:
    """
    SLERP merge of exactly two models.

    SLERP operates per-tensor (per layer), not on the concatenated weight vector.
    Each weight tensor is interpolated independently on its own sphere.

    Parameters
    ----------
    model_a_path : path to first model safetensors
    model_b_path : path to second model safetensors
    output_path  : where to save the merged model
    t            : blend factor (0 = all model A, 1 = all model B)
    """
    if verbose:
        print(f"SLERP merge: t={t}")
        print(f"  Model A: {model_a_path}")
        print(f"  Model B: {model_b_path}")

    model_a = load_file(model_a_path)
    model_b = load_file(model_b_path)

    assert set(model_a.keys()) == set(model_b.keys()), \
        "Models must have identical parameter sets (same architecture)"

    result = {}
    angles = []

    for key in model_a:
        v1 = model_a[key].float()
        v2 = model_b[key].float()

        # Track the angle between models for diagnostics
        v1_flat = v1.flatten()
        v2_flat = v2.flatten()
        if torch.norm(v1_flat) > 1e-8 and torch.norm(v2_flat) > 1e-8:
            cos_sim = torch.dot(v1_flat / torch.norm(v1_flat),
                                v2_flat / torch.norm(v2_flat))
            angle = torch.rad2deg(torch.acos(cos_sim.clamp(-1, 1))).item()
            angles.append(angle)

        result[key] = slerp_tensors(v1, v2, t).to(dtype)

    if verbose and angles:
        avg_angle = sum(angles) / len(angles)
        max_angle = max(angles)
        print(f"  Average angle between models: {avg_angle:.2f}°")
        print(f"  Max angle (most divergent layer): {max_angle:.2f}°")
        print(f"  (Smaller angles = SLERP closer to LERP)")

    save_file(result, output_path)
    if verbose:
        print(f"  SLERP merged model saved to {output_path}")


def slerp_t_sweep(
    model_a_path: str,
    model_b_path: str,
    evaluate_fn,
    t_values: list[float] | None = None,
) -> list[tuple[float, float]]:
    """
    Evaluate SLERP at multiple t values to find optimal blend.
    """
    import tempfile, os
    if t_values is None:
        t_values = [0.3, 0.4, 0.5, 0.6, 0.7]

    results = []
    for t in t_values:
        with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
            tmp_path = f.name
        try:
            slerp_models(model_a_path, model_b_path, tmp_path, t, verbose=False)
            score = evaluate_fn(load_file(tmp_path))
            results.append((t, score))
            print(f"  t={t:.2f} | score={score:.4f}")
        finally:
            os.unlink(tmp_path)

    best_t, best_score = max(results, key=lambda x: x[1])
    print(f"\nBest: t={best_t:.2f} | score={best_score:.4f}")
    return results
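One design choice in `slerp_tensors` above is worth spelling out: it does not apply the SLERP formula to the raw tensors. It interpolates the direction on the unit sphere and the norm linearly. The two approaches agree for equal-norm inputs and differ otherwise. A minimal pure-Python sketch of the difference follows; `slerp_raw` and `slerp_rescaled` are illustrative names, and neither includes the small-angle fallback.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def slerp_raw(v1, v2, t):
    """Apply the SLERP formula directly to the unnormalized vectors."""
    cos_omega = sum(a * b for a, b in zip(v1, v2)) / (norm(v1) * norm(v2))
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    c1 = math.sin((1 - t) * omega) / math.sin(omega)
    c2 = math.sin(t * omega) / math.sin(omega)
    return [c1 * a + c2 * b for a, b in zip(v1, v2)]

def slerp_rescaled(v1, v2, t):
    """SLERP on unit directions, then scale by a linearly interpolated norm."""
    n1, n2 = norm(v1), norm(v2)
    u1 = [a / n1 for a in v1]
    u2 = [b / n2 for b in v2]
    direction = slerp_raw(u1, u2, t)
    target_norm = (1 - t) * n1 + t * n2
    return [target_norm * x for x in direction]

v1 = [1.0, 0.0]   # norm 1
v2 = [0.0, 3.0]   # norm 3, orthogonal to v1

print(norm(slerp_raw(v1, v2, 0.5)))       # about 2.236 (sqrt(5))
print(norm(slerp_rescaled(v1, v2, 0.5)))  # about 2.0, halfway between norms 1 and 3
```

With the rescaled variant the norm moves smoothly from one model's scale to the other's, which is the behavior the implementation above encodes in `avg_norm`.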

LERP vs SLERP - When the Difference Matters

The practical difference between LERP and SLERP for model merging is often small but occasionally significant. Here's when to care:

Measuring the Angle

def measure_model_angle(model_a_path: str, model_b_path: str) -> dict:
    """
    Compute the average angular distance between two models' weight tensors.
    Useful for deciding whether SLERP will differ meaningfully from LERP.
    """
    model_a = load_file(model_a_path)
    model_b = load_file(model_b_path)

    angles_by_layer_type = {}
    all_angles = []

    for key in model_a:
        if key not in model_b:
            continue
        v1 = model_a[key].float().flatten()
        v2 = model_b[key].float().flatten()
        n1, n2 = torch.norm(v1), torch.norm(v2)
        if n1 < 1e-8 or n2 < 1e-8:
            continue

        cos_sim = torch.dot(v1/n1, v2/n2).clamp(-1, 1)
        angle_deg = torch.rad2deg(torch.acos(cos_sim)).item()

        # Categorize by layer type. Check "norm" before attention so that
        # names like "post_attention_layernorm" are counted as norm layers.
        if "embed" in key:
            layer_type = "embedding"
        elif "norm" in key:
            layer_type = "norm"
        elif "attn" in key or "attention" in key:
            layer_type = "attention"
        elif "mlp" in key:
            layer_type = "mlp"
        else:
            layer_type = "other"

        angles_by_layer_type.setdefault(layer_type, []).append(angle_deg)
        all_angles.append(angle_deg)

    print(f"Overall average angle: {sum(all_angles)/len(all_angles):.2f}°")
    for layer_type, angles in sorted(angles_by_layer_type.items()):
        avg = sum(angles) / len(angles)
        print(f"  {layer_type:12s}: {avg:.2f}° average")

    # Interpretation
    avg = sum(all_angles) / len(all_angles)
    if avg < 5:
        print("\nVerdict: Very small angles. LERP ≈ SLERP. Use either.")
    elif avg < 15:
        print("\nVerdict: Moderate angles. SLERP likely 1-2% better than LERP.")
    else:
        print("\nVerdict: Large angles. SLERP meaningfully different from LERP. Test both.")

    return {"overall_avg_angle": avg, "by_layer": angles_by_layer_type}

SLERP in Practice - The t Parameter

The interpolation parameter $t$ controls the blend:

| t value | Resulting model character |
|---------|---------------------------|
| 0.0 | Identical to model A |
| 0.1-0.3 | Mostly model A, slightly influenced by B |
| 0.5 | Equal blend - midpoint on the arc |
| 0.7-0.9 | Mostly model B, slightly influenced by A |
| 1.0 | Identical to model B |

The optimal $t$ is almost never exactly 0.5. If model A is stronger at the primary task and model B contributes mostly stylistic properties, you might find $t=0.3$ (30% toward B) works best. Always sweep.
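One way to read $t$ geometrically: SLERP moves at constant angular speed, so the blend sits at a fraction $t$ of the angle from A to B. Below is a minimal sketch with 2D unit vectors standing in for weight tensors. The helpers `slerp_unit` and `angle_deg` are illustrative, and the 60° separation is an arbitrary choice for the demonstration.

```python
import math

def slerp_unit(v1, v2, t):
    """SLERP between two unit vectors that are neither parallel nor antiparallel."""
    omega = math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(v1, v2)))))
    c1 = math.sin((1 - t) * omega) / math.sin(omega)
    c2 = math.sin(t * omega) / math.sin(omega)
    return [c1 * a + c2 * b for a, b in zip(v1, v2)]

def angle_deg(u, v):
    """Angle in degrees between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

model_a = [1.0, 0.0]
model_b = [math.cos(math.radians(60)), math.sin(math.radians(60))]  # 60 degrees from A

for t in (0.0, 0.3, 0.5, 1.0):
    blend = slerp_unit(model_a, model_b, t)
    print(f"t={t:.1f}: {angle_deg(model_a, blend):5.1f} deg from A, "
          f"{angle_deg(blend, model_b):5.1f} deg from B")
```

At $t=0.3$ the blend is 18° from A and 42° from B, that is, exactly 30% of the way along the arc.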

Per-Layer SLERP

An advanced technique: use different $t$ values for different layer groups. The optimal blend ratio may differ between early layers (which encode syntax and basic language) and later layers (which encode task-specific behavior).

def slerp_models_per_layer(
    model_a_path: str,
    model_b_path: str,
    output_path: str,
    t_by_layer_type: dict[str, float] | None = None,
    default_t: float = 0.5,
) -> None:
    """
    SLERP with different blend ratios for different layer types.

    t_by_layer_type: dict mapping layer identifier substring to t value
    Example: {"embed": 0.3, "attn": 0.5, "mlp": 0.6, "norm": 0.5}
    """
    if t_by_layer_type is None:
        t_by_layer_type = {}

    model_a = load_file(model_a_path)
    model_b = load_file(model_b_path)
    result = {}

    for key in model_a:
        # Determine t for this layer
        t = default_t
        for layer_identifier, layer_t in t_by_layer_type.items():
            if layer_identifier in key:
                t = layer_t
                break

        result[key] = slerp_tensors(
            model_a[key].float(),
            model_b[key].float(),
            t
        ).bfloat16()

    save_file(result, output_path)
    print(f"Per-layer SLERP saved to {output_path}")


# Example: blend embeddings conservatively, MLP layers more aggressively
# slerp_models_per_layer(
#     model_a_path="llama3-instruct",
#     model_b_path="llama3-code",
#     output_path="llama3-instruct-code-slerp",
#     t_by_layer_type={
#         "embed_tokens": 0.2,    # Keep mostly from model A (instruct)
#         "lm_head": 0.2,         # Same - output distribution stays close to A
#         "self_attn": 0.5,       # Equal blend (Hugging Face Llama names attention tensors "self_attn")
#         "mlp": 0.6,             # Lean toward model B (code) for MLP
#     },
#     default_t=0.5,
# )
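Because `slerp_models_per_layer` picks the first substring that matches a parameter name, both the choice of substrings and their order in the dict matter. The sketch below isolates that lookup in a hypothetical helper, `resolve_t`, and runs it on parameter names that follow the Hugging Face Llama naming scheme.

```python
def resolve_t(key: str, t_by_layer_type: dict[str, float], default_t: float = 0.5) -> float:
    """Return the t of the first substring (in dict insertion order) found in `key`."""
    for identifier, layer_t in t_by_layer_type.items():
        if identifier in key:
            return layer_t
    return default_t

t_map = {"embed_tokens": 0.2, "lm_head": 0.2, "self_attn": 0.5, "mlp": 0.6, "norm": 0.5}

for key in (
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.post_attention_layernorm.weight",
    "lm_head.weight",
):
    print(f"{key:50s} -> t={resolve_t(key, t_map)}")

# Order matters: with "attention" listed before "norm", the layernorm tensor
# named post_attention_layernorm picks up the attention value instead.
print(resolve_t("model.layers.0.post_attention_layernorm.weight",
                {"attention": 0.9, "norm": 0.5}))  # 0.9
```

Printing the resolved `t` for a few real parameter names before running a merge is a cheap way to catch substrings that match nothing, or match more than intended.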

SLERP Limitations - Why It Doesn't Scale to Multiple Models

SLERP's fundamental limitation is that it only works for exactly two models. The formula requires computing the angle between two vectors - with three or more vectors, there's no unique "great circle" to interpolate along.

For multi-model merging (3+ models), SLERP practitioners typically:

1. Sequential SLERP: SLERP model A and B, then SLERP the result with C, then with D, etc. This is order-dependent and computationally sequential.
2. SLERP + TIES: Use SLERP to pre-merge pairs of highly compatible models, then TIES to merge the pre-merged pairs. This is a common community approach.
3. Abandon SLERP: For 3+ diverse models, TIES or DARE+TIES often outperforms sequential SLERP. The angular interpolation advantage doesn't compound across multiple sequential merges the way TIES's majority-vote sign election does.

def sequential_slerp(
    model_paths: list[str],
    output_path: str,
    t: float = 0.5,
) -> None:
    """
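    Sequential SLERP for multiple models.
    Merges models one at a time: ((A ⊕ B) ⊕ C) ⊕ D ...

    Note: Results are order-dependent. The first pair gets merged, then
    the result is merged with the next model, etc.

    Note: the `t` argument is currently unused. Each step uses
    t = 1/(i+1) so that every model receives equal nominal weight.
    """
    import tempfile, os
    assert len(model_paths) >= 2

    # Start from the first model and fold in the rest one at a time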
    tmp = model_paths[0]
    for i in range(1, len(model_paths)):
        is_last = (i == len(model_paths) - 1)
        if is_last:
            next_tmp = output_path
        else:
            with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
                next_tmp = f.name

        # When merging the running result with the next model, adjust t
        # so that all models receive equal nominal weight.
        # At step i the running result already covers i models, so the
        # incoming (i+1)th model gets weight 1/(i+1).
        effective_t = 1.0 / (i + 1)

        print(f"  SLERP step {i}: merging running result with model {i+1}, t={effective_t:.2f}")
        slerp_models(tmp, model_paths[i], next_tmp, t=effective_t, verbose=False)

        if i > 1:  # Clean up intermediate files
            try:
                os.unlink(tmp)
            except Exception:
                pass
        tmp = next_tmp

    print(f"Sequential SLERP complete: {output_path}")
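The order dependence is easy to see on a toy problem. The sketch below applies the same `t = 1/(i+1)` schedule as `sequential_slerp` to three orthogonal unit vectors in 3D, once in each order. `slerp_unit` and `sequential_slerp_vectors` are illustrative stand-ins that operate on plain lists rather than checkpoints.

```python
import math

def slerp_unit(v1, v2, t):
    """SLERP between two unit vectors (no fallback for parallel inputs)."""
    omega = math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(v1, v2)))))
    c1 = math.sin((1 - t) * omega) / math.sin(omega)
    c2 = math.sin(t * omega) / math.sin(omega)
    return [c1 * a + c2 * b for a, b in zip(v1, v2)]

def sequential_slerp_vectors(vectors):
    """Fold vectors together with t = 1/(i+1) at step i, as in sequential_slerp."""
    running = vectors[0]
    for i in range(1, len(vectors)):
        running = slerp_unit(running, vectors[i], 1.0 / (i + 1))
    return running

a, b, c = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

abc = sequential_slerp_vectors([a, b, c])
cba = sequential_slerp_vectors([c, b, a])
print([round(x, 3) for x in abc])  # [0.612, 0.612, 0.5]
print([round(x, 3) for x in cba])  # [0.5, 0.612, 0.612]
```

The two orders give different results, and neither lands on the symmetric direction (about 0.577 in every coordinate) that would treat all three inputs equally. The last model merged is always the one that ends up under-represented here.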

SLERP in MergeKit - The Practical Configuration

In MergeKit (Lesson 06), SLERP is configured via YAML:

# MergeKit SLERP configuration (sketch; verify field names against the
# MergeKit README for your installed version)
merge_method: slerp
# For slerp, base_model must be one of the two models being merged.
# t = 0 returns base_model; t = 1 returns the other model.
base_model: meta-llama/Meta-Llama-3-8B-Instruct

models:
  - model: meta-llama/Meta-Llama-3-8B-Instruct
  - model: some-code-model/code-llama-8b   # placeholder name

parameters:
  t: 0.5     # global interpolation factor

# SLERP with gradient (vary t per layer):
# parameters:
#   t:
#     - filter: "model.embed_tokens"
#       value: 0.3
#     - filter: "model.norm"
#       value: 0.5
#     - value: 0.5   # default

SLERP vs LERP - A Practical Comparison

def compare_slerp_lerp(
    model_a_path: str,
    model_b_path: str,
    evaluate_fn,
    t_values: list[float] = [0.3, 0.5, 0.7],
) -> None:
    """
    Compare SLERP and LERP side by side for a given model pair.
    """
    import tempfile, os

    print("Comparing SLERP vs LERP at multiple blend values:")
    print(f"{'t':>4} | {'LERP Score':>12} | {'SLERP Score':>12} | {'Delta':>8}")
    print("-" * 45)

    # Load both checkpoints once; they are the same for every t.
    ma = load_file(model_a_path)
    mb = load_file(model_b_path)

    for t in t_values:
        # LERP
        with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
            lerp_path = f.name
        lerp_model = {}
        for key in ma:
            lerp_model[key] = ((1 - t) * ma[key].float() + t * mb[key].float()).bfloat16()
        save_file(lerp_model, lerp_path)
        lerp_score = evaluate_fn(load_file(lerp_path))

        # SLERP
        with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
            slerp_path = f.name
        slerp_models(model_a_path, model_b_path, slerp_path, t, verbose=False)
        slerp_score = evaluate_fn(load_file(slerp_path))

        delta = slerp_score - lerp_score
        delta_str = f"+{delta:.4f}" if delta > 0 else f"{delta:.4f}"
        print(f"{t:>4.2f} | {lerp_score:>12.4f} | {slerp_score:>12.4f} | {delta_str:>8}")

        os.unlink(lerp_path)
        os.unlink(slerp_path)

Production Engineering Notes

:::tip SLERP is computationally lightweight

Unlike TIES or DARE, SLERP requires no hyperparameter for sparsification - just the blend parameter $t$. It's also very fast: the per-tensor computation is O(n) where n is the number of parameters. Merging a 7B model with SLERP takes about the same wall time as LERP.

:::

:::note Use SLERP for "flavor" merges, TIES for "capability" merges

SLERP excels at blending two models that are close in capability but differ in "style" - for example, two instruction-tuned variants of the same base model. For merging genuinely different capabilities (code vs math vs multilingual), TIES/DARE are more appropriate.

:::

Common Mistakes

:::danger Don't use SLERP to merge more than 2 models directly

SLERP is fundamentally a two-point operation. Sequential SLERP (merging pairs iteratively) is order-dependent and often performs worse than TIES for 3+ models. If you're merging 3+ models, use TIES or DARE+TIES.

:::

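:::warning Watch for numerical instability when vectors are nearly collinear

When $\Omega \approx 0$ (vectors almost parallel), the denominator $\sin(\Omega)$ approaches zero, causing numerical instability; the same happens when $\Omega \approx \pi$ (vectors almost antiparallel). Always implement the fallback to LERP for these cases. A threshold around $10^{-8}$ radians guards against an exactly zero angle. Because arccos loses precision near 1 (in float32 the smallest nonzero angle it can return is roughly $3 \times 10^{-4}$ radians), some implementations instead fall back whenever the cosine similarity itself is very close to 1. Most production SLERP implementations (including MergeKit) handle this, but if you're implementing from scratch, don't forget it.

:::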
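Why a LERP fallback is a safe substitute at small angles can be seen numerically: as $\Omega \to 0$, the SLERP coefficients converge to the LERP weights $(1-t)$ and $t$. The sketch below uses only the standard library, and `slerp_coeffs` is an illustrative helper.

```python
import math

def slerp_coeffs(omega: float, t: float) -> tuple[float, float]:
    """SLERP weights for v1 and v2 at angle omega (radians); omega must be nonzero."""
    s = math.sin(omega)
    return math.sin((1 - t) * omega) / s, math.sin(t * omega) / s

t = 0.3
for omega in (1.0, 0.1, 1e-3, 1e-6):
    c1, c2 = slerp_coeffs(omega, t)
    print(f"omega={omega:g}: c1={c1:.6f}, c2={c2:.6f}  (LERP weights: {1 - t:.6f}, {t:.6f})")

# At exactly omega = 0 the formula divides by zero, which is what the fallback guards against.
try:
    slerp_coeffs(0.0, t)
except ZeroDivisionError:
    print("omega=0: division by zero, fall back to LERP")
```

In other words, the fallback does not change the result in any meaningful way. It only avoids the division when the angle is degenerate.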

Interview Q&A

Q: What is the geometric intuition behind SLERP vs LERP for model merging?

A: LERP travels the straight-line chord through parameter space between two models, which passes through regions of lower norm than either endpoint - like tunneling through a mountain. SLERP travels along the arc of the sphere defined by the two weight vectors, maintaining the norm throughout the interpolation. For model weights, the norm represents something about the "scale" or "strength" of the model's representations. By staying on the sphere, SLERP produces intermediate models that aren't "diluted" versions of either source - they represent a genuine blend that stays on the manifold of well-functioning model weights.

Q: When would SLERP give meaningfully better results than LERP?

A: SLERP differs most from LERP when the angle between the two model weight vectors is large - when the models have diverged significantly in parameter space due to training on very different tasks or for many steps. You can measure this: compute the average cosine similarity between each pair of corresponding weight tensors. If the average angle is less than 5°, LERP and SLERP are nearly identical. If the average angle exceeds 15°, SLERP may provide 1-3% better performance. For most fine-tuned models from the same base, the angle is moderate (5-15°), and the choice between LERP and SLERP matters but is not critical.

Q: What is the t parameter in SLERP and how do you tune it?

A: The parameter $t \in [0, 1]$ controls the blend ratio: $t=0$ gives model A, $t=1$ gives model B, and intermediate values travel along the arc between them. The optimal $t$ is rarely 0.5 - it depends on the relative capabilities of the source models and the target task distribution. Always sweep over candidate values (e.g., 0.3, 0.4, 0.5, 0.6, 0.7) on a held-out evaluation set covering both tasks. If one model is clearly superior on the primary task, the optimal $t$ typically leans toward that model (e.g., $t$ of 0.3-0.4 rather than 0.5 when model A is the stronger one).

Q: Why can't SLERP be applied directly to 3 or more models?

A: SLERP computes interpolation along the great circle arc between exactly two vectors, which requires computing the angle between them - a well-defined quantity for two vectors. With three or more vectors, there is no unique great circle connecting all of them; the geometry of multi-point interpolation on a sphere is not straightforward. Sequential application of SLERP (merge A and B, then merge the result with C) is possible but order-dependent - the result depends on which pair you merge first. For multi-model merging, TIES provides a principled majority-vote approach that handles 3+ models simultaneously without the order-dependence problem.

Q: What is per-layer SLERP and when would you use it?

A: Per-layer SLERP applies different blend values $t$ to different groups of layers instead of using a single global $t$. This is motivated by the observation that different layer types have different roles: early layers encode basic syntax and common concepts (should blend conservatively), while later layers encode task-specific behavior (can blend more aggressively). Embedding and LM head layers are particularly sensitive to changes. Per-layer SLERP lets you, for example, set $t=0.3$ for embedding layers, $t=0.5$ for attention layers, and $t=0.6$ for MLP layers - preserving the more fundamental representations of model A while allowing more of model B's task-specific character in the MLP layers.
