
Frankenmodels and Limitations of Model Merging

When the Community Pushed Beyond Weight Averaging

The model merging community on HuggingFace had a problem with a name. Practitioners who were assembling models by mixing not just weights but entire layers - stacking transformer blocks from one model between the blocks of another - needed something to call what they were building. These weren't averaged models. They were chimeras: models assembled from parts that were never trained together, held together by the shared transformer architecture, not by shared training.

They called them frankenmodels. And some of them worked astonishingly well.

The most notable example was Goliath-120B, released by alpindale in late 2023: a roughly 120B-parameter model assembled by interleaving blocks of layers from two Llama-2-70B fine-tunes (Xwin and Euryale). Not averaging their weights - literally stacking runs of one model's layers between runs of the other's, creating a model considerably deeper (137 layers versus 80) and larger than either source. Despite receiving no additional training, it produced coherent output and was reported to compare favorably with its source models.

This lesson covers the architecturally creative side of model merging - frankenmodels, layer grafting, and depth upscaling - and then confronts the honest limits of everything covered in this module.


Frankenmodels - Layer-Level Assembly

What Distinguishes Frankenmodels from Weight Merging

Standard model merging (TIES, DARE, SLERP) operates at the level of individual weight tensors: each parameter in the merged model is some combination of the corresponding parameters from the source models.

Frankenmodels operate at the level of entire layers: a layer in the merged model may be taken entirely from one source model, with no blending at all. The architecture is the same (transformer blocks), but the layer selection is creative.

Standard merge (e.g., TIES):
Layer 5 of merged model = f(Layer 5 from model A, Layer 5 from model B)

Frankenmodel:
Layer 5 of assembled model = Layer 5 from model A (unmodified)
Layer 6 of assembled model = Layer 6 from model B (unmodified)

This distinction matters: in a frankenmodel, each layer retains exactly the weights and computation of its source model. The only thing the builder controls is which layers go where.
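The contrast can be made concrete with a toy sketch. The "models" here are just nested lists of numbers, and the function names `average_merge` and `passthrough_assemble` are made up for illustration; neither is a MergeKit API.

```python
# Toy "models": each model is a list of layers, each layer a list of weights.
Layer = list[float]
Model = list[Layer]


def average_merge(a: Model, b: Model, t: float = 0.5) -> Model:
    """Weight-level merge: blend corresponding parameters by linear interpolation."""
    if len(a) != len(b):
        raise ValueError("weight merging needs the same number of layers")
    return [
        [(1 - t) * wa + t * wb for wa, wb in zip(layer_a, layer_b)]
        for layer_a, layer_b in zip(a, b)
    ]


def passthrough_assemble(plan: list[tuple[Model, int]]) -> Model:
    """Layer-level assembly: copy whole layers, unmodified, in the given order."""
    return [list(model[idx]) for model, idx in plan]


model_a: Model = [[1.0, 1.0], [2.0, 2.0]]
model_b: Model = [[3.0, 3.0], [4.0, 4.0]]

merged = average_merge(model_a, model_b)
franken = passthrough_assemble(
    [(model_a, 0), (model_b, 0), (model_a, 1), (model_b, 1)]
)

print(merged)   # [[2.0, 2.0], [3.0, 3.0]]  same depth, blended weights
print(franken)  # [[1.0, 1.0], [3.0, 3.0], [2.0, 2.0], [4.0, 4.0]]  deeper, no blending
```

The merged model keeps the source depth and every weight is new; the assembled model is deeper and every weight is an exact copy of a source weight.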

Why Do Frankenmodels Work At All?

The surprising finding is that transformer layers are somewhat modular. A layer from one fine-tune can be "plugged in" to the layer stack of another fine-tune and the resulting model doesn't immediately collapse into garbage - it produces coherent output, often with characteristics of both source models.

The explanation involves the structure of residual connections in transformers. In simplified form (ignoring layer normalization and the fact that the MLP reads the post-attention state), each transformer layer makes an additive contribution to the residual stream:

$$h_{l+1} = h_l + \text{Attention}(h_l) + \text{MLP}(h_l)$$

If each layer's contribution is relatively small and incremental, then replacing one layer with a compatible (same architecture, similar distribution) layer from another model is a modest perturbation - the residual stream adapts within a few layers, and the model remains functional.
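One way to make the "modest perturbation" argument concrete, as a sketch under the simplified residual form above. The symbols $f_l$ (the original layer's contribution), $g_l$ (the replacement layer's contribution), and $\varepsilon$ are introduced here for illustration:

$$
h_{l+1} = h_l + f_l(h_l), \qquad \tilde{h}_{l+1} = h_l + g_l(h_l)
\quad\Longrightarrow\quad
\lVert \tilde{h}_{l+1} - h_{l+1} \rVert = \lVert g_l(h_l) - f_l(h_l) \rVert \le \varepsilon
$$

If the two layers' contributions agree to within $\varepsilon$ on typical hidden states, the swap shifts the residual stream by at most $\varepsilon$ at that depth, so later layers see a slightly shifted input rather than a foreign one. The bound says nothing about how later layers amplify that shift, which is why this is an intuition and not a guarantee.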

This modularity breaks down when:

  • Source models have very different value ranges for their hidden states
  • Layer normalization parameters differ significantly between sources
  • The layers being mixed are from the early or late parts of the model (most sensitive)

Layer Grafting - Selective Layer Replacement

Layer grafting is the surgical variant: replace specific layers of a model with layers from another model, leaving the rest unchanged. The goal is to transplant specific capabilities.

The Intuition

Research using probing classifiers suggests that different layers of transformer models tend to encode different types of information. The boundaries are approximate and vary by model:

  • Early layers (0-7 in a 32-layer model): Syntax, morphology, basic token relationships
  • Middle layers (8-23): Semantics, factual knowledge, reasoning patterns
  • Late layers (24-31): Task-specific behavior, output format, calibration

If you can identify which layer range encodes the capability you want to transplant, you can graft those layers without disturbing the rest of the model.

MergeKit Implementation

# Layer grafting: replace middle layers of model A with layers from model B
# Hypothesis: middle layers of model B encode better reasoning
# NOTE: passthrough copies tensors unchanged, so model B must share model A's
# architecture and tensor shapes (ideally the same base checkpoint). The donor
# name below is a placeholder for a code fine-tune of Llama-3-8B.
merge_method: passthrough

slices:
  # Early layers from model A (keep base model's fundamental representations)
  - sources:
      - model: meta-llama/Meta-Llama-3-8B-Instruct
        layer_range: [0, 10]  # Layers 0-9 from instruct model

  # Middle layers from model B (transplant code model's reasoning layers)
  - sources:
      - model: your-org/llama-3-8b-code-finetune  # placeholder, not a real repo
        layer_range: [10, 22]  # Layers 10-21 from code model

  # Late layers from model A (keep instruct model's output behavior)
  - sources:
      - model: meta-llama/Meta-Llama-3-8B-Instruct
        layer_range: [22, 32]  # Layers 22-31 from instruct model

This creates a 32-layer model where the middle 12 layers come from the code model, while the framing layers come from the instruct model. The code model's "reasoning" layers are grafted into the instruct model's stack. The graft only loads if both models have identical block shapes: a donor from a different family (different MLP width or number of key-value heads, for example) produces tensors that do not fit the base model's config.
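A passthrough graft copies tensors unchanged, so the donor's block shapes have to match the base model's. Below is a minimal pre-flight check, assuming each model's `config.json` has been loaded into a dict. The field names follow the Hugging Face Llama-style config; the numeric values and the helper name `graft_mismatches` are illustrative assumptions.

```python
# Config fields that determine tensor shapes inside a transformer block.
SHAPE_KEYS = (
    "hidden_size",
    "intermediate_size",
    "num_attention_heads",
    "num_key_value_heads",
)


def graft_mismatches(base_cfg: dict, donor_cfg: dict) -> list[str]:
    """Return the shape-defining fields on which the two configs disagree."""
    return [k for k in SHAPE_KEYS if base_cfg.get(k) != donor_cfg.get(k)]


# Illustrative values only; read the real ones from each model's config.json.
base_cfg = {
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}
same_family_cfg = dict(base_cfg)  # e.g. another fine-tune of the same base
other_family_cfg = {
    "hidden_size": 4096,
    "intermediate_size": 11008,
    "num_attention_heads": 32,
    "num_key_value_heads": 32,
}

print(graft_mismatches(base_cfg, same_family_cfg))   # []
print(graft_mismatches(base_cfg, other_family_cfg))  # ['intermediate_size', 'num_key_value_heads']
```

An empty list is necessary but not sufficient: matching shapes say nothing about whether the two models' hidden-state distributions are close enough for the graft to behave well.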

When Layer Grafting Works

Which layer range helps is an empirical question, so it is worth sweeping candidate ranges and scoring each graft:

import os
import subprocess
import tempfile

import yaml  # PyYAML


def evaluate_layer_graft_candidates(
    base_model_path: str,
    donor_model_path: str,
    layer_ranges_to_test: list[tuple[int, int]],
    evaluate_fn,
    total_layers: int = 32,
) -> list[dict]:
    """
    Test which layer range from the donor model improves a specific task.

    Creates a series of graft configurations and evaluates each.
    evaluate_fn takes the path of a merged model directory and returns a
    float score (higher is better). Both models must share an architecture.
    """
    results = []

    for start, end in layer_ranges_to_test:
        if not 0 <= start < end <= total_layers:
            raise ValueError(f"invalid layer range [{start}, {end})")

        # Build the MergeKit config: base layers, donor layers, base layers.
        # The outer slices are skipped when the graft touches either end.
        slices = []
        if start > 0:
            slices.append({"sources": [{"model": base_model_path, "layer_range": [0, start]}]})
        slices.append({"sources": [{"model": donor_model_path, "layer_range": [start, end]}]})
        if end < total_layers:
            slices.append({"sources": [{"model": base_model_path, "layer_range": [end, total_layers]}]})
        config = {"merge_method": "passthrough", "slices": slices}

        with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
            yaml.dump(config, f)
            config_path = f.name

        try:
            # The merged weights live in a temporary directory that is deleted
            # after scoring, so each candidate only needs disk space briefly.
            with tempfile.TemporaryDirectory() as model_dir:
                subprocess.run(
                    ["mergekit-yaml", config_path, model_dir, "--copy-tokenizer"],
                    check=True, capture_output=True,
                )
                score = evaluate_fn(model_dir)
        finally:
            # Remove the config file even if the merge or evaluation fails.
            os.unlink(config_path)

        results.append({"layer_range": (start, end), "score": score})
        print(f"  Layers [{start}, {end}) grafted: score={score:.4f}")

    results.sort(key=lambda x: -x["score"])
    if results:
        print(f"\nBest graft: layers {results[0]['layer_range']} | score={results[0]['score']:.4f}")
    return results
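The sweep needs a list of candidate ranges for its `layer_ranges_to_test` argument. One simple way to produce them is a sliding window over the layer stack; the helper name `sliding_ranges` and the window settings are assumptions for illustration.

```python
def sliding_ranges(total_layers: int, width: int, stride: int) -> list[tuple[int, int]]:
    """Half-open [start, end) windows of `width` layers, stepping by `stride`."""
    if width <= 0 or stride <= 0 or width > total_layers:
        raise ValueError("need 0 < width <= total_layers and stride > 0")
    return [(s, s + width) for s in range(0, total_layers - width + 1, stride)]


candidates = sliding_ranges(total_layers=32, width=8, stride=8)
print(candidates)  # [(0, 8), (8, 16), (16, 24), (24, 32)]
```

A smaller stride gives overlapping windows and a finer picture of where the useful layers sit, at the cost of one full merge and evaluation per window.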

Depth Upscaling - Building Bigger Models From Smaller Ones

The most audacious use of layer manipulation is depth upscaling: creating a larger model by inserting additional layers into a smaller base model. The new layers are typically copied from elsewhere in the same model, then further fine-tuned.

Why Copied Layers Are a Good Initialization

When you copy a layer and insert it, the residual connections usually keep the model functional immediately after insertion. The copied layer computes a residual contribution similar to the original, so downstream layers see a shifted but familiar input. Quality typically dips, but the model doesn't collapse.

This is a much better initialization for new layers than random initialization (which injects noise into the residual stream and takes many training steps to recover from) or zero initialization (which leaves the model's function unchanged but forces the new layer to learn everything from scratch).

Solar 10.7B - The Canonical Example

Upstage AI published "SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling" (2023) describing exactly this approach.

Their procedure (Depth Up-Scaling, DUS):

  1. Start with a 32-layer base model (the paper uses the Llama 2 architecture initialized with Mistral 7B weights)
  2. Create two copies of the model
  3. Remove the last N = 8 layers from copy 1 (keep layers 0-23, drop 24-31)
  4. Remove the first N = 8 layers from copy 2 (keep layers 8-31, drop 0-7)
  5. Concatenate: copy 1 (layers 0-23) + copy 2 (layers 8-31)
  6. Result: a 48-layer model (32 + 32 - 16 removed) with 10.7B parameters
  7. Continue pretraining the resulting model, then fine-tune it on instruction data
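The layer bookkeeping in this procedure is easy to get off by one, so here is a small sketch that builds the stacking plan explicitly. It assumes 0-indexed layers, a 32-layer base, and N = 8; the function names `dus_layer_plan` and `seams` are made up for this example.

```python
def dus_layer_plan(n_layers: int = 32, n_removed: int = 8) -> list[int]:
    """Source-layer index for each layer of the depth-upscaled model."""
    if not 0 < n_removed < n_layers:
        raise ValueError("n_removed must be between 1 and n_layers - 1")
    first_copy = list(range(0, n_layers - n_removed))  # drop the last N layers
    second_copy = list(range(n_removed, n_layers))     # drop the first N layers
    return first_copy + second_copy


def seams(plan: list[int]) -> list[int]:
    """Positions in the new stack whose input comes from a non-adjacent source layer."""
    return [i for i in range(1, len(plan)) if plan[i] != plan[i - 1] + 1]


plan = dus_layer_plan()
print(len(plan))            # 48
print(plan[23], plan[24])   # 23 8
print(seams(plan))          # [24]
```

The single seam at position 24 is where source layer 23 feeds source layer 8. Source layers 8-23 appear twice in the new stack, which is where the extra depth comes from.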

The key insight: the concatenated model is a usable starting point because the residual stream stays in a familiar range. The early layers from copy 1 process input and build representations; the later layers from copy 2 (which are copies of the same pretrained weights) can continue processing those representations coherently. Performance does drop at first, and the subsequent training is what recovers it.

Solar 10.7B outperformed Llama-2-13B on multiple benchmarks, demonstrating that depth upscaling was a legitimate path to creating better models at lower cost than training from scratch.
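The 10.7B figure can be checked with back-of-the-envelope arithmetic. The sketch below counts parameters for a Llama-style decoder, assuming Mistral-7B-style dimensions (the SOLAR paper reports initializing from Mistral 7B weights). The dimension values are assumptions taken from that family of configs, and the function `decoder_params` is illustrative.

```python
def decoder_params(n_layers: int, hidden: int, intermediate: int,
                   n_heads: int, n_kv_heads: int, vocab: int) -> int:
    """Approximate parameter count of a Llama-style decoder with untied embeddings."""
    head_dim = hidden // n_heads
    attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)  # q, o + k, v
    mlp = 3 * hidden * intermediate                                    # gate, up, down
    norms = 2 * hidden                                                 # two RMSNorms per block
    per_layer = attn + mlp + norms
    embeddings = 2 * vocab * hidden                                    # input embeddings + output head
    return n_layers * per_layer + embeddings + hidden                  # + final norm


# Assumed Mistral-7B-style dimensions.
dims = dict(hidden=4096, intermediate=14336, n_heads=32, n_kv_heads=8, vocab=32000)

base = decoder_params(n_layers=32, **dims)
upscaled = decoder_params(n_layers=48, **dims)
print(f"{base / 1e9:.2f}B -> {upscaled / 1e9:.2f}B")  # 7.24B -> 10.73B
```

Going from 32 to 48 layers adds 50% more blocks but not 50% more parameters, because the embeddings and output head are counted once.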

MergeKit Implementation of Depth Upscaling

# depth_upscale.yaml
# Solar-style depth upscaling: 32 layers -> 48 layers (7B -> roughly 10.7B)
# SOLAR itself was initialized from Mistral 7B weights, so that base is used here.
merge_method: passthrough

slices:
  # First copy: early layers (keep 0-23, drop 24-31)
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 24]

  # Second copy: late layers (keep 8-31, drop 0-7)
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [8, 32]

After merging, the model requires further training to learn how to bridge the "seam" between the two copies. Without it, performance is degraded because layer 23 from copy 1 feeds into layer 8 from copy 2 - these layers weren't trained to interact this way.


Goliath-120B - The Interleaved Approach

alpindale's Goliath-120B took a different approach: instead of concatenating two copies of one model, it interleaved blocks of layers from two different Llama-2-70B fine-tunes:

  • Model A: Xwin-LM-70B-V0.1 (80 layers)
  • Model B: Euryale-1.3-L2-70B (80 layers, different fine-tune)
  • Result: a 137-layer model built from alternating, overlapping slices of 15-16 layers each: A[0-15], B[8-23], A[17-31], B[25-39], ...

The result has roughly 118B parameters, rounded up to 120B in the name. It is well short of a full 2× count because only 137 of the 160 available layers are used and the embeddings are included once. Despite no fine-tuning after merging, it produced coherent output and scored competitively on benchmarks.

The interleaving approach has a different character than concatenation: rather than "early layers from A, late layers from B," you get alternating blocks from both models throughout the stack. Every 15-16 layers, the residual stream crosses into a different fine-tuning trajectory.
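The model card for Goliath-120B lists overlapping slices rather than strict one-by-one alternation. Assuming that layout (the ranges below follow the card as best recalled, so treat the exact numbers as illustrative), the layer count can be tallied directly. The helper names are made up for this sketch.

```python
# (source, start, end) with half-open layer ranges.
# Illustrative reconstruction of the reported slice layout, not an official config.
GOLIATH_SLICES = [
    ("xwin", 0, 16), ("euryale", 8, 24), ("xwin", 17, 32), ("euryale", 25, 40),
    ("xwin", 33, 48), ("euryale", 41, 56), ("xwin", 49, 64), ("euryale", 57, 72),
    ("xwin", 65, 80),
]


def total_layers(slices: list[tuple[str, int, int]]) -> int:
    """Depth of the assembled stack."""
    return sum(end - start for _, start, end in slices)


def layers_per_source(slices: list[tuple[str, int, int]]) -> dict[str, int]:
    """How many layers each source model contributes."""
    counts: dict[str, int] = {}
    for source, start, end in slices:
        counts[source] = counts.get(source, 0) + (end - start)
    return counts


print(total_layers(GOLIATH_SLICES))       # 137
print(layers_per_source(GOLIATH_SLICES))  # {'xwin': 76, 'euryale': 61}
```

Each slice starts a few layers before the previous one ended, so some depth ranges are processed twice, once by each fine-tune.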


The Fundamental Limits of Model Merging

Now for the honest accounting. Model merging is powerful, but practitioners consistently discover limits that are worth understanding before committing to a merging-based strategy.

Limit 1: New Knowledge Cannot Be Created

Merging can only recombine capabilities that already exist in the source models. If model A doesn't know about your proprietary product's API and model B doesn't either, no merging technique will produce a model that does. Fine-tuning on the actual data is irreplaceable for injecting new factual knowledge.

This sounds obvious but practitioners frequently test merges hoping to "combine" knowledge domains - only to find that neither source model actually had the specialized knowledge they assumed.

Limit 2: Weight Disentanglement Is Incomplete

The task arithmetic framework assumes that capabilities are linearly separable in weight space - that the "code capability" lives in a separable region of the parameter tensor that can be added and subtracted cleanly. This is only approximately true.

In reality, capabilities are entangled. The weights that encode code generation also contribute to general language modeling, to the model's sense of format and structure, to its willingness to follow instructions. When you add a code task vector, you're not just adding code capability - you're adding everything that was learned during code fine-tuning, including changes to the model's baseline behavior.

This entanglement is why merged models sometimes degrade in unexpected ways: you add a code model's task vector and the merged model becomes more verbose or less willing to decline harmful requests, because those behaviors were also encoded in the code model's weight delta.
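A toy version of task arithmetic shows the entanglement directly. The weight names (`mlp`, `style`) and values are invented for illustration; the point is that the task vector carries every change made during fine-tuning, not only the one you wanted.

```python
Weights = dict[str, list[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector = fine-tuned weights minus base weights, per tensor."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base: Weights, vectors: list[Weights], scale: float = 1.0) -> Weights:
    """Merged weights = base + scale * sum of task vectors."""
    merged = {k: list(v) for k, v in base.items()}
    for tv in vectors:
        for k, delta in tv.items():
            merged[k] = [m + scale * d for m, d in zip(merged[k], delta)]
    return merged


base = {"mlp": [1.0, 1.0], "style": [0.0]}
code_ft = {"mlp": [3.0, 1.0], "style": [2.0]}  # code fine-tuning also moved "style"

tv = task_vector(code_ft, base)
merged = apply_task_vectors(base, [tv], scale=0.5)

print(tv)      # {'mlp': [2.0, 0.0], 'style': [2.0]}
print(merged)  # {'mlp': [2.0, 1.0], 'style': [1.0]}
```

There is no way to ask for the `mlp` change without the `style` change: the vector is a single delta, and scaling it scales both.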

Limit 3: Merging Degrades With Increasing Divergence

The loss basin argument that makes merging work assumes both models are near each other in weight space. As fine-tuning becomes more extensive - more epochs, higher learning rate, larger dataset - the fine-tuned model moves further from the base. At some point, it has effectively left the original loss basin.

When this happens, simple linear averaging of the weights no longer gives you a low-loss model. TIES and DARE can partially compensate by being selective about which updates to include, but they can't overcome a fundamental loss landscape mismatch.

In practice, this means:

  • LoRA fine-tunes (which make small changes by design) merge better than full fine-tunes
  • Short fine-tuning runs merge better than long ones
  • Low learning rate fine-tunes merge better than high learning rate ones
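A rough way to compare candidates before merging is to measure how far each fine-tune has moved from the shared base. The sketch below uses relative L2 distance on flattened weights. This metric is a heuristic assumption, not an established threshold for merge quality, and `relative_drift` is a made-up name.

```python
import math


def relative_drift(base: list[float], finetuned: list[float]) -> float:
    """||finetuned - base|| / ||base||: how far fine-tuning moved the weights."""
    if len(base) != len(finetuned):
        raise ValueError("weight vectors must have the same length")
    diff = math.sqrt(sum((f - b) ** 2 for f, b in zip(finetuned, base)))
    norm = math.sqrt(sum(b ** 2 for b in base))
    if norm == 0.0:
        raise ValueError("base weights have zero norm")
    return diff / norm


base = [3.0, 4.0]
print(relative_drift(base, [3.0, 4.0]))  # 0.0  (untouched)
print(relative_drift(base, [6.0, 8.0]))  # 1.0  (moved by a full base-norm)
```

Comparing drift across candidate fine-tunes of the same base gives a quick ordering of which ones are likely to merge cleanly, to be confirmed by actual evaluation.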

Limit 4: Evaluation Is Difficult and Surprising

Merged models often have uneven performance profiles. As a hypothetical example, suppose one source model scores 75 on MMLU and the other scores 72 on HumanEval: the merge might score 77 on MMLU but drop to 65 on HumanEval. Or it might excel on the standard benchmarks but fail on specific capability evaluations that were never part of the benchmarks.

The open-source community has documented many cases where merged models rank highly on the Open LLM Leaderboard but perform poorly in production on real user queries. Benchmark performance is a noisy signal for merged model quality.
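Because gains on one benchmark can hide losses on another, it helps to compare the merge against the best source score per benchmark instead of looking at an average. A minimal sketch, using the hypothetical scores from the example above; the function name and the one-point tolerance are assumptions.

```python
def regression_report(source_best: dict[str, float],
                      merged: dict[str, float],
                      tolerance: float = 1.0) -> dict[str, float]:
    """Benchmarks where the merge trails the best source by more than `tolerance` points."""
    return {
        name: merged[name] - best
        for name, best in source_best.items()
        if name in merged and best - merged[name] > tolerance
    }


source_best = {"MMLU": 75.0, "HumanEval": 72.0}
merged_scores = {"MMLU": 77.0, "HumanEval": 65.0}

print(regression_report(source_best, merged_scores))  # {'HumanEval': -7.0}
```

The same check applies to private evaluation sets, which are usually the more informative signal for production use.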

Limit 5: Reproducibility and Governance

Merged models create model provenance challenges:

  • License compliance: If model A has license X and model B has license Y, what license governs the merged model? This is an unresolved legal question.
  • Reproducibility: The same YAML configuration run on two different hardware configurations may produce slightly different results due to floating-point non-determinism.
  • Attribution: A merged model's behaviors come from multiple sources that are difficult to attribute. When something goes wrong in production, tracing the source of the problematic behavior is hard.
  • Safety evaluation: Models with clean safety fine-tuning may have their alignment degraded when merged with less aligned models. Each merge requires fresh safety evaluation.
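Some of these provenance problems become easier to manage if every merge is stored with a small manifest. The sketch below is one possible shape for such a record, not a standard format: the field names, the repository names, and the license strings are all hypothetical.

```python
import hashlib
import json


def merge_manifest(config_text: str, sources: dict[str, str]) -> dict:
    """Record what went into a merge: a hash of the config plus each source's license."""
    return {
        "config_sha256": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "sources": dict(sorted(sources.items())),  # model name -> license identifier
        "needs_safety_eval": True,                 # each merge gets a fresh safety evaluation
    }


config_text = "merge_method: passthrough\n"
manifest = merge_manifest(
    config_text,
    {"org-a/model-a": "license-x", "org-b/model-b": "license-y"},  # hypothetical names
)
print(json.dumps(manifest, indent=2))
```

The hash identifies the recipe, not the resulting weights: two runs with the same hash can still differ slightly because of floating-point non-determinism.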

Future Directions - Where Model Merging Is Heading

Learned Merge Coefficients

Current methods use hand-tuned or heuristic coefficients (TIES and DARE density, per-model weights, SLERP t). The natural next step is learning the merge coefficients from data:

Given a held-out dataset for each task, optimize the per-layer, per-parameter merge coefficients to maximize performance. This is essentially meta-learning for model merging. Early work (e.g., "Model Merging by Uncertainty-Based Gradient Matching," 2023) has shown promise.
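The simplest form of this idea is a search over a single coefficient. The sketch below grid-searches one scalar against a held-out loss; it is a toy stand-in, since published methods optimize many coefficients and typically use gradients. The function name and the quadratic loss are assumptions for illustration.

```python
from typing import Callable


def best_coefficient(base: float, delta: float,
                     heldout_loss: Callable[[float], float],
                     grid_size: int = 10) -> float:
    """Grid-search lambda in [0, 1] for merged = base + lambda * delta."""
    candidates = [i / grid_size for i in range(grid_size + 1)]
    return min(candidates, key=lambda lam: heldout_loss(base + lam * delta))


def toy_loss(w: float) -> float:
    """Stand-in for held-out loss: minimized when the merged weight equals 1.0."""
    return (w - 1.0) ** 2


lam = best_coefficient(base=0.0, delta=2.0, heldout_loss=toy_loss)
print(lam)  # 0.5
```

With one coefficient per layer or per tensor the grid grows exponentially, which is the motivation for gradient-based or evolutionary search.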

Evolutionary Model Merging

Sakana AI's work on "Evolutionary Optimization of Model Merging Recipes" (2024) uses evolutionary algorithms to search the space of merge configurations: which models to merge, which layers to use, which coefficients to apply. The search is guided by task performance. This removes the need for human-designed merge recipes but requires significant compute for the search process.
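To give a feel for the search loop, here is a deliberately small evolutionary search over a vector of merge coefficients. It uses plain Gaussian mutation with elitism, which is a simplification and not the algorithm from the paper; the fitness function is a toy stand-in for "merge the models and score them on the task".

```python
import random
from typing import Callable


def evolve_coefficients(fitness: Callable[[list[float]], float], n_coeffs: int,
                        generations: int = 30, population: int = 16,
                        sigma: float = 0.1, seed: int = 0) -> tuple[list[float], list[float]]:
    """Search coefficient vectors in [0, 1]^n; returns (best vector, best fitness per generation)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_coeffs)] for _ in range(population)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        elite = pop[: population // 4]  # survivors are carried over unchanged
        children = [
            [min(1.0, max(0.0, c + rng.gauss(0.0, sigma))) for c in rng.choice(elite)]
            for _ in range(population - len(elite))
        ]
        pop = elite + children
    return max(pop, key=fitness), history


# Toy fitness: higher is better, peaks at a made-up "ideal" recipe.
TARGET = [0.2, 0.7, 0.5]


def toy_fitness(coeffs: list[float]) -> float:
    return -sum((c - t) ** 2 for c, t in zip(coeffs, TARGET))


best, history = evolve_coefficients(toy_fitness, n_coeffs=3)
print([round(c, 2) for c in best], round(history[-1], 4))
```

In a real setting every fitness call means building and evaluating a merged model, which is where the compute cost mentioned above comes from.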

Gradient-Based Activation Matching

Rather than matching weights directly, some work proposes matching the activations produced by different models. Given an input, find the merge that makes the merged model's layer activations closest to a weighted average of the source models' activations. This is more computationally expensive but potentially more precise about what "combining capabilities" actually means.


Decision Framework for Model Merging


Common Mistakes

:::danger Treating high benchmark scores as a production green light

The Open LLM Leaderboard (MMLU, ARC, HellaSwag, TruthfulQA, Winogrande, GSM8K) measures specific skills that don't necessarily generalize to production query distributions. Many highly-ranked merged models have documented failure modes on actual user queries. Always evaluate on your own production distribution, not just public benchmarks.

:::

:::warning Frankenmodels benefit from fine-tuning after assembly

Assembling a frankenmodel by concatenating or interleaving layers creates a model where adjacent layers weren't trained to work together. The model will often produce coherent output immediately (residual connections are robust), but it is unlikely to be optimal. Fine-tuning after assembly, even on a small dataset of 1-2K examples, can noticeably improve performance by teaching the model to bridge the seams between source models.

:::

:::danger Don't merge with no evaluation loop

The outcome of a merge is hard to predict in advance. The same algorithm with different hyperparameters can produce results ranging from excellent to catastrophic. A merge pipeline without evaluation - at least running a few benchmark tasks - is flying blind. Always establish evaluation infrastructure before starting merge experiments.

:::


Interview Q&A

Q: What is a frankenmodel and how does it differ from weight-averaged model merging?

A: A frankenmodel is assembled by copying complete transformer layers from different source models and concatenating or interleaving them, rather than blending individual weight values. In standard weight merging (TIES, DARE, SLERP), each parameter in the merged model is some combination of the corresponding parameters from all sources. In a frankenmodel, each layer comes entirely from one source model - no averaging occurs. The result is often a model with more total layers (and parameters) than either source. Frankenmodels work because transformer layers communicate through residual connections, which are relatively robust to having "foreign" layers inserted as long as the architecture is compatible.

Q: What is depth upscaling and what is the Solar 10.7B example?

A: Depth upscaling creates a larger model by duplicating and concatenating layers from a smaller model. Solar 10.7B (Upstage AI, 2023) started with two copies of a 32-layer base model (the Llama 2 architecture initialized with Mistral 7B weights). The first copy had its last 8 layers removed; the second had its first 8 layers removed. Concatenating the two produced a 48-layer model with approximately 10.7B parameters - significantly larger than the 7B source. After continued pretraining and instruction tuning, Solar 10.7B outperformed Llama-2-13B on several benchmarks. The approach works because copied layers provide a much better initialization than random or zero initialization for the new layers.

Q: What is the weight disentanglement problem and why does it limit model merging?

A: Weight disentanglement refers to the assumption in task arithmetic that different capabilities are linearly separable in weight space - that adding a code task vector affects only "coding" parameters. In reality, capabilities are entangled: the same weights participate in multiple behaviors simultaneously. When you add a code task vector, you're also adding the code model's changes to format behavior, verbosity, instruction-following style, and safety behavior - everything that was changed during code fine-tuning. This entanglement causes merged models to change in unexpected ways, and it limits how precisely you can compose capabilities via task arithmetic.

Q: What are the three main failure modes that cause model merges to degrade?

A: First, different base checkpoints: models from different pre-training runs are in different loss basins, so weight averaging produces models outside any low-loss region. Second, extensive fine-tuning: models that have been trained for many epochs with high learning rates have moved far from the original loss basin, so the assumption that the path between the models stays low-loss (linear mode connectivity) breaks down. Third, highly divergent task domains: even with the same base, tasks that optimize conflicting objectives (e.g., safety alignment vs. unrestricted generation) create large sign conflicts and high-magnitude interference that TIES and DARE can only partially mitigate.

Q: What is evolutionary model merging and why is it significant?

A: Evolutionary model merging, from Sakana AI (2024), uses evolutionary algorithms to search the space of merge configurations: which source models to combine, which layers to use from each, and what merge coefficients to apply. The search is guided by task performance on a held-out dataset. This removes the need for human-designed merge recipes, automating the exploration of the combinatorially large space of possible merges. It's significant because it treats model merging as an optimization problem rather than a design problem - the algorithm discovers merge recipes that human intuition would be unlikely to find. The tradeoff is compute cost: the evolutionary search requires evaluating many candidate merges.
