A rigorous benchmark comparison: perplexity, throughput, recall tasks, in-context learning, and the fundamental trade-off between compressed state and full context access.

Mamba vs Transformer - When Each Wins

Opening Scenario: The Model Selection Meeting

Your team is building a code analysis tool that needs to process large codebases - repositories with 200K–500K tokens of context. You have a budget meeting with engineering leadership. Three engineers have strong opinions:

Engineer A says: "We should use Llama 3 70B. Best language understanding benchmarks. We can chunk the codebase into 8K windows and stitch the analysis together."

Engineer B says: "Mamba-7B processes the full codebase in one pass, uses 200MB of VRAM for state regardless of codebase size, and is 20x faster at inference. Yes, it's less accurate on some benchmarks, but the architecture actually fits the problem."

Engineer C says: "Jamba-1.5 Mini is a hybrid. We get most of Mamba's efficiency plus better recall for the precise look-up tasks code analysis requires."

Who is right? The answer depends on what "code analysis" actually means. For understanding architectural patterns across a large codebase (long-range structural reasoning), Mamba's efficient full-context processing is a genuine advantage. For tasks like "find all usages of function X" (precise retrieval), the transformer's uncompressed KV cache wins. For a mixed workload, the hybrid is likely the right call.

Making this decision correctly requires understanding the real benchmark data - not marketing claims, not intuition, but what actually happens when you run both architectures on the same tasks. This lesson gives you that analysis.

The Benchmark Landscape

Perplexity vs Compute: The Main Result

The headline result from the Mamba paper (Gu & Dao, 2023): Mamba-3B outperforms transformers of the same size and matches transformers twice its size, in both pretraining perplexity and downstream evaluation. On standard language modeling corpora (The Pile, SlimPajama), Mamba models match or beat transformers at equal parameter count:

Model	Params	Pile Perplexity	Training FLOPs
Pythia-160M	160M	29.34	1×
Mamba-130M	130M	28.01	0.9×
Pythia-410M	410M	22.54	1×
Mamba-370M	370M	21.49	0.95×
Pythia-1.4B	1.4B	18.44	1×
Mamba-1.4B	1.4B	17.53	1×
Pythia-2.8B	2.8B	16.23	1×
Mamba-2.8B	2.8B	15.65	1×

Across all scales tested, Mamba achieves lower perplexity than Pythia (a widely used, well-tuned transformer baseline) at equal or smaller parameter count. The gap is consistent, roughly 3.5-5% in the table above, which suggests that the selective SSM is a strong sequence modeling mechanism for language, at least at these scales (up to 3B parameters).

:::note Why Perplexity Is Not Enough

Perplexity measures how well a model predicts the next token on average over a corpus. It is a necessary but not sufficient measure of model quality. A model can have excellent perplexity and still fail at tasks requiring precise retrieval, long-range reasoning, or in-context learning. The rest of this lesson examines what happens when you move beyond perplexity.

:::
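To make the definition concrete, here is a minimal sketch of how perplexity follows from per-token log-probabilities. The function name `perplexity` and the two hand-made probability lists are illustrative assumptions, not outputs of any model discussed in this lesson.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical case 1: the model always puts 25% on the correct next token.
steady = [math.log(0.25)] * 8
print(f"steady model: {perplexity(steady):.2f}")

# Hypothetical case 2: 99 confident hits and one near-total miss,
# e.g. a single token that had to be retrieved exactly from the context.
mostly_right = [math.log(0.9)] * 99 + [math.log(1e-6)]
print(f"one bad miss in 100 tokens: {perplexity(mostly_right):.2f}")
```

The first case comes out at 4, the same as guessing uniformly among four options. The second case stays close to 1 despite the catastrophic miss, which is why a strong average score can coexist with failures on exact retrieval.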

Zero-Shot Task Performance

On standard zero-shot benchmarks (HellaSwag, PIQA, ARC, WinoGrande, OpenBookQA), Mamba-2.8B performs comparably to transformers of similar parameter count:

Benchmark	Mamba-2.8B	GPT-Neo-2.7B	Description
HellaSwag	71.0%	65.6%	Commonsense completion
PIQA	77.9%	74.9%	Physical intuition
ARC-E	65.5%	61.8%	Science Q&A (easy)
ARC-C	34.3%	31.1%	Science Q&A (challenge)
WinoGrande	59.7%	57.6%	Pronoun resolution
OpenBookQA	38.8%	38.0%	Open-book science

Mamba is ahead on all six benchmarks in this table, with margins ranging from under 1 point (OpenBookQA) to about 5 points (HellaSwag). These are "average" benchmarks measuring broad language understanding - both model types handle them similarly because the tasks don't heavily stress the specific strengths or weaknesses of either architecture.

Where Transformers Win: Recall and In-Context Learning

The MQAR Task: Multi-Query Associative Recall

The most revealing benchmark for understanding the architectural difference is MQAR (Multi-Query Associative Recall). The task:

Present a list of key-value pairs: A→1, B→2, C→3, ...
Present a series of queries: A→?, C→?, B→?
The model must output the correct values

This task directly measures whether the model can retrieve specific information from its context. It requires the model to remember, with precision, which value corresponds to which key - and then access that specific association when queried.
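The task format described above can be sketched as a small generator plus an exact-match scorer. This is a simplified illustration, not the official MQAR implementation: the function names, the integer key and value ranges, and the `|` separator are all assumptions made for the example.

```python
import random

def make_mqar_example(num_pairs, num_queries, seed=0):
    """Build one toy MQAR prompt and the list of expected answers."""
    rng = random.Random(seed)
    keys = rng.sample(range(1000, 2000), num_pairs)      # distinct key ids
    values = [rng.randrange(0, 100) for _ in keys]       # values to recall
    mapping = dict(zip(keys, values))
    queried = rng.sample(keys, num_queries)              # subset, shuffled
    context = " ".join(f"{k}→{v}" for k, v in mapping.items())
    queries = " ".join(f"{k}→?" for k in queried)
    targets = [mapping[k] for k in queried]
    return f"{context} | {queries}", targets

def exact_match_accuracy(predictions, targets):
    """Fraction of queries answered with exactly the stored value."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

prompt, targets = make_mqar_example(num_pairs=8, num_queries=3, seed=1)
print(prompt)
print(targets)
```

Scoring is all-or-nothing per query, so a model that remembers only the general shape of the key-value list gets no credit. Increasing `num_pairs` lengthens the sequence and adds more associations to hold at once.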

Results are striking:

Sequence Length	Transformer (2-layer, 64 heads)	Mamba (2-layer, d_state=16)
128	99.8%	24.5%
256	99.1%	12.3%
512	97.9%	8.7%
1024	95.2%	6.1%

Transformers are near-perfect at MQAR across all sequence lengths. Mamba struggles. The fundamental reason: MQAR requires exact associative retrieval - the model must look up "what value did I see paired with key A?" The transformer's KV cache preserves every key-value pair verbatim. Mamba's compressed state loses the specific pairings.

This is not a small gap - it is a qualitative capability difference. For any application requiring precise information retrieval from long contexts, transformers are the right choice.

Few-Shot / In-Context Learning

In-context learning (ICL) - the ability to learn a task from a few examples in the prompt - is one of the most powerful capabilities of large transformers. ICL relies on the model being able to:

Extract the pattern from the examples
Hold all examples simultaneously in accessible "memory"
Apply the pattern precisely to new inputs

Mamba performs well on simple ICL tasks (few-shot classification, simple translation) but underperforms on complex ICL tasks that require precise copying of patterns from earlier in the context.

A simplified experiment illustrating the difference:

# Illustrative ICL test prompts (not a full evaluation harness)
# The second prompt requires replicating an exact, unusual pattern

simple_icl_prompt = """Translate to French:
Hello → Bonjour
Cat → Chat
Dog → Chien
House → """
# Both Mamba and Transformer handle this well (common words)

complex_icl_prompt = """Apply the rule: add 'zzq' after every vowel.
Input: apple → azzqpplezzq
Input: banana → bazzqnazzqnazzq
Input: cherry → """
# Correct answer: chezzqrry ('y' is not counted as a vowel here)
# Transformer: more likely to apply the exact rule
# Mamba: may fail to replicate the exact suffix precisely

The more unusual and specific the pattern in the examples, the larger the gap between transformers and Mamba. Transformers can access examples verbatim through attention; Mamba must rely on its compressed state representation.
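Prompts of this kind are easy to get wrong by hand, so it helps to generate them, together with the exact expected answer, from the rule itself. The sketch below does that for the vowel-suffix rule. The helper names are hypothetical, and treating only `a, e, i, o, u` as vowels is an assumption of this example.

```python
VOWELS = set("aeiou")

def add_suffix_after_vowels(word, suffix="zzq"):
    """Apply the rule: insert `suffix` after every vowel in `word`."""
    return "".join(ch + suffix if ch in VOWELS else ch for ch in word)

def build_icl_prompt(examples, query, suffix="zzq"):
    """Few-shot prompt: solved examples followed by an unsolved query."""
    lines = [f"Apply the rule: add '{suffix}' after every vowel."]
    for word in examples:
        lines.append(f"Input: {word} → {add_suffix_after_vowels(word, suffix)}")
    lines.append(f"Input: {query} → ")
    return "\n".join(lines)

prompt = build_icl_prompt(["apple", "banana"], "cherry")
expected = add_suffix_after_vowels("cherry")
print(prompt)
print("expected completion:", expected)
```

Because the expected completion is computed rather than typed, a model's output can be scored by exact string match. Swapping in a different `suffix` makes the pattern arbitrarily unusual.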

Where Mamba Wins: Efficiency at Long Sequences

Throughput Comparison

The Mamba paper reports inference throughput (tokens per second) vs transformer with comparable architecture (2-layer model for clean comparison):

Sequence Length	Transformer Throughput	Mamba Throughput	Mamba Speedup
128	~550K tok/s	~480K tok/s	0.87× (transformer wins)
512	~380K tok/s	~490K tok/s	1.29×
1,024	~250K tok/s	~490K tok/s	1.96×
2,048	~130K tok/s	~490K tok/s	3.77×
4,096	~65K tok/s	~490K tok/s	7.54×
16,384	~16K tok/s	~480K tok/s	30×

Notice that Mamba's throughput is roughly constant regardless of sequence length (around 490K tokens/second in this benchmark). Transformer throughput falls sharply as sequence length increases. At 16K tokens, Mamba is 30 times faster.

In this table the crossover falls between 128 and 512 tokens: at the shortest lengths, transformers are faster because dense attention uses the GPU very efficiently. Beyond that, Mamba's recurrent structure wins, and its lead widens as sequences grow.
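The speedup column can be recomputed directly from the two throughput columns, which is a useful sanity check when you rerun such a benchmark on your own hardware. The sketch below copies the approximate figures from the table above. The dictionary layout and function names are assumptions for illustration.

```python
# Approximate throughput (tokens/s) copied from the table above:
# sequence length -> (transformer, mamba)
THROUGHPUT = {
    128: (550_000, 480_000),
    512: (380_000, 490_000),
    1_024: (250_000, 490_000),
    2_048: (130_000, 490_000),
    4_096: (65_000, 490_000),
    16_384: (16_000, 480_000),
}

def speedups(table):
    """Mamba throughput divided by transformer throughput, per length."""
    return {n: mamba / transformer for n, (transformer, mamba) in table.items()}

def first_length_where_mamba_leads(table):
    """Smallest tabulated length at which Mamba is faster, or None."""
    for n in sorted(table):
        transformer, mamba = table[n]
        if mamba > transformer:
            return n
    return None

for n, s in speedups(THROUGHPUT).items():
    print(f"{n:>6} tokens: {s:.2f}x")
print("Mamba first leads at:", first_length_where_mamba_leads(THROUGHPUT))
```

The table is coarse, so the true crossover on a given GPU lies somewhere between the last length where the transformer leads and the first where Mamba does. Measuring a few intermediate lengths is the only way to pin it down.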

Memory Comparison During Inference

The practical memory footprint at inference tells a clear story:

from typing import Optional

def compare_inference_memory(
    model_size_b: float = 7.0,   # Label only; the configs below are hard-coded for ~7B
    seq_lengths: Optional[list] = None,
):
    """
    Compare transformer KV cache vs Mamba hidden state memory.
    Approximate calculations for a 7B-scale model.
    """
    if seq_lengths is None:
        seq_lengths = [1_000, 10_000, 100_000, 500_000, 1_000_000]

    # Transformer config (approximate for 7B model, full multi-head attention;
    # grouped-query attention would shrink the KV cache proportionally)
    n_layers = 32
    n_heads = 32
    head_dim = 128
    kv_bytes_per_elem = 2  # float16

    # Mamba config (approximate for 7B Falcon Mamba)
    mamba_layers = 64
    d_inner = 8192        # 2 * d_model
    d_state = 16
    # Assumption: the recurrent state is kept in float32 (4 bytes), which
    # reproduces the ~33.6 MB figure used in this lesson; float16 halves it
    state_bytes_per_elem = 4

    # Fixed Mamba state (constant regardless of seq_len)
    mamba_state_bytes = mamba_layers * d_inner * d_state * state_bytes_per_elem
    mamba_state_mb = mamba_state_bytes / 1e6

    print(f"Mamba hidden state: {mamba_state_mb:.1f} MB (constant)")
    print(f"\n{'Seq Len':>12} | {'Transformer KV (GB)':>22} | {'Ratio (T/M)':>15}")
    print("-" * 55)

    for seq_len in seq_lengths:
        # Keys and values (factor 2) for every layer, head, and position
        kv_bytes = n_layers * 2 * n_heads * seq_len * head_dim * kv_bytes_per_elem
        kv_gb = kv_bytes / 1e9
        ratio = kv_bytes / mamba_state_bytes  # larger = more transformer overhead
        print(f"{seq_len:>12,} | {kv_gb:>22.2f} | {ratio:>14,.0f}×")

compare_inference_memory()

# Output (approx for 7B-scale models):
# Mamba hidden state: 33.6 MB (constant)
#
#      Seq Len |    Transformer KV (GB) |     Ratio (T/M)
# -------------------------------------------------------
#        1,000 |                   0.52 |             16×
#       10,000 |                   5.24 |            156×
#      100,000 |                  52.43 |          1,562×
#      500,000 |                 262.14 |          7,812×
#    1,000,000 |                 524.29 |         15,625×

At 1M tokens, a transformer's KV cache is 15,600 times larger than Mamba's hidden state. This is not a slight difference - it determines whether the application is possible at all on a single GPU.
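The arithmetic behind these figures can be written out explicitly. The symbols below are shorthand introduced here for the variables in the function above: $L$ transformer layers, $H$ heads, $d_h$ head dimension, $n$ tokens in context, and $b$ bytes per element. The formula assumes full multi-head attention. Grouped-query attention, which is not modeled here, would shrink the KV term.

```latex
\begin{aligned}
M_{\text{KV}}(n) &= 2 \cdot L \cdot H \cdot d_h \cdot n \cdot b
  = 2 \cdot 32 \cdot 32 \cdot 128 \cdot n \cdot 2 \ \text{bytes}
  = 524{,}288 \, n \ \text{bytes} \\
M_{\text{KV}}(100{,}000) &\approx 52.4 \ \text{GB} \\
M_{\text{state}} &= L_{\text{mamba}} \cdot d_{\text{inner}} \cdot d_{\text{state}} \cdot b_{\text{state}}
  \quad \text{(no dependence on } n\text{)}
\end{aligned}
```

Each additional token costs the transformer about half a megabyte of cache in this configuration, while the Mamba term contains no $n$ at all. The ratio between the two therefore grows linearly with context length.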

Domain-Specific Performance

Audio Modeling

Mamba was tested on raw audio waveforms (long sequences - a 10-second audio clip at 16kHz is 160,000 samples). On audio generation and classification tasks, Mamba outperforms transformers both in quality and efficiency. The local temporal patterns in audio favor Mamba's depthwise conv (for local features) plus SSM (for long-range context). This is a domain where SSMs are clearly preferable.

Genomics

DNA sequences can be millions of base pairs long. The HyenaDNA paper (Nguyen et al., 2023) showed that sub-quadratic long-convolution models, close relatives of SSMs, outperformed transformers on long-range genomics tasks, where the relevant regulatory signals can be hundreds of thousands of base pairs from the affected gene. Mamba-based models trained on sequences of up to roughly a million base pairs reach accuracy that transformer-based models, limited to much shorter context windows, struggle to match.

Long-Document NLP

On SCROLLS (Standardized CompaRison Over Long Language Sequences), a long-context evaluation suite with documents up to 100K tokens:

Task	GPT-3.5 (16K)	Mamba-2.8B	Notes
QuALITY	61.7%	57.3%	Q&A over 6K articles
SummScreenFD	18.2%	16.1%	TV show summarization
GovReport	57.3%	54.8%	Government report summarization
QASPER	41.2%	35.7%	Scientific paper Q&A

On long-document tasks, Mamba-2.8B is competitive with (but not beating) larger transformer models. The quality gap narrows significantly when Mamba can process the full document context rather than being constrained to a truncated window.

The Fundamental Trade-off: Compressed State vs Full Context

The core architectural difference translates to a concrete capability trade-off:

The Compressed State Problem in Practice

Imagine Mamba's hidden state as a highly compressed summary of everything it has read. If you ask it "what is the third word in the document?" it has to find that in the compressed summary. If the compression threw away that specific detail (because it seemed unimportant at the time), the model cannot recover it.

A transformer's KV cache is the full uncompressed memory. Every key and value from every previous position is stored verbatim. "What is the third word?" is answered by directly attending to position 3 - the information is always there.

This distinction matters for:

Legal document analysis: "What are the exact termination conditions in Section 7.3?" → Transformer
Code analysis for patterns: "Does this codebase use async/await consistently?" → Mamba or hybrid
Meeting summarization: "What were the main themes discussed?" → Mamba (compression is the goal)
Meeting analysis: "What exact commitment did Alice make about the Q3 deadline?" → Transformer
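The difference between a verbatim cache and a fixed-size state can be felt in a toy experiment. The sketch below is not Mamba's selective scan: it is a classic linear associative memory, used here as a stand-in for "everything squeezed into one fixed-size vector". All names, the random ±1 key codes, and the single-digit values are assumptions of the example.

```python
import random

def recall_accuracy(num_pairs, state_dim, seed=0):
    """Return (kv_cache_accuracy, fixed_state_accuracy) on a toy recall task."""
    rng = random.Random(seed)
    # Each key id gets a random +/-1 code vector; each value is a digit 0-9.
    codes = [[rng.choice((-1.0, 1.0)) for _ in range(state_dim)]
             for _ in range(num_pairs)]
    values = [rng.randrange(10) for _ in range(num_pairs)]

    # "KV cache": every pair stored verbatim, memory grows with num_pairs.
    cache = dict(enumerate(values))

    # "Compressed state": one vector of size state_dim, whatever num_pairs is.
    state = [0.0] * state_dim
    for code, value in zip(codes, values):
        for i in range(state_dim):
            state[i] += value * code[i]

    cache_hits = state_hits = 0
    for key_id, (code, target) in enumerate(zip(codes, values)):
        cache_hits += cache[key_id] == target            # exact lookup
        # Lossy readout: project the state back onto this key's code.
        estimate = sum(s * c for s, c in zip(state, code)) / state_dim
        state_hits += round(estimate) == target
    return cache_hits / num_pairs, state_hits / num_pairs

for n in (1, 4, 16, 64, 256):
    kv_acc, state_acc = recall_accuracy(n, state_dim=64)
    print(f"{n:>4} pairs: cache={kv_acc:.2f}  fixed state={state_acc:.2f}")
```

The cache is always exact because nothing is ever merged. The fixed state is exact for a single pair and degrades as more pairs are superimposed, since every stored item adds interference to every readout. A learned model compresses far more cleverly than this, but it faces the same constraint: a fixed number of state values must serve an unbounded number of facts.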

Scale: Does Mamba Beat Transformers at Larger Sizes?

The honest answer: at frontier scales (70B+), transformers still dominate. The Mamba paper tested up to 3B parameters. At this scale, Mamba wins. But:

Training data efficiency: Transformers at 70B have been trained on trillions of tokens with carefully engineered training recipes. Mamba models at comparable scale have not yet been trained with the same level of resources.
Architecture maturity: Transformer training is deeply understood - hyperparameter tuning, learning rate schedules, gradient clipping strategies. Mamba's optimal training recipe is less well-established.
Hybrid advantage: Every major deployment of SSM-like architectures at frontier scale (Jamba, Zamba) uses hybrids, not pure Mamba. This suggests that pure Mamba has qualitative limitations that require some attention layers to overcome.

The comparison at 70B+ scale is essentially unknown because no pure Mamba model at that scale has been publicly released and benchmarked against comparable transformers. The Falcon Mamba 7B (2024) is the largest publicly released pure SSM model and performs on par with Llama 3 8B on most benchmarks - but below on retrieval-heavy tasks.

Practical Decision Framework

Use this framework to decide between transformer, Mamba, and hybrid:

def recommend_architecture(
    max_sequence_length: int,
    task_requires_precise_retrieval: bool,
    task_requires_strong_icl: bool,
    memory_budget_gb: float,
    latency_requirement: str,  # "low", "medium", "high"
    scale: str,  # "small" (<7B), "medium" (7-30B), "large" (30B+)
) -> str:
    """
    Architecture recommendation based on task requirements.
    Returns a short human-readable recommendation that starts with
    "transformer", "mamba", or "hybrid".
    """
    # Hard constraints first
    if scale == "large":
        # At large scale, only transformers have proven SOTA results
        return "transformer (hybrid worth exploring)"

    if max_sequence_length > 50_000:
        if task_requires_precise_retrieval:
            # Need full context AND precise retrieval: hybrid is the only viable choice
            return "hybrid (e.g., Jamba)"
        else:
            # Long sequences without precise retrieval: Mamba is ideal
            return "mamba"

    # From here on, sequences are at most 50K tokens
    if task_requires_precise_retrieval or task_requires_strong_icl:
        return "transformer"

    if memory_budget_gb < 20:
        # Tight memory budget: Mamba's constant state is critical
        return "mamba"

    if latency_requirement == "low" and max_sequence_length > 10_000:
        # Low latency at long context: Mamba wins
        return "mamba"

    # For most general NLP tasks under 10K tokens with adequate resources
    return "transformer (safe default, Mamba within 5% on most benchmarks)"

Common Mistakes

:::danger Benchmarking Mamba Only on Perplexity

Perplexity is the best metric for raw language modeling ability, and Mamba scores well on it. But perplexity does not capture recall ability, precise retrieval, or ICL performance - exactly the tasks where transformers maintain a strong advantage. If your application involves any of these capabilities, measure them directly before deploying Mamba.

:::

:::warning Assuming Mamba's Throughput Advantage Applies at All Sequence Lengths

Mamba's throughput advantage only appears above a few hundred tokens (between 128 and 512 in the throughput table above). Below this, the transformer's highly optimized dense attention operations (Flash Attention, tensor cores) are faster than Mamba's parallel scan. For short chatbot interactions of a few hundred tokens, a transformer with Flash Attention can match or beat Mamba on latency.

:::

:::warning Comparing Mamba Models to Instruction-Tuned Transformers

Mamba-2.8B (base model) should be compared to transformer base models (Pythia, GPT-Neo, Falcon 7B base), not to instruction-tuned models like ChatGPT or Llama 3 Instruct. Instruction tuning adds significant capability improvements that are independent of the underlying architecture. An unfair comparison makes Mamba appear worse than it is relative to transformers of comparable training.

:::

Interview Q&A

Q1: On what tasks does Mamba outperform transformers, and what explains the advantage?

Mamba outperforms transformers in four main areas: (1) Throughput at long sequences - at 4K+ tokens, Mamba achieves 5-30x higher inference throughput because its recurrent computation is O(1) per token while the transformer's attention scales with context length; (2) Memory at inference - Mamba's hidden state is fixed size (~33MB for 7B) while the transformer KV cache grows linearly (52GB for 7B at 100K tokens); (3) Streaming applications - Mamba's constant-memory recurrence is ideal for processing tokens as they arrive; (4) Long-range pattern tasks in structured domains like genomics and audio, where the sequence structure is regular and compression works well. The advantage is architectural: Mamba processes long sequences in O(n) compute and O(1) memory, while transformers are O(n²) compute and O(n) memory.

Q2: What is the MQAR benchmark, and why does it show transformers outperforming Mamba?

MQAR (Multi-Query Associative Recall) presents key-value pairs (e.g., A→1, B→2) and then queries specific values (e.g., A→?). It tests exact associative retrieval - the ability to recover specific stored information precisely. Transformers excel at this because their KV cache stores every key-value pair verbatim; the attention mechanism can directly "look up" the relevant pair with near-perfect accuracy. Mamba compresses past context into a fixed-size hidden state; the specific key-value associations may be lost or blurred during compression. The gap is dramatic: transformers achieve 95-99% accuracy at all tested sequence lengths, while Mamba falls to under 10% accuracy at longer sequences.

Q3: Why might you choose a pure Mamba model over a hybrid (Jamba-style) architecture?

You would choose pure Mamba when: (1) Your task does not require precise retrieval (summarization, audio processing, genomics) - hybrids are overkill; (2) Memory efficiency is paramount - even a few attention layers add KV cache growth for those layers; (3) Deployment simplicity matters - pure Mamba has a simpler and more predictable memory profile; (4) Streaming inference is required with a strict constant-memory guarantee - hybrid models' attention layers still grow with context. In practice, for language applications, hybrids almost always outperform pure Mamba, making them the better default choice for NLP.

Q4: How does Mamba's performance compare to transformers at 70B scale?

This question has no definitive answer yet because no publicly released pure Mamba model at 70B+ has been benchmarked against comparable transformers with equivalent training. Falcon Mamba 7B (the largest public pure SSM, released 2024) performs comparably to Llama 3 8B on most benchmarks but underperforms on retrieval tasks. The hybrid Jamba-1.5 Large (398B total parameters, 94B active) performs competitively with frontier models. The honest engineering answer: at 70B+ scale, use a transformer or hybrid until evidence emerges for pure SSM at that scale. The architecture question is genuinely open.

Q5: Describe the fundamental architectural trade-off between compressed state and full context access.

A transformer with attention maintains a complete, uncompressed copy of all past tokens (the KV cache). Every previous token's key and value vectors are preserved verbatim. During generation, the attention mechanism can access any past position with full precision. This comes at O(n) memory cost and O(n) compute per token.

A Mamba model maintains a fixed-size hidden state that is an information-lossy summary of all past tokens. The state update at each step follows $h_k = \bar{A}h_{k-1} + \bar{B}u_k$ - old information is mixed with new information, and some detail is inevitably lost. The state size is constant regardless of how many tokens have been processed. Access to past information is "soft" - the model cannot retrieve specific earlier content, only what remains in the compressed state.

This trade-off determines capability profiles: transformers excel at tasks requiring precise retrieval, exact copying, and in-context learning from specific examples. Mamba excels at tasks where the global pattern matters more than specific details: summarization, sentiment, translation, long-range structural understanding. The compressed state is not a bug - it forces the model to learn what actually matters for prediction, which for many tasks is the pattern, not the specifics.
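Writing the two read paths side by side makes the trade-off visible in the notation. The attention line is the standard causal softmax form with query, key, and value vectors in bold (so the key vector $\mathbf{k}_j$ is not confused with the position index $k$). The readout $y_k = C h_k$ is the usual SSM output equation, added here as an assumption since the lesson states only the state update.

```latex
\begin{aligned}
\text{Attention:}\quad
  & y_k = \sum_{j \le k} \operatorname{softmax}_j\!\left(
      \frac{\mathbf{q}_k^{\top} \mathbf{k}_j}{\sqrt{d}} \right) \mathbf{v}_j
  && \text{reads all } k \text{ stored pairs: } O(k) \text{ memory} \\
\text{SSM:}\quad
  & h_k = \bar{A} h_{k-1} + \bar{B} u_k, \qquad y_k = C h_k
  && \text{reads one fixed-size state: } O(1) \text{ memory}
\end{aligned}
```

In the attention sum, any single $\mathbf{v}_j$ can be selected almost exclusively if its key matches the query. In the recurrence, the contribution of $u_j$ to $y_k$ has already been folded into $h_k$ together with everything else, so it can only be recovered to the extent that the state still encodes it.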

Coding Tasks: A Nuanced Comparison

Code generation is an important test case because it combines multiple capabilities that favor different architectures:

Repository-level context (long-range, favors Mamba's efficiency): Understanding how modules depend on each other across a large codebase. Mamba can hold more of the codebase in context without memory constraints.

Precise API usage (exact recall, favors transformer): Reproducing the exact signature of a function defined 50K tokens earlier. "Call process_data(df, schema=Schema.V2, validate=True) with exactly these arguments." The transformer's KV cache can look up the exact function signature; Mamba's compressed state may misremember optional arguments.

Code completion (local pattern, both competitive): Completing a line of code based on local context. Both architectures handle this similarly.

Benchmarks from Mamba paper (Code benchmark on HumanEval):

Model	HumanEval Pass@1
Mamba-2.8B	36.1%
Code-Llama-7B	33.5%
Pythia-2.8B	23.2%
StarCoder-15.5B	33.6%

In this table, Mamba-2.8B scores well above Pythia-2.8B and slightly above the larger code-specialized transformers - but HumanEval problems are short and self-contained, not requiring long-range cross-file retrieval. On longer-range coding tasks (like repository-level code completion), the comparison is less studied and likely favors hybrids.

Instruction Following and RLHF

An important caveat in the benchmarks above: most comparisons are between base models. Instruction following quality depends heavily on RLHF/RLHF-equivalent fine-tuning (DPO, ORPO, SFT on instruction data), which is independent of the underlying architecture.

A Mamba base model fine-tuned with DPO can follow instructions, use tools, and respond helpfully - just like a transformer. The quality of instruction following primarily depends on:

The quality of the instruction-tuning dataset
The amount of SFT/RLHF compute applied
The base model's language understanding (where architecture matters)

Falcon Mamba 7B Instruct (instruction-tuned Falcon Mamba) shows that SSMs can be effectively instruction-tuned and perform comparably to transformer instruct models of similar size on general instruction following benchmarks (MT-Bench, AlpacaEval).

State Size as a Quality Lever

In Mamba, the state dimension d_state is a hyperparameter that trades memory for quality. Unlike the KV cache (which grows with sequence length), increasing Mamba's state dimension increases memory by a fixed constant:

def mamba_state_size_tradeoff(
    d_model: int = 4096,
    n_layers: int = 64,
    dtype_bytes: int = 4,  # Assumption: state kept in float32; use 2 for float16
):
    """
    Show how increasing d_state affects quality and memory (constant, not growing).
    """
    d_inner = 2 * d_model  # Mamba expands d_model by 2x inside each block

    print(f"Mamba state memory for d_model={d_model}, n_layers={n_layers}:\n")
    print(f"{'d_state':>8} | {'State Memory (MB)':>20} | {'Quality Impact':>25}")
    print("-" * 60)

    # Rough estimates used for illustration in this lesson
    quality_estimates = {
        8:  "~85% of d_state=64 quality",
        16: "~93% of d_state=64 quality (Mamba-1 default)",
        32: "~97% of d_state=64 quality",
        64: "Mamba-2 default - best quality",
        128: "Diminishing returns above 64",
        256: "Minimal improvement, 4x state cost",
    }

    for d_state, quality in quality_estimates.items():
        # The SSM state holds d_inner * d_state values per layer
        state_bytes = n_layers * d_inner * d_state * dtype_bytes
        state_mb = state_bytes / 1e6
        print(f"{d_state:>8} | {state_mb:>20.1f} | {quality:>25}")

    print("\nKey insight: even at d_state=256, state memory is only ~537MB.")
    print("Compare to transformer KV cache: 52GB at 100K tokens.")
    print("The quality vs memory tradeoff is entirely different architecturally.")

mamba_state_size_tradeoff()

The implication: if you need better recall from Mamba, you can increase d_state to improve compression quality while still maintaining O(1) memory at inference. The state grows, but only by a fixed constant - not with sequence length. This gives you a quality dial that transformers don't have in a comparable form.

Real-World Benchmark: Long Document Question Answering

To see how the architectural differences play out on a practical task, consider a long document Q&A evaluation:

Setup: 100 documents averaging 80K tokens each. Questions require finding specific numeric values, names, and dates from within the documents.

Task types:

"What was the revenue in Q3 2023?" (specific number retrieval)
"Who signed the agreement on behalf of the purchasing company?" (entity retrieval)
"Summarize the main risks outlined in section 5." (understanding, not retrieval)
"What is the overall tone of the document?" (global assessment)

Expected performance profile:

Task Type	Transformer	Mamba	Hybrid
Specific number retrieval	High	Lower	High
Entity retrieval	High	Lower	High
Section summarization	High	High	High
Global assessment	High	High	High
Within context window	Yes	Yes (full doc)	Yes (full doc)
Memory at 80K tokens	~40 GB	33 MB	~10 GB

For tasks 1 and 2, transformers win because the KV cache preserves the exact values. For tasks 3 and 4, Mamba is competitive because compression is aligned with the task. The hybrid wins on precision tasks while processing the full document, but requires more memory than pure Mamba.

This table is the decision matrix for architecture selection: map your application's task distribution to this profile, and choose accordingly.

Measuring the Compression: What Mamba Actually Remembers

A useful experiment to build intuition for what Mamba's hidden state preserves:

import torch
from transformers import MambaForCausalLM, AutoTokenizer

# Use a GPU when available; float16 on CPU is slow and not supported by every op
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

model = MambaForCausalLM.from_pretrained(
    "state-spaces/mamba-2.8b-hf",
    torch_dtype=dtype,
).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-2.8b-hf")

def test_recall_at_distance(model, tokenizer, distance_tokens: int):
    """
    Test if Mamba can recall a specific value placed N tokens before the query.
    """
    # Plant a specific value
    planted_value = "ZEPHYR-7742"
    # Each repetition is roughly 5 tokens, so this approximates the distance
    filler = "The sky is blue. " * (distance_tokens // 5)

    prompt = (
        f"Remember this code: {planted_value}\n\n"
        f"{filler}\n\n"
        f"What was the code? The code was: "
    )

    # Keep the inputs on the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=20,
            do_sample=False,  # greedy decoding for reproducibility
        )

    # Decode only the newly generated tokens, not the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    generated = tokenizer.decode(new_tokens, skip_special_tokens=True)
    success = planted_value.lower() in generated.lower()
    return success, generated.strip()

# Test at various distances
print("Mamba recall test at various distances:\n")
for distance in [100, 500, 1000, 2000, 5000]:
    success, generated = test_recall_at_distance(model, tokenizer, distance)
    status = "RECALLED" if success else "LOST"
    print(f"Distance {distance:>5} tokens: [{status}] → '{generated[:40]}'")

# Expected results: Mamba successfully recalls at shorter distances,
# success rate drops significantly beyond ~1000-2000 tokens
# Contrast with transformer: near-perfect recall at any distance in context window

This experiment is designed to expose the compressed-state limitation: specific verbatim recall degrades with distance. The model is not randomly forgetting - it is progressively compressing, and the specific code string loses its exact representation in the state. Understanding this helps set expectations for Mamba-based applications.
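A single planted code at each distance gives a noisy picture, so a natural extension is to average over several random codes. The sketch below is model-agnostic: `generate` is any callable that maps a prompt string to a completion string, and `truncating_echo` is a deliberately fake stand-in used only to show the harness running. Both names, the code format, and the trial count are assumptions of this example.

```python
import random
import string

def make_code(rng):
    """Random code such as 'QWERTY-4821' that cannot be guessed from priors."""
    word = "".join(rng.choice(string.ascii_uppercase) for _ in range(6))
    return f"{word}-{rng.randrange(1000, 10000)}"

def recall_rate(generate, distances, trials=5, seed=0):
    """Return {distance: fraction of trials where the code was reproduced}."""
    rng = random.Random(seed)
    rates = {}
    for distance in distances:
        hits = 0
        for _ in range(trials):
            code = make_code(rng)
            filler = "The sky is blue. " * (distance // 5)
            prompt = f"Remember this code: {code}\n\n{filler}\n\nThe code was: "
            hits += code.lower() in generate(prompt).lower()
        rates[distance] = hits / trials
    return rates

def truncating_echo(prompt, window=2000):
    """Fake 'model': repeats only the last `window` characters it saw."""
    return prompt[-window:]

print(recall_rate(truncating_echo, distances=[10, 100, 1000, 5000]))
```

To evaluate a real model, pass a function that tokenizes the prompt, calls the model's generation method, and returns the decoded continuation. Plotting the resulting rates against distance for a Mamba model and a transformer of similar size puts the two recall profiles on the same axes.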
