What is speculative decoding?

Learn how speculative decoding uses a small draft model to generate tokens that a large target model verifies in parallel, achieving 2-3x speedup with no quality loss.


Speculative Decoding

The Production Scenario

You are running a coding assistant backed by a 70B model. Users are frustrated. The model is brilliant - it writes production-quality code with perfect edge case handling - but it takes 45 seconds to generate a 300-line function. Users start typing their requests and then wander off to get coffee. The engagement metrics are terrible.

You cannot switch to a smaller model. You tried 7B and 13B. The code quality dropped enough that users started filing support tickets about bugs in AI-generated code. You need the 70B's quality. But the 70B's speed is unacceptable.

Then you read the speculative decoding paper. The key insight is almost annoyingly simple: the 70B model and a 7B model agree on most tokens - they differ only on difficult, nuanced tokens. For "if i == len(array) - 1:", every token is utterly predictable. The 70B is spending enormous compute to confirm what any reasonable model would generate. What if you could batch these confirmations?

Speculative decoding does exactly this. The 7B draft model generates 5 tokens. The 70B target model verifies all 5 in a single parallel forward pass, which takes roughly the same wall-clock time as generating one token the normal way, because single-token decoding is limited by memory bandwidth rather than arithmetic. If the 7B guessed correctly (each token is accepted ~70-80% of the time for code), you get several tokens for the price of one target pass. If it guessed wrong, you fall back gracefully and lose almost nothing: the target pass still yields one valid token. The result: 2–3× speedup with mathematically identical output distribution.

Why This Exists: Breaking the Sequential Bottleneck

As covered in Module 01, autoregressive decoding has an unavoidable sequential dependency: you cannot generate token $t$ until you have token $t-1$. This prevents parallelization across the sequence length dimension during decode.

But there is a different kind of parallelization available: the target model can score multiple proposed tokens simultaneously in a single forward pass. The forward pass is parallelized across the sequence dimension, and because single-token decoding is dominated by reading the model weights from memory, scoring 5 positions takes roughly as long as scoring 1 (ignoring the extra attention work). Speculative decoding exploits this.
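To make the "verify many positions in one pass" idea concrete, here is a minimal sketch with a toy causal model. The model (`toy_causal_logits`, a running mean of embeddings followed by a projection), the sizes, and the token IDs are all hypothetical stand-ins, not a real transformer; the only point is that a causal model's logits at position i depend on tokens 0..i, so one pass over prefix + draft returns the same rows as running the model once per position.

```python
import numpy as np


def toy_causal_logits(token_ids, embed, out_proj):
    """Toy causal 'model': position i only sees tokens 0..i (running mean of embeddings)."""
    x = embed[token_ids]                                   # [seq, d]
    counts = np.arange(1, len(token_ids) + 1)[:, None]     # [seq, 1]
    hidden = np.cumsum(x, axis=0) / counts                 # causal running mean
    return hidden @ out_proj                               # [seq, vocab]


rng = np.random.default_rng(0)
vocab, d = 50, 16
embed = rng.normal(size=(vocab, d))
out_proj = rng.normal(size=(d, vocab))

prefix = [3, 17, 42]
draft = [8, 8, 25, 1, 30]          # k = 5 hypothetical draft tokens
full = np.array(prefix + draft)

# One pass over prefix + draft yields logits for every position at once.
parallel = toy_causal_logits(full, embed, out_proj)

# Sequential decoding needs one pass per position to obtain the same rows.
sequential = np.stack([
    toy_causal_logits(full[:n], embed, out_proj)[-1]
    for n in range(1, len(full) + 1)
])

max_diff = float(np.abs(parallel - sequential).max())
print(f"max difference between parallel and sequential logits: {max_diff:.2e}")
```

In a real transformer the same property comes from the causal attention mask; the toy model only mimics that dependency structure.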

The key observation enabling speculative decoding: for most tokens in typical text, most models agree. The uncertainty - and thus the need for the large model's judgment - is concentrated in a small fraction of tokens. If you can identify and skip the "easy" tokens cheaply, you only need the large model for the "hard" ones.

Historical Context

Speculative decoding was proposed independently by two groups within a few months of each other:

Leviathan et al. (2022) - "Fast Inference from Transformers via Speculative Decoding" (Google)
Chen et al. (2023) - "Accelerating Large Language Model Decoding with Speculative Sampling" (DeepMind)

Both papers prove that the acceptance-rejection criterion guarantees the output distribution is identical to the target model - not approximately identical, but exactly identical in distribution. This "lossless" property is crucial for production deployment where you cannot accept quality regression.

Subsequent work improved the acceptance rate and reduced the need for a separate draft model:

Medusa (Cai et al., 2024): Add multiple prediction heads to the target model to predict future tokens
EAGLE (Li et al., 2024): Draft at the feature (hidden state) level rather than token level, achieving higher acceptance rates
Self-speculative decoding: Use early exit from intermediate layers as the draft

How Speculative Decoding Works

The Algorithm

Given: Target model $p$ (large, slow), Draft model $q$ (small, fast), draft length $k$

One speculative decoding step:

Draft phase: Run the draft model autoregressively to generate $k$ candidate tokens $(\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_k)$, one at a time. Cost: $k$ small model forward passes.
Verification phase: Run the target model on the full sequence (prefix + $k$ drafted tokens) in ONE forward pass. This produces probability distributions $p(\cdot | \text{prefix}, \tilde{x}_1, ..., \tilde{x}_{i-1})$ for each position $i = 1, ..., k+1$. Cost: 1 large model forward pass.
Accept/reject each drafted token with the following criterion:

For each token $\tilde{x}_i$ at position $i$:

Compute acceptance probability: $\alpha_i = \min\left(1, \frac{p(\tilde{x}_i | \text{context})}{q(\tilde{x}_i | \text{context})}\right)$
Sample $u_i \sim \text{Uniform}(0, 1)$
If $u_i \leq \alpha_i$: accept $\tilde{x}_i$ and continue to position $i+1$
If $u_i > \alpha_i$: reject $\tilde{x}_i$, sample a correction token from an adjusted distribution, and stop

Correction token: When a token is rejected at position $i$, sample from: $p'(x) = \text{normalize}(\max(0, p(x | \text{context}) - q(x | \text{context})))$

This corrects for the draft model's error while maintaining the target distribution.
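The accept/reject rule for a single position can be sketched in a few lines of NumPy. The distributions `p` and `q` below are made-up three-token examples, and `verify_one_token` is a hypothetical helper name; this is an illustration of the rule as stated above, not code from either paper.

```python
import numpy as np


def verify_one_token(p, q, draft_token, rng):
    """Apply the accept/reject rule to one draft token.

    p, q: target and draft distributions over the vocabulary (1-D arrays).
    Returns (emitted_token, was_accepted).
    """
    alpha = min(1.0, p[draft_token] / q[draft_token])
    if rng.uniform() <= alpha:
        return draft_token, True
    # Rejected: sample the correction token from normalize(max(0, p - q))
    residual = np.maximum(p - q, 0.0)
    residual = residual / residual.sum()
    return int(rng.choice(len(p), p=residual)), False


rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # toy target distribution
q = np.array([0.3, 0.5, 0.2])  # toy draft distribution

n_trials = 50_000
counts = np.zeros(len(p))
n_accepted = 0
for _ in range(n_trials):
    draft_token = int(rng.choice(len(q), p=q))  # the draft model proposes a token
    token, accepted = verify_one_token(p, q, draft_token, rng)
    counts[token] += 1
    n_accepted += accepted

empirical = counts / n_trials
acceptance_rate = n_accepted / n_trials
print("empirical output distribution:", np.round(empirical, 3))
print("acceptance rate:", round(acceptance_rate, 3))
```

With enough trials the empirical output distribution should approach `p` even though every proposal came from `q`, and the acceptance rate should approach the overlap between the two distributions.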

Why It Is Lossless

The acceptance probability $\alpha = \min(1, p/q)$ and the correction distribution are designed so that each emitted token is distributed exactly according to the target model's distribution $p$. This is a form of rejection sampling applied at every position.

Proof sketch: At each position $i$, the probability of token $x$ being output is the sum of two terms:

Probability proposed and accepted as draft: $q(x) \times \min(1, p(x)/q(x)) = \min(q(x), p(x))$
Probability from correction: $P(\text{rejected}) \times p'(x)$, where $P(\text{rejected}) = 1 - \sum_{x'} \min(q(x'), p(x')) = \sum_{x'} \max(0, p(x') - q(x'))$. This is exactly the normalizer of $p'$, so the product simplifies to $\max(0, p(x) - q(x))$.

The total is $\min(q(x), p(x)) + \max(0, p(x) - q(x)) = p(x)$ for all $x$. The output distribution is exactly $p$, regardless of the draft model's quality; a poor draft model only lowers the acceptance rate. ✓
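The identity in the proof sketch can also be checked exactly, without sampling. The sketch below draws two arbitrary distributions (the vocabulary size and seed are arbitrary choices) and confirms that the accepted mass plus the correction mass reproduces `p`.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size = 8
p = rng.dirichlet(np.ones(vocab_size))  # arbitrary target distribution
q = rng.dirichlet(np.ones(vocab_size))  # arbitrary draft distribution

# Mass emitted via "draft proposes x and x is accepted": min(p, q)
accept_mass = np.minimum(p, q)

# Total probability of rejecting the proposal
p_rejected = 1.0 - accept_mass.sum()

# Correction distribution p'(x) = normalize(max(0, p - q))
residual = np.maximum(p - q, 0.0)
correction = residual / residual.sum()

# Overall output distribution of one speculative position
output = accept_mass + p_rejected * correction
print("max |output - p| =", float(np.abs(output - p).max()))
```

The rejection probability equals the normalizer of the correction distribution, which is why the two terms add up to `p` term by term.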

Visualizing the Algorithm

Expected Speedup Math

Let $\alpha$ be the average acceptance rate per drafted token (the probability that a draft token is accepted, assumed independent across positions). With draft length $k$:

Expected tokens produced per speculative step: $E[\text{tokens}] = 1 + \sum_{i=1}^{k} \alpha^i = \frac{1 - \alpha^{k+1}}{1 - \alpha}$

The sum counts accepted draft tokens: draft token $i$ survives only if tokens $1, ..., i$ are all accepted, which has probability $\alpha^i$. The "+1" is the extra token the target pass always yields: a correction token after a rejection, or a bonus token sampled from the target when all $k$ drafts are accepted.

Speedup ratio (assuming each draft forward pass costs a fraction $c$ of a target forward pass):

$\text{speedup} = \frac{\text{expected tokens out}}{\text{compute cost}} = \frac{E[\text{tokens}]}{k \cdot c + 1}$

For typical values ($\alpha = 0.8$, $k = 5$, $c = 0.1$ for a 7B drafting for 70B):

$E[\text{tokens}] = \frac{1 - 0.8^6}{1 - 0.8} \approx 3.69$

$\text{speedup} = \frac{3.69}{5 \times 0.1 + 1} = \frac{3.69}{1.5} \approx 2.5\times$

This is consistent with the 2–3× speedups typically reported for speculative decoding with well-matched models.
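As a sanity check on the closed form $\frac{1 - \alpha^{k+1}}{1 - \alpha}$ for the expected number of tokens per step (accepted drafts plus the one extra target token), here is a small Monte Carlo sketch. It assumes each draft token is accepted independently with probability $\alpha$, which is the same simplification the formula makes; `simulate_tokens_per_step` is a hypothetical helper name.

```python
import numpy as np


def simulate_tokens_per_step(alpha: float, k: int, n_steps: int, seed: int = 0) -> float:
    """Average tokens emitted per speculative step under i.i.d. acceptance."""
    rng = np.random.default_rng(seed)
    accepts = rng.uniform(size=(n_steps, k)) < alpha
    # Accepted drafts = length of the leading run of True values in each row;
    # cumprod zeroes out everything after the first rejection.
    n_accepted = np.cumprod(accepts, axis=1).sum(axis=1)
    # +1 for the correction token (after a rejection) or bonus token (full acceptance)
    return float((n_accepted + 1).mean())


alpha, k = 0.8, 5
closed_form = (1 - alpha ** (k + 1)) / (1 - alpha)
simulated = simulate_tokens_per_step(alpha, k, n_steps=200_000)
print(f"closed form: {closed_form:.3f}  simulated: {simulated:.3f}")
```

Real acceptance events are correlated (hard passages cluster), so measured tokens per step can deviate from this idealized estimate.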

import numpy as np


def compute_expected_speedup(
    alpha: float,       # Acceptance rate per token
    k: int,            # Draft length
    c: float,          # Draft model cost as fraction of target
) -> dict:
    """
    Compute theoretical speedup for speculative decoding.
    """
    # Expected number of accepted draft tokens
    expected_accepted = sum(alpha ** i for i in range(1, k + 1))
    total_tokens_per_step = expected_accepted + 1  # +1 for correction token

    # Compute cost per step
    # k small forward passes + 1 large forward pass
    cost_per_step = k * c + 1  # In units of target model forward passes

    speedup = total_tokens_per_step / cost_per_step

    return {
        "alpha": alpha,
        "k": k,
        "c": c,
        "expected_accepted_tokens": round(expected_accepted, 2),
        "total_tokens_per_step": round(total_tokens_per_step, 2),
        "cost_per_step_target_equiv": round(cost_per_step, 2),
        "expected_speedup": round(speedup, 2),
    }


# Sensitivity analysis
print("Speculative Decoding Speedup Analysis")
print(f"{'Alpha':>8} {'k':>4} {'c':>6} {'Speedup':>10}")
print("-" * 35)

for alpha in [0.6, 0.7, 0.8, 0.85, 0.9]:
    result = compute_expected_speedup(alpha=alpha, k=5, c=0.1)
    print(f"{alpha:>8.2f} {'5':>4} {'0.10':>6} {result['expected_speedup']:>10.2f}x")

print()
# Vary k with fixed alpha and c
for k in [2, 3, 4, 5, 7, 10]:
    result = compute_expected_speedup(alpha=0.8, k=k, c=0.1)
    print(f"{'0.80':>8} {k:>4} {'0.10':>6} {result['expected_speedup']:>10.2f}x")

Implementing Speculative Decoding from Scratch

import torch
import torch.nn.functional as F
from typing import List, Tuple


def speculative_decode_step(
    target_model,
    draft_model,
    input_ids: torch.Tensor,
    k: int = 5,
    temperature: float = 1.0,
) -> Tuple[torch.Tensor, int, int]:
    """
    One step of speculative decoding: draft k tokens, verify with target.

    Args:
        target_model: Large model (slow, high quality)
        draft_model: Small model (fast, lower quality)
        input_ids: Current sequence [1, seq_len]
        k: Number of tokens to draft
        temperature: Sampling temperature

    Returns:
        new_input_ids: Updated sequence with new tokens
        n_accepted: Number of draft tokens accepted
        n_total: Total new tokens added (accepted + 1 correction)
    """
    device = input_ids.device
    seq_len = input_ids.shape[1]

    # --- Draft phase: generate k tokens with draft model ---
    draft_tokens = []
    draft_probs = []   # q(token | context) for each drafted token
    draft_ids = input_ids.clone()

    for _ in range(k):
        with torch.no_grad():
            draft_out = draft_model(draft_ids)
            logits = draft_out.logits[0, -1, :]  # Last position logits

            if temperature > 0:
                probs = F.softmax(logits / temperature, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
            else:
                next_token = logits.argmax(dim=-1, keepdim=True)
                # one_hot on a [1]-shaped index returns [1, vocab]; squeeze to [vocab]
                probs = F.one_hot(next_token, num_classes=logits.shape[-1]).float().squeeze(0)

            # Store the draft token and its probability under the draft model
            token_prob = probs[next_token.item()].item()
            draft_tokens.append(next_token.item())
            draft_probs.append(token_prob)

            # Append to running sequence for next draft step
            draft_ids = torch.cat([draft_ids, next_token.unsqueeze(0)], dim=1)

    # --- Verification phase: single target model forward pass ---
    # Process prefix + all k draft tokens at once
    full_ids = draft_ids  # shape: [1, seq_len + k]

    with torch.no_grad():
        target_out = target_model(full_ids)
        # Get target model probabilities at each draft position
        # Position i in target_out.logits corresponds to predicting token i+1
        target_logits = target_out.logits[0, seq_len - 1 : seq_len + k - 1, :]

    # --- Accept/reject each draft token ---
    def to_probs(logits: torch.Tensor) -> torch.Tensor:
        # Temperature 0 means greedy decoding: treat the distribution as one-hot
        if temperature > 0:
            return F.softmax(logits / temperature, dim=-1)
        return F.one_hot(logits.argmax(dim=-1), num_classes=logits.shape[-1]).float()

    accepted_tokens = []
    n_accepted = 0

    for i, (draft_token, draft_prob) in enumerate(zip(draft_tokens, draft_probs)):
        # Target model distribution at this position
        target_probs_i = to_probs(target_logits[i])
        target_prob = target_probs_i[draft_token].item()

        # Acceptance probability: min(1, p / q)
        alpha = min(1.0, target_prob / max(draft_prob, 1e-10))

        # Rejection sampling
        u = torch.rand(1).item()
        if u < alpha:
            # Accept draft token
            accepted_tokens.append(draft_token)
            n_accepted += 1
        else:
            # Reject: sample correction token from adjusted distribution
            # p'(x) = normalize(max(0, target(x) - draft(x)))
            # The full draft distribution was not stored during drafting, so
            # recompute it from the same context (prefix + first i draft tokens).
            # A production implementation would cache it in the draft phase.
            with torch.no_grad():
                draft_logits_i = draft_model(full_ids[:, : seq_len + i]).logits[0, -1, :]
            draft_probs_i = to_probs(draft_logits_i)
            adjusted = torch.clamp(target_probs_i - draft_probs_i, min=0.0)

            if adjusted.sum() < 1e-8:
                # Target and draft are numerically identical here: sample from target
                correction_token = torch.multinomial(target_probs_i, 1).item()
            else:
                adjusted = adjusted / adjusted.sum()
                correction_token = torch.multinomial(adjusted, 1).item()

            accepted_tokens.append(correction_token)
            break

    # If all k tokens accepted, sample one bonus token from the final target position
    if n_accepted == k:
        final_target_logits = target_out.logits[0, seq_len + k - 1, :]
        final_probs = to_probs(final_target_logits)
        bonus_token = torch.multinomial(final_probs, 1).item()
        accepted_tokens.append(bonus_token)

    # Build new sequence
    new_tokens = torch.tensor([accepted_tokens], device=device)
    new_input_ids = torch.cat([input_ids, new_tokens], dim=1)

    n_total = len(accepted_tokens)
    return new_input_ids, n_accepted, n_total


def speculative_generate(
    target_model,
    draft_model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 200,
    k: int = 5,
    temperature: float = 0.8,
) -> Tuple[str, dict]:
    """
    Full speculative decoding generation with statistics.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(next(target_model.parameters()).device)

    stats = {
        "total_draft_tokens": 0,
        "total_accepted_tokens": 0,
        "total_steps": 0,
        "total_output_tokens": 0,
    }

    n_generated = 0

    while n_generated < max_new_tokens:
        remaining = max_new_tokens - n_generated
        draft_k = min(k, remaining)

        input_ids, n_accepted, n_total = speculative_decode_step(
            target_model, draft_model, input_ids,
            k=draft_k, temperature=temperature
        )

        stats["total_draft_tokens"] += draft_k
        stats["total_accepted_tokens"] += n_accepted
        stats["total_steps"] += 1
        stats["total_output_tokens"] += n_total
        n_generated += n_total

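        # Check for EOS anywhere in the newly added tokens, not just the last one
        new_tokens = input_ids[0, -n_total:].tolist()
        if tokenizer.eos_token_id in new_tokens:
            # Drop anything generated after the first EOS before stopping
            n_after_eos = n_total - new_tokens.index(tokenizer.eos_token_id) - 1
            if n_after_eos > 0:
                input_ids = input_ids[:, :-n_after_eos]
            break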

    # Compute derived stats
    stats["acceptance_rate"] = (
        stats["total_accepted_tokens"] / stats["total_draft_tokens"]
        if stats["total_draft_tokens"] > 0 else 0
    )
    stats["avg_tokens_per_step"] = (
        stats["total_output_tokens"] / stats["total_steps"]
        if stats["total_steps"] > 0 else 0
    )
    # Upper bound vs the 1 token/step baseline: this ignores the draft model's cost
    stats["theoretical_speedup"] = stats["avg_tokens_per_step"]

    output_text = tokenizer.decode(
        input_ids[0, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )

    return output_text, stats

Choosing the Right Draft Model

The acceptance rate $\alpha$ is the most important factor determining speedup. It depends on how well the draft model's distribution matches the target.
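The link between distribution match and acceptance can be made quantitative: for one position, the probability that a token sampled from the draft distribution $q$ is accepted is $\sum_x \min(p(x), q(x))$, the overlap between the two distributions. The sketch below computes it for a toy target and progressively noisier drafts; the logits, the noise direction and the helper names are invented for illustration.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def expected_acceptance(p: np.ndarray, q: np.ndarray) -> float:
    """Probability that a token sampled from q is accepted: sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())


rng = np.random.default_rng(0)
target_logits = rng.normal(size=1000) * 3.0   # toy target logits
noise = rng.normal(size=1000)                 # toy "disagreement" direction
p = softmax(target_logits)

for noise_scale in [0.0, 0.5, 1.0, 2.0]:
    q = softmax(target_logits + noise_scale * noise)
    print(f"noise_scale={noise_scale:.1f}  expected acceptance={expected_acceptance(p, q):.3f}")
```

A draft that matches the target exactly is always accepted, and one with no overlapping support is never accepted; real draft models sit somewhere in between.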

What Makes a Good Draft Model

Same training distribution: Same pre-training data, same tokenizer
Same model family: LLaMA-3 8B drafting for LLaMA-3 70B - same architectural decisions
Size ratio ~10:1: Empirically, 7B drafting for 70B works well. Very small draft models (1B for 70B) have lower acceptance rates.
Same vocabulary: Must use identical tokenizer - cannot mix vocabularies
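The identical-vocabulary requirement is easy to check before deployment. The sketch below compares two token-to-ID mappings; it assumes tokenizer objects that expose a `get_vocab()` method returning a `dict` (as Hugging Face tokenizers do), and uses a hypothetical `StubTokenizer` so the example runs without downloading anything.

```python
from typing import Dict, List, Optional, Tuple


def vocab_mismatches(
    target_tok, draft_tok, limit: int = 10
) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Return up to `limit` tokens whose IDs differ between the two tokenizers."""
    target_vocab: Dict[str, int] = target_tok.get_vocab()
    draft_vocab: Dict[str, int] = draft_tok.get_vocab()
    mismatches = []
    for token in sorted(set(target_vocab) | set(draft_vocab)):
        t_id, d_id = target_vocab.get(token), draft_vocab.get(token)
        if t_id != d_id:
            mismatches.append((token, t_id, d_id))
            if len(mismatches) >= limit:
                break
    return mismatches


class StubTokenizer:
    """Stand-in for a real tokenizer; only implements get_vocab()."""

    def __init__(self, vocab: Dict[str, int]):
        self._vocab = dict(vocab)

    def get_vocab(self) -> Dict[str, int]:
        return dict(self._vocab)


target_tok = StubTokenizer({"<s>": 0, "hello": 1, "world": 2})
same_tok = StubTokenizer({"<s>": 0, "hello": 1, "world": 2})
other_tok = StubTokenizer({"<s>": 0, "hello": 2, "world": 1, "<pad>": 3})

print(vocab_mismatches(target_tok, same_tok))    # compatible pair
print(vocab_mismatches(target_tok, other_tok))   # incompatible pair
```

An empty result is necessary but not sufficient: two tokenizers can share a vocabulary and still split text differently, so comparing the encoded IDs of a few sample strings is a sensible extra check.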

Draft Model Options

| Target | Good Draft Models | Expected Alpha |
| --- | --- | --- |
| LLaMA-3 70B | LLaMA-3 8B | 0.75–0.85 |
| Mistral 7B | Mistral 7B early layers (self-spec) | 0.70–0.80 |
| GPT-4 class | GPT-3.5 class (API-based) | 0.65–0.75 |
| Custom fine-tuned 70B | Matching fine-tuned 7B | 0.80–0.90 |

Fine-tuning the target and draft models together on the same distribution improves alignment and acceptance rate.

Medusa: Multiple Prediction Heads

Medusa (Cai et al., 2024) avoids the need for a separate draft model entirely. Instead, it adds $k$ extra lightweight heads on top of the target model's final hidden states. With the base LM head predicting the next token at position $t$, head $i$ predicts the token at position $t+i$, so the heads cover positions $t+1, t+2, ..., t+k$:

import torch
import torch.nn as nn
import torch.nn.functional as F  # Used by medusa_tree_decode below


class MedusaHead(nn.Module):
    """
    Additional prediction head for Medusa speculative decoding.
    Each head predicts a future token offset from the current position.
    """

    def __init__(self, hidden_size: int, vocab_size: int, offset: int):
        super().__init__()
        self.offset = offset  # Predicts token at position t + offset
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size, bias=False),
            nn.SiLU(),
            nn.Linear(hidden_size, vocab_size, bias=False)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.head(hidden_states)


class MedusaModel(nn.Module):
    """
    Wrapper that adds Medusa heads to an existing LLM.
    Only the Medusa heads are trained; the base model is frozen.
    """

    def __init__(self, base_model, num_heads: int = 5):
        super().__init__()
        self.base_model = base_model
        hidden_size = base_model.config.hidden_size
        vocab_size = base_model.config.vocab_size

        self.medusa_heads = nn.ModuleList([
            MedusaHead(hidden_size, vocab_size, offset=i + 1)
            for i in range(num_heads)
        ])

    def forward(self, input_ids: torch.Tensor):
        # Run base model, get hidden states
        outputs = self.base_model(
            input_ids,
            output_hidden_states=True
        )
        hidden_states = outputs.hidden_states[-1]  # Last layer

        # Base model logits (for position t)
        base_logits = outputs.logits

        # Medusa heads (for positions t+1, t+2, ...)
        medusa_logits = [head(hidden_states) for head in self.medusa_heads]

        return base_logits, medusa_logits


def medusa_tree_decode(
    model: MedusaModel,
    input_ids: torch.Tensor,
    k: int = 5,
    temperature: float = 1.0,
    top_k_candidates: int = 5
) -> torch.Tensor:
    """
    Simplified Medusa decoding with candidate tree.

    The full Medusa implementation uses a candidate tree where
    each head's top-k predictions create branches - the target
    model verifies all branches simultaneously.
    """
    with torch.no_grad():
        base_logits, medusa_logits = model(input_ids)

    # Get top candidates from base model (for current position)
    base_probs = F.softmax(base_logits[0, -1] / temperature, dim=-1)
    base_top = torch.topk(base_probs, top_k_candidates)

    # Get top candidates from each Medusa head
    candidates = [[token.item() for token in base_top.indices]]
    for head_logits in medusa_logits[:k]:
        head_probs = F.softmax(head_logits[0, -1] / temperature, dim=-1)
        top_tokens = torch.topk(head_probs, top_k_candidates).indices
        candidates.append([t.item() for t in top_tokens])

    # In full Medusa: build tree of all candidate combinations,
    # verify with base model in a single batched forward pass.
    # Here we just return the most likely candidate sequence.
    best_sequence = [c[0] for c in candidates]
    return torch.tensor([best_sequence])

Medusa advantages over standard speculative decoding:

No separate draft model required
Draft heads trained cheaply (freeze base, train only heads)
Heads share the base model's rich representations
Can use tree-structured verification to evaluate multiple candidate trees

Medusa disadvantage: The heads predict independently (no autoregressive conditioning on each other's predictions), which limits acceptance rates compared to a full draft model.
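The candidate tree mentioned in the simplified decoder can be pictured as the Cartesian product of each position's top candidates. The sketch below enumerates those paths for made-up token IDs; `build_candidate_paths` and the `max_paths` cap are hypothetical, and the dense product is a simplification (the Medusa paper uses a pruned tree verified with a tree attention mask).

```python
from itertools import islice, product
from typing import List, Sequence, Tuple


def build_candidate_paths(
    per_position_candidates: Sequence[Sequence[int]],
    max_paths: int = 64,
) -> List[Tuple[int, ...]]:
    """Enumerate candidate continuations, one token choice per position.

    per_position_candidates[0] holds the base head's top tokens, and
    per_position_candidates[i] holds the top tokens from Medusa head i.
    """
    # product() iterates the last position fastest, so paths sharing a
    # prefix are adjacent; islice caps the number of paths to verify.
    return list(islice(product(*per_position_candidates), max_paths))


# Hypothetical top candidates: 2 from the base head, 2 from head 1, 1 from head 2
candidates = [[10, 11], [20, 21], [30]]
paths = build_candidate_paths(candidates)
for path in paths:
    print(path)
```

The number of paths grows multiplicatively with the candidates per head, which is why practical implementations prune the tree rather than verifying the full product.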

EAGLE: Feature-Level Drafting

EAGLE (Li et al., 2024) improves on both standard speculative decoding and Medusa by drafting at the feature (hidden state) level rather than the token level:

Instead of predicting future tokens directly, EAGLE predicts future hidden states (features) using a lightweight autoregressive model that is conditioned on both the previous features and the tokens already sampled. These predicted features are then passed through the target model's LM head to get token predictions.

This approach achieves higher acceptance rates (0.85–0.90 vs 0.75–0.85) because hidden states contain more information than one-hot token predictions, allowing better conditioning for future predictions.

Typical EAGLE speedup: 3–4× vs baseline decode, vs 2–3× for standard speculative decoding.

Self-Speculative Decoding

Self-speculative decoding uses the target model itself as the draft model by exiting early from intermediate layers:

Run the target model's forward pass but exit after layer $m$ (e.g., after 50% of layers)
Use these "shallow" outputs as draft token predictions
Complete the full forward pass to verify (or reject) the shallow predictions

This eliminates the need for a separate draft model entirely. It works because early layer predictions are often correct for easy tokens.

def self_speculative_step(
    model,
    input_ids: torch.Tensor,
    early_exit_layer: int,
    k: int = 5,
    temperature: float = 1.0
) -> Tuple[torch.Tensor, int]:
    """
    Self-speculative decoding: use early exit as the draft model.

    The target model runs a "shallow" forward pass for drafting,
    then a full forward pass for verification.
    Requires model to support early_exit_layer parameter.
    """
    # Draft: early exit at layer early_exit_layer
    draft_tokens = []
    draft_probs = []
    draft_ids = input_ids.clone()

    for _ in range(k):
        with torch.no_grad():
            # Partial forward pass
            outputs = model(
                draft_ids,
                output_hidden_states=True,
                early_exit_layer=early_exit_layer  # Custom parameter
            )
            # Use hidden state at exit layer as logits (via lm_head)
            exit_hidden = outputs.hidden_states[early_exit_layer]
            shallow_logits = model.lm_head(model.model.norm(exit_hidden))
            probs = F.softmax(shallow_logits[0, -1] / temperature, dim=-1)
            token = torch.multinomial(probs, 1)
            draft_tokens.append(token.item())
            draft_probs.append(probs[token.item()].item())
            draft_ids = torch.cat([draft_ids, token.unsqueeze(0)], dim=1)

    # Verify: full forward pass
    with torch.no_grad():
        full_outputs = model(draft_ids)  # Full pass
        # Verification proceeds as in standard speculative decoding...

    n_accepted = k  # Simplified - real implementation uses acceptance criterion
    return draft_ids, n_accepted

Production Considerations

When to Use Speculative Decoding

| Scenario | Recommended | Why |
| --- | --- | --- |
| Interactive chat | Yes | Latency-sensitive; 2–3× speedup directly improves UX |
| Code completion | Yes | High acceptance rate (predictable syntax); 3–4× speedup |
| Creative writing | Yes | Good acceptance rate; user notices speed |
| Batch processing | No | Throughput > latency; continuous batching is better |
| Very short outputs | No | Overhead of draft/verify amortizes poorly |
| Very diverse outputs (T=1.5) | Maybe | Lower acceptance rate reduces speedup |

Infrastructure Requirements

def estimate_speculative_decoding_requirements(
    target_params_b: float,
    draft_params_b: float,
    batch_size: int,
    k: int = 5
) -> dict:
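    """
    Estimate memory and throughput for speculative decoding deployment.

    Counts model weights only (FP16); KV cache and activation memory are
    not included, and batch_size is accepted but not used in the estimate.
    """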
    # Memory requirements
    target_gpu_gb = target_params_b * 2  # FP16
    draft_gpu_gb = draft_params_b * 2    # FP16
    total_gpu_gb = target_gpu_gb + draft_gpu_gb  # Both must fit simultaneously

    # Throughput estimate
    # Target: one forward pass every k+1 tokens (k draft + 1 verify)
    # Draft: k forward passes every k+1 tokens
    # Net: roughly (expected_tokens_per_step) / (cost_of_verify) improvement

    alpha = 0.8  # Typical acceptance rate
    expected_tokens = sum(alpha ** i for i in range(1, k + 1)) + 1
    speedup = expected_tokens / (1 + k * (draft_params_b / target_params_b))

    return {
        "target_gpu_gb": target_gpu_gb,
        "draft_gpu_gb": draft_gpu_gb,
        "total_gpu_gb": total_gpu_gb,
        "num_a100_80gb_needed": int(-(-total_gpu_gb // 80)),  # Ceiling division; weights only
        "expected_speedup": round(speedup, 2),
        "recommendation": (
            "Fits on single node" if total_gpu_gb <= 640
            else "Multi-node required"
        )
    }


# LLaMA-3 70B + LLaMA-3 8B draft
result = estimate_speculative_decoding_requirements(
    target_params_b=70,
    draft_params_b=8,
    batch_size=8,
    k=5
)
print("LLaMA-3 70B + LLaMA-3 8B Draft System:")
for k_name, v in result.items():
    print(f"  {k_name}: {v}")

vLLM Speculative Decoding Setup

from vllm import LLM, SamplingParams

def setup_vllm_speculative(
    target_model: str = "meta-llama/Meta-Llama-3-70B-Instruct",
    draft_model: str = "meta-llama/Meta-Llama-3-8B-Instruct",
    num_speculative_tokens: int = 5
):
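    """
    Configure vLLM with speculative decoding.
    vLLM handles PagedAttention for both target and draft models.

    Note: the speculative decoding arguments have changed between vLLM
    releases (older releases accept the keyword arguments used below,
    newer ones expect a speculative_config dict), so check the
    documentation for the installed version.
    """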
    llm = LLM(
        model=target_model,
        speculative_model=draft_model,
        num_speculative_tokens=num_speculative_tokens,
        tensor_parallel_size=4,   # 4 GPUs for target 70B
        # vLLM automatically handles KV cache for both models
    )

    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=512
    )

    return llm, sampling_params

Common Mistakes

:::danger Expecting linear speedup with draft length k

Increasing $k$ gives diminishing returns. At $\alpha = 0.8$ and $k = 5$, the expected number of tokens per step is about 3.69. At $k = 10$, it is only about 4.57, while the draft cost doubles, so with $c = 0.1$ the speedup falls from about 2.46× to about 2.29×. There is an optimal $k$ for each $\alpha$; a rough rule of thumb is $k^* \approx 1/(1-\alpha)$, with the exact optimum also depending on the draft cost $c$. For $\alpha = 0.8$, optimal $k \approx 5$. For $\alpha = 0.9$, optimal $k \approx 10$. Don't blindly use large $k$.

:::
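The trade-off between draft length and draft cost can be explored numerically. The sketch below reuses the speedup model from the Expected Speedup Math section (independent acceptance with rate $\alpha$, draft pass cost $c$ relative to a target pass) and brute-forces the best $k$; `speedup`, `best_draft_length` and the `k_max` bound are hypothetical names chosen for this example.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Expected tokens per step divided by cost in target-pass units."""
    expected_tokens = sum(alpha ** i for i in range(k + 1))  # 1 + alpha + ... + alpha^k
    return expected_tokens / (k * c + 1)


def best_draft_length(alpha: float, c: float, k_max: int = 20) -> int:
    """Draft length in 1..k_max that maximizes the modeled speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))


for alpha in [0.6, 0.7, 0.8, 0.9]:
    k_star = best_draft_length(alpha, c=0.1)
    print(f"alpha={alpha:.1f}  best k={k_star}  speedup={speedup(alpha, k_star, 0.1):.2f}x")
```

The curve is fairly flat near its maximum, so a draft length one or two away from the optimum costs little; measuring the acceptance rate on real traffic matters more than tuning $k$ precisely.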

:::danger Using mismatched tokenizers between draft and target

Draft and target models must use the exact same tokenizer with the same vocabulary. Different tokenizers produce different token IDs for the same text - the draft model's token 1234 is not the target model's token 1234. Even minor tokenizer differences (different special tokens, different BPE merges) completely break the acceptance criterion. Always verify that draft and target models share identical tokenizer configurations before deploying.

:::

:::warning Speculative decoding hurts throughput in high-batch scenarios

Speculative decoding reduces per-request latency but can hurt throughput. At high batch sizes, continuous batching fills GPUs efficiently - adding speculative decoding overhead per request can reduce the effective number of requests served per second. Speculative decoding is a latency optimization, not a throughput optimization. Use it for interactive, latency-sensitive workloads. For batch processing pipelines, continuous batching alone (without speculative decoding) is typically better.

:::

:::warning Low acceptance rate with high temperature

Speculative decoding acceptance rate drops at high temperature ($T > 1.0$) because the target model's distribution becomes flatter - the draft model's top predictions are less likely to match. At $T = 1.5$, acceptance rate can drop to 0.5–0.6, severely reducing speedup. Consider reducing draft length $k$ at high temperatures or disabling speculative decoding for very creative tasks where high temperature is required.

:::

Interview Questions

Q1: Explain speculative decoding. Why is it lossless?

Speculative decoding uses a small draft model to generate $k$ candidate tokens, then uses the large target model to verify all $k$ in a single parallel forward pass. Each candidate is accepted with probability $\min(1, p_{\text{target}} / p_{\text{draft}})$ - rejection sampling. If rejected, a correction token is sampled from $\text{normalize}(\max(0, p_{\text{target}} - p_{\text{draft}}))$. The acceptance criterion guarantees that the marginal distribution of each output token exactly matches the target model's distribution - it is exactly equivalent to sampling from the target model directly. This mathematical guarantee means you can use speculative decoding in production without any quality regression.

Q2: What is the expected speedup formula and what drives acceptance rate?

Expected speedup $\approx E[\text{tokens per step}] / \text{cost per step}$, where $E[\text{tokens per step}] = 1 + \sum_{i=1}^{k} \alpha^i$ (accepted draft tokens plus the correction or bonus token) and the cost is $k$ draft passes plus 1 target pass, i.e. $k \cdot c + 1$ in units of target passes. The key variable is $\alpha$, the per-token acceptance rate. High $\alpha$ comes from: (1) draft and target model trained on the same distribution, (2) similar model family and architecture, (3) appropriate size ratio (~10:1 is common), and (4) lower sampling temperature (greedy-leaning distributions have higher acceptance rates). For typical code generation (deterministic syntax), $\alpha$ reaches 0.85–0.90, for which the formula gives a speedup of roughly 3× at $c = 0.1$.

Q3: What is the difference between standard speculative decoding and Medusa?

Standard speculative decoding uses a separate draft model - a distinct model loaded in parallel. Medusa adds multiple lightweight prediction heads to the target model itself. Each head predicts a future token (head $i$ predicts token at position $t+i$). The heads share the target model's rich hidden states, so they can be smaller than a full draft model. Medusa has lower hardware requirements (no second model) but typically lower acceptance rates than a well-matched separate draft model because the heads predict future tokens independently without autoregressive conditioning on each other.

Q4: When should you NOT use speculative decoding in production?

(1) Batch processing workloads where throughput matters more than latency - continuous batching at high batch sizes is more efficient. (2) Very short outputs (under 20 tokens) - the setup overhead amortizes poorly. (3) Very high temperature ($T > 1.3$) creative tasks - low acceptance rates reduce speedup below break-even. (4) When you lack a well-matched draft model of the same family - mismatched models have low acceptance rates. (5) When GPU memory is tight - you need both target and draft models in memory simultaneously.

Q5: How does self-speculative decoding work and what are its trade-offs vs using a separate draft model?

Self-speculative decoding runs the target model with early exit - stopping after layer $m$ (e.g., 40% of total layers) to generate draft tokens, then completing the full forward pass for verification. Advantages: no extra memory for a separate model, no tokenizer mismatch issues, simpler infrastructure. Disadvantages: lower acceptance rates than a well-matched separate draft model (early-exit predictions are less accurate), and the early-exit forward pass still costs $m/L$ fraction of the full forward pass, so the effective draft cost is not negligible. Typically provides 1.5–2× speedup vs 2–3× for separate draft models.

Q6: How does vLLM integrate speculative decoding with PagedAttention?

vLLM manages separate PagedAttention KV caches for both the draft and target models. During the draft phase, tokens are generated autoregressively using the draft model's KV cache. During verification, the target model processes the full speculative sequence (prefix + k draft tokens) in one forward pass, using its own KV cache. When draft tokens are rejected, vLLM rewinds the target model's KV cache to the last accepted position and discards the rejected tokens' cached K/V values. The BlockManager allocates blocks for both draft and target models' KV caches from the same GPU memory pool, managed with the same reference-counting and copy-on-write mechanisms as regular continuous batching.
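The rewind step can be illustrated with a toy cache. The sketch below is not vLLM's block-based implementation; `ToyKVCache` is a hypothetical contiguous cache for a single layer and sequence, with random arrays standing in for real key/value tensors, and it only shows what "discard the rejected tokens' K/V" means.

```python
import numpy as np


class ToyKVCache:
    """Contiguous KV cache for one layer of one sequence (illustration only)."""

    def __init__(self, n_heads: int, head_dim: int):
        self.keys = np.zeros((0, n_heads, head_dim))
        self.values = np.zeros((0, n_heads, head_dim))

    def __len__(self) -> int:
        return self.keys.shape[0]

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Add K/V entries for newly processed positions."""
        self.keys = np.concatenate([self.keys, k], axis=0)
        self.values = np.concatenate([self.values, v], axis=0)

    def truncate(self, n_keep: int) -> None:
        """Rewind: keep only the first n_keep positions."""
        self.keys = self.keys[:n_keep]
        self.values = self.values[:n_keep]


rng = np.random.default_rng(0)
cache = ToyKVCache(n_heads=2, head_dim=4)

# Prefix of 10 tokens already processed by the target model
cache.append(rng.normal(size=(10, 2, 4)), rng.normal(size=(10, 2, 4)))
prefix_len = len(cache)

# The verification pass writes K/V for all k draft tokens
k = 5
cache.append(rng.normal(size=(k, 2, 4)), rng.normal(size=(k, 2, 4)))

# Suppose only the first 3 drafts were accepted: drop entries for the rest
n_accepted = 3
cache.truncate(prefix_len + n_accepted)
print("cache length after rewind:", len(cache))
```

The correction token's own K/V entries are not in the cache yet at this point; they are computed when the next step feeds that token through the model.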

