
Speculative Decoding

The Production Scenario

You are running a coding assistant backed by a 70B model. Users are frustrated. The model is brilliant - it writes production-quality code with perfect edge case handling - but it takes 45 seconds to generate a 300-line function. Users start typing their requests and then wander off to get coffee. The engagement metrics are terrible.

You cannot switch to a smaller model. You tried 7B and 13B. The code quality dropped enough that users started filing support tickets about bugs in AI-generated code. You need the 70B's quality. But the 70B's speed is unacceptable.

Then you read the speculative decoding paper. The key insight is almost annoyingly simple: the 70B model and a 7B model agree on most tokens - they differ only on difficult, nuanced tokens. For "if i == len(array) - 1:", every token is utterly predictable. The 70B is spending enormous compute to confirm what any reasonable model would generate. What if you could batch these confirmations?

Speculative decoding does exactly this. The 7B draft model generates 5 tokens. The 70B target model verifies all 5 in a single parallel forward pass, which takes roughly as long as generating one token the normal way because decoding is limited by reading the weights from memory, not by arithmetic. If the 7B guessed correctly (which it does ~70-80% of the time for code), you get several tokens for the price of one target pass. If it guessed wrong, you keep every token before the first mistake plus one corrected token, so you lose almost nothing. The result: 2–3× speedup with mathematically identical output distribution.


Why This Exists: Breaking the Sequential Bottleneck

As covered in Module 01, autoregressive decoding has an unavoidable sequential dependency: you cannot generate token $t$ until you have token $t-1$. This prevents parallelization across the sequence length dimension during decode.

But there is a different kind of parallelization available: the target model can verify multiple proposed tokens simultaneously in a single forward pass. A forward pass is parallel across the sequence dimension, and at small batch sizes decoding is bound by loading weights from memory, so scoring 5 tokens in one pass takes roughly as long as scoring 1 (ignoring the small extra attention cost). Speculative decoding exploits this.

The key observation enabling speculative decoding: for most tokens in typical text, most models agree. The uncertainty - and thus the need for the large model's judgment - is concentrated in a small fraction of tokens. If a cheap model proposes the "easy" tokens and the large model only has to confirm them in bulk, the large model's sequential steps are spent mostly on the "hard" ones.


Historical Context

Speculative decoding was proposed independently by two groups within a few months of each other:

  • Leviathan et al. (2022) - "Fast Inference from Transformers via Speculative Decoding" (Google)
  • Chen et al. (2023) - "Accelerating Large Language Model Decoding with Speculative Sampling" (DeepMind)

Both papers prove that the acceptance-rejection criterion guarantees the output distribution is identical to the target model - not approximately identical, but exactly identical in distribution. This "lossless" property is crucial for production deployment where you cannot accept quality regression.

Subsequent work improved the acceptance rate and reduced the need for a separate draft model:

  • Medusa (Cai et al., 2024): Add multiple prediction heads to the target model to predict future tokens
  • EAGLE (Li et al., 2024): Draft at the feature (hidden state) level rather than token level, achieving higher acceptance rates
  • Self-speculative decoding: Use early exit from intermediate layers as the draft

How Speculative Decoding Works

The Algorithm

Given: Target model $p$ (large, slow), Draft model $q$ (small, fast), draft length $k$

One speculative decoding step:

  1. Draft phase: Run the draft model autoregressively to generate $k$ candidate tokens $(\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_k)$, one at a time. Cost: $k$ small model forward passes.

  2. Verification phase: Run the target model on the full sequence (prefix + $k$ drafted tokens) in ONE forward pass. This produces probability distributions $p(\cdot | \text{prefix}, \tilde{x}_1, ..., \tilde{x}_{i-1})$ for each position $i = 1, ..., k+1$. Cost: 1 large model forward pass.

  3. Accept/reject each drafted token, left to right, with the following criterion. For each token $\tilde{x}_i$ at position $i$:

     • Compute acceptance probability: $\alpha_i = \min\left(1, \frac{p(\tilde{x}_i | \text{context})}{q(\tilde{x}_i | \text{context})}\right)$
     • Sample $u_i \sim \text{Uniform}(0, 1)$
     • If $u_i \leq \alpha_i$: accept $\tilde{x}_i$ and continue to position $i+1$
     • If $u_i > \alpha_i$: reject $\tilde{x}_i$, sample a correction token from an adjusted distribution, and stop

  4. Correction token: When a token is rejected at position $i$, sample from: $p'(x) = \text{normalize}(\max(0, p(x | \text{context}) - q(x | \text{context})))$. If all $k$ drafted tokens are accepted, sample one extra "bonus" token from the target's distribution at position $k+1$ instead.

This corrects for the draft model's error while maintaining the target distribution.
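The accept/reject rule for a single position can be tried on a toy vocabulary. The following is a minimal sketch using only the standard library; `speculative_sample_one` and the two example distributions are hypothetical, chosen purely for illustration.

```python
import random
from collections import Counter


def speculative_sample_one(p, q, rng):
    """Emit one token for a single position using the draft-then-verify rule.

    p: target distribution (list of probabilities)
    q: draft distribution (list of probabilities, same length)
    Returns (token, was_accepted).
    """
    vocab = list(range(len(p)))
    # Draft phase: the draft model proposes a token sampled from q.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # Accept with probability min(1, p(x) / q(x)).
    if rng.random() <= min(1.0, p[x] / q[x]):
        return x, True
    # Rejected: resample from the residual max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0], False


# Toy example (assumed values): a deliberately mismatched draft.
p = [0.6, 0.3, 0.1]  # target
q = [0.3, 0.5, 0.2]  # draft
rng = random.Random(0)
n = 50_000
counts = Counter(speculative_sample_one(p, q, rng)[0] for _ in range(n))
# The empirical frequencies should be close to p, not q.
print([round(counts[t] / n, 3) for t in range(len(p))])
```

Even though every proposal comes from `q`, the emitted tokens follow `p`; a worse draft only lowers how often the proposal is accepted.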

Why It Is Lossless

The acceptance probability $\alpha = \min(1, p/q)$ and the correction distribution are designed so that the marginal distribution of each emitted token equals the target model's distribution $p$. This is a variant of rejection sampling applied at each token position.

Proof sketch: At each position $i$, the probability of token $x$ being output is the sum of two terms:

  • Probability accepted as draft: $q(x) \times \min(1, p(x)/q(x)) = \min(q(x), p(x))$
  • Probability from correction: $P(\text{rejected}) \times p'(x)$

The rejection probability is $P(\text{rejected}) = 1 - \sum_y \min(q(y), p(y)) = \sum_y \max(0, p(y) - q(y))$, which is exactly the normalizing constant of $p'$. The correction term therefore simplifies to $\max(0, p(x) - q(x))$, and $\min(q(x), p(x)) + \max(0, p(x) - q(x)) = p(x)$ for all $x$. The output distribution is exactly $p$, regardless of the draft model's quality. ✓
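The proof sketch can be checked numerically with exact arithmetic. The snippet below is a small illustration, not a proof; `output_distribution` is a hypothetical helper that adds the "accepted" and "correction" paths for every token.

```python
from fractions import Fraction


def output_distribution(p, q):
    """Exact distribution of the token emitted at one position.

    Adds the two paths from the proof sketch: the draft token is accepted,
    or it is rejected and a correction token is sampled from the residual.
    """
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    p_reject = 1 - sum(accept_mass)
    residual = [max(Fraction(0), pi - qi) for pi, qi in zip(p, q)]
    total_residual = sum(residual)
    if total_residual == 0:
        # p == q: the draft is never rejected, so there is no correction path.
        return accept_mass
    return [
        a + p_reject * r / total_residual
        for a, r in zip(accept_mass, residual)
    ]


# Assumed toy distributions; Fractions avoid floating-point rounding.
p = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]
q = [Fraction(3, 10), Fraction(5, 10), Fraction(2, 10)]
print(output_distribution(p, q) == p)  # True
```

Swapping in any other pair of distributions over the same vocabulary should give the same result, which is the sense in which the method does not depend on draft quality.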


Visualizing the Algorithm


Expected Speedup Math

Let $\alpha$ be the average acceptance rate per drafted token (probability a draft token is accepted), and assume for simplicity that acceptances are independent across positions. With draft length $k$:

Expected tokens produced per speculative step: $E[\text{tokens}] = 1 + \sum_{i=1}^{k} \alpha^i = \frac{1 - \alpha^{k+1}}{1 - \alpha}$

The sum counts the accepted draft tokens (draft $i$ is kept only if all drafts up to and including it are accepted). The "+1" accounts for the one token the target always contributes: the correction token after a rejection, or the bonus token on full acceptance.

Speedup ratio (assuming each draft forward pass costs a fraction $c$ of a target forward pass):

$\text{speedup} = \frac{\text{expected tokens out}}{\text{compute cost}} = \frac{E[\text{tokens}]}{k \cdot c + 1}$

For typical values ($\alpha = 0.8$, $k = 5$, $c = 0.1$ for a 7B drafting for 70B):

$E[\text{tokens}] = \frac{1 - 0.8^6}{1 - 0.8} \approx 3.69$

$\text{speedup} = \frac{3.69}{5 \times 0.1 + 1} = \frac{3.69}{1.5} \approx 2.46\times$

This matches empirical results: speculative decoding with well-matched models typically achieves 2–3× speedup.

def compute_expected_speedup(
    alpha: float,  # Acceptance rate per token
    k: int,  # Draft length
    c: float,  # Draft model cost as fraction of target
) -> dict:
    """
    Compute theoretical speedup for speculative decoding.
    """
    # Expected number of accepted draft tokens
    expected_accepted = sum(alpha ** i for i in range(1, k + 1))
    # +1 for the correction token (or the bonus token on full acceptance)
    total_tokens_per_step = expected_accepted + 1

    # Compute cost per step
    # k small forward passes + 1 large forward pass
    cost_per_step = k * c + 1  # In units of target model forward passes

    speedup = total_tokens_per_step / cost_per_step

    return {
        "alpha": alpha,
        "k": k,
        "c": c,
        "expected_accepted_tokens": round(expected_accepted, 2),
        "total_tokens_per_step": round(total_tokens_per_step, 2),
        "cost_per_step_target_equiv": round(cost_per_step, 2),
        "expected_speedup": round(speedup, 2),
    }


# Sensitivity analysis
print("Speculative Decoding Speedup Analysis")
print(f"{'Alpha':>8} {'k':>4} {'c':>6} {'Speedup':>10}")
print("-" * 35)

# Vary alpha with fixed k and c
for alpha in [0.6, 0.7, 0.8, 0.85, 0.9]:
    result = compute_expected_speedup(alpha=alpha, k=5, c=0.1)
    print(f"{alpha:>8.2f} {'5':>4} {'0.10':>6} {result['expected_speedup']:>10.2f}x")

print()
# Vary k with fixed alpha and c
for k in [2, 3, 4, 5, 7, 10]:
    result = compute_expected_speedup(alpha=0.8, k=k, c=0.1)
    print(f"{'0.80':>8} {k:>4} {'0.10':>6} {result['expected_speedup']:>10.2f}x")
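The closed form $(1 - \alpha^{k+1})/(1 - \alpha)$ for tokens per step can be sanity-checked with a short simulation. This is a sketch under the same simplifying assumption as the formula (each draft token is accepted independently with probability $\alpha$); the function names are illustrative.

```python
import random


def simulate_tokens_per_step(alpha, k, n_steps, seed=0):
    """Average tokens emitted per speculative step in a simple simulation.

    Each draft token is accepted independently with probability alpha,
    and the step stops at the first rejection.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_steps):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1
        # +1: the correction token after a rejection, or the bonus token
        # when all k drafts were accepted.
        total += accepted + 1
    return total / n_steps


def closed_form_tokens_per_step(alpha, k):
    """Geometric-series closed form for expected tokens per step."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


sim = simulate_tokens_per_step(alpha=0.8, k=5, n_steps=200_000)
exact = closed_form_tokens_per_step(alpha=0.8, k=5)
print(f"simulated={sim:.3f} closed_form={exact:.3f}")
```

In real workloads acceptances are correlated (hard passages cluster), so measured tokens per step can differ from this idealized number.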

Implementing Speculative Decoding from Scratch

import torch
import torch.nn.functional as F
from typing import Tuple


def _to_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Convert logits to probabilities; temperature 0 means greedy (one-hot on the argmax)."""
    if temperature > 0:
        return F.softmax(logits / temperature, dim=-1)
    return F.one_hot(logits.argmax(dim=-1), num_classes=logits.shape[-1]).float()


def speculative_decode_step(
    target_model,
    draft_model,
    input_ids: torch.Tensor,
    k: int = 5,
    temperature: float = 1.0,
) -> Tuple[torch.Tensor, int, int]:
    """
    One step of speculative decoding: draft k tokens, verify with target.

    Args:
        target_model: Large model (slow, high quality)
        draft_model: Small model (fast, lower quality)
        input_ids: Current sequence [1, seq_len]
        k: Number of tokens to draft
        temperature: Sampling temperature (0 = greedy)

    Returns:
        new_input_ids: Updated sequence with new tokens
        n_accepted: Number of draft tokens accepted
        n_total: Total new tokens added (accepted + 1 correction or bonus)
    """
    device = input_ids.device
    seq_len = input_ids.shape[1]

    # --- Draft phase: generate k tokens with draft model ---
    draft_tokens = []
    draft_dists = []  # Full q(. | context) at each drafted position
    draft_ids = input_ids.clone()

    for _ in range(k):
        with torch.no_grad():
            draft_out = draft_model(draft_ids)
        logits = draft_out.logits[0, -1, :]  # Last position logits
        probs = _to_probs(logits, temperature)
        # With one-hot probs (greedy) this always returns the argmax
        next_token = torch.multinomial(probs, num_samples=1)  # shape [1]

        # Keep the whole draft distribution: the correction step needs q(x)
        # for every x, not only for the sampled token
        draft_tokens.append(next_token.item())
        draft_dists.append(probs)

        # Append to running sequence for next draft step
        draft_ids = torch.cat([draft_ids, next_token.unsqueeze(0)], dim=1)

    # --- Verification phase: single target model forward pass ---
    # Process prefix + all k draft tokens at once: shape [1, seq_len + k]
    with torch.no_grad():
        target_out = target_model(draft_ids)
    # Logits at position j predict token j+1, so positions
    # seq_len-1 .. seq_len+k-1 cover the k draft tokens plus one bonus position
    target_logits = target_out.logits[0, seq_len - 1 : seq_len + k, :]

    # --- Accept/reject each draft token ---
    accepted_tokens = []
    n_accepted = 0

    for i, (draft_token, q_i) in enumerate(zip(draft_tokens, draft_dists)):
        # Target and draft distributions at this position
        p_i = _to_probs(target_logits[i], temperature)
        q_i = q_i.to(p_i.device)

        # Acceptance probability min(1, p(x) / q(x))
        alpha = min(1.0, p_i[draft_token].item() / max(q_i[draft_token].item(), 1e-10))

        # Rejection sampling
        u = torch.rand(1).item()
        if u < alpha:
            # Accept draft token
            accepted_tokens.append(draft_token)
            n_accepted += 1
        else:
            # Reject: sample correction token from adjusted distribution
            # p'(x) = normalize(max(0, target(x) - draft(x)))
            adjusted = torch.clamp(p_i - q_i, min=0.0)

            if adjusted.sum() < 1e-8:
                # Numerical fallback (p and q nearly identical): sample from target
                correction_token = torch.multinomial(p_i, 1).item()
            else:
                adjusted = adjusted / adjusted.sum()
                correction_token = torch.multinomial(adjusted, 1).item()

            accepted_tokens.append(correction_token)
            break

    # If all k tokens accepted, sample one more from the final target position
    if n_accepted == k:
        final_probs = _to_probs(target_logits[k], temperature)
        bonus_token = torch.multinomial(final_probs, 1).item()
        accepted_tokens.append(bonus_token)

    # Build new sequence
    new_tokens = torch.tensor([accepted_tokens], device=device)
    new_input_ids = torch.cat([input_ids, new_tokens], dim=1)

    n_total = len(accepted_tokens)
    return new_input_ids, n_accepted, n_total


def speculative_generate(
    target_model,
    draft_model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 200,
    k: int = 5,
    temperature: float = 0.8,
) -> Tuple[str, dict]:
    """
    Full speculative decoding generation with statistics.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(next(target_model.parameters()).device)

    stats = {
        "total_draft_tokens": 0,
        "total_accepted_tokens": 0,
        "total_steps": 0,
        "total_output_tokens": 0,
    }

    n_generated = 0

    while n_generated < max_new_tokens:
        remaining = max_new_tokens - n_generated
        # Each step emits up to draft_k + 1 tokens, so leave room for the extra one
        draft_k = min(k, remaining - 1)

        input_ids, n_accepted, n_total = speculative_decode_step(
            target_model, draft_model, input_ids,
            k=draft_k, temperature=temperature
        )

        stats["total_draft_tokens"] += draft_k
        stats["total_accepted_tokens"] += n_accepted
        stats["total_steps"] += 1
        stats["total_output_tokens"] += n_total
        n_generated += n_total

        # Check for EOS anywhere in the tokens added this step, not only the last one
        new_tokens = input_ids[0, -n_total:].tolist()
        if tokenizer.eos_token_id in new_tokens:
            n_after_eos = len(new_tokens) - new_tokens.index(tokenizer.eos_token_id) - 1
            if n_after_eos > 0:
                input_ids = input_ids[:, :-n_after_eos]  # Drop tokens after EOS
            break

    # Compute derived stats
    stats["acceptance_rate"] = (
        stats["total_accepted_tokens"] / stats["total_draft_tokens"]
        if stats["total_draft_tokens"] > 0 else 0
    )
    stats["avg_tokens_per_step"] = (
        stats["total_output_tokens"] / stats["total_steps"]
        if stats["total_steps"] > 0 else 0
    )
    # Upper bound vs the 1 token/step baseline: this ignores the draft model's cost
    stats["theoretical_speedup"] = stats["avg_tokens_per_step"]

    output_text = tokenizer.decode(
        input_ids[0, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )

    return output_text, stats
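At temperature 0 both distributions collapse to one-hot vectors, so the acceptance test reduces to an exact match between the draft token and the target's argmax. The sketch below illustrates that special case with plain lists; `greedy_verify` is a hypothetical helper, and the token ids are made up.

```python
from typing import List, Tuple


def greedy_verify(
    draft_tokens: List[int], target_argmax: List[int]
) -> Tuple[List[int], int]:
    """Greedy (temperature 0) verification of one speculative step.

    draft_tokens:  the k tokens proposed by the draft model
    target_argmax: the target model's argmax at each of the k + 1 positions
                   (k draft positions plus one bonus position)
    Returns (tokens_to_emit, n_accepted).
    """
    assert len(target_argmax) == len(draft_tokens) + 1
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_argmax):
        if drafted != wanted:
            # First mismatch: emit the target's token as the correction and stop.
            emitted.append(wanted)
            return emitted, len(emitted) - 1
        emitted.append(drafted)
    # Every draft matched: the last target position yields a bonus token.
    emitted.append(target_argmax[-1])
    return emitted, len(draft_tokens)


# Third draft token disagrees with the target, so two drafts are kept.
print(greedy_verify([5, 9, 2], [5, 9, 7, 4]))  # ([5, 9, 7], 2)
# All three drafts match, so a fourth bonus token is emitted.
print(greedy_verify([5, 9, 2], [5, 9, 2, 4]))  # ([5, 9, 2, 4], 3)
```

The emitted sequence is always what plain greedy decoding with the target would have produced, which makes this case convenient for regression tests.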

Choosing the Right Draft Model

The acceptance rate $\alpha$ is the most important factor determining speedup. It depends on how well the draft model's distribution matches the target.

What Makes a Good Draft Model

  1. Same training distribution: Same pre-training data, same tokenizer
  2. Same model family: LLaMA-3 8B drafting for LLaMA-3 70B - same architectural decisions
  3. Size ratio ~10:1: Empirically, 7B drafting for 70B works well. Very small draft models (1B for 70B) have lower acceptance rates.
  4. Same vocabulary: Must use identical tokenizer - cannot mix vocabularies

Draft Model Options

| Target | Good Draft Models | Expected Alpha |
|---|---|---|
| LLaMA-3 70B | LLaMA-3 8B | 0.75–0.85 |
| Mistral 7B | Mistral 7B early layers (self-spec) | 0.70–0.80 |
| GPT-4 class | GPT-3.5 class (API-based) | 0.65–0.75 |
| Custom fine-tuned 70B | Matching fine-tuned 7B | 0.80–0.90 |

Fine-tuning the target and draft models together on the same distribution improves alignment and acceptance rate.
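For a single context, the acceptance probability follows directly from the two distributions: summing $q(x) \cdot \min(1, p(x)/q(x))$ over the vocabulary gives $\sum_x \min(p(x), q(x))$, which is one minus the total variation distance. The sketch below illustrates this on made-up distributions; the function names and numbers are assumptions for illustration only.

```python
def expected_acceptance(p, q):
    """Probability that a token drafted from q is accepted against target p.

    Equals the sum over the vocabulary of min(p(x), q(x)).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Made-up next-token distributions over a 4-token vocabulary.
target = [0.70, 0.20, 0.05, 0.05]
close_draft = [0.60, 0.25, 0.10, 0.05]  # similar to the target
poor_draft = [0.25, 0.25, 0.25, 0.25]   # uninformed (uniform) draft

for name, draft in [("close", close_draft), ("poor", poor_draft)]:
    acc = expected_acceptance(target, draft)
    tv = total_variation(target, draft)
    print(f"{name}: acceptance={acc:.2f}  1 - TV={1 - tv:.2f}")
```

Averaging this quantity over a sample of real contexts from the intended workload is one way to compare candidate draft models before deploying them.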


Medusa: Multiple Prediction Heads

Medusa (Cai et al., 2024) avoids the need for a separate draft model entirely. Instead, it adds $k$ extra linear heads on top of the target model's final hidden states, each predicting the token at position $t+1, t+2, ..., t+k$:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MedusaHead(nn.Module):
    """
    Additional prediction head for Medusa speculative decoding.
    Each head predicts a future token offset from the current position.
    """

    def __init__(self, hidden_size: int, vocab_size: int, offset: int):
        super().__init__()
        self.offset = offset  # Predicts token at position t + offset
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size, bias=False),
            nn.SiLU(),
            nn.Linear(hidden_size, vocab_size, bias=False)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.head(hidden_states)


class MedusaModel(nn.Module):
    """
    Wrapper that adds Medusa heads to an existing LLM.
    Only the Medusa heads are trained; the base model is frozen.
    """

    def __init__(self, base_model, num_heads: int = 5):
        super().__init__()
        self.base_model = base_model
        hidden_size = base_model.config.hidden_size
        vocab_size = base_model.config.vocab_size

        self.medusa_heads = nn.ModuleList([
            MedusaHead(hidden_size, vocab_size, offset=i + 1)
            for i in range(num_heads)
        ])

    def forward(self, input_ids: torch.Tensor):
        # Run base model, get hidden states
        outputs = self.base_model(
            input_ids,
            output_hidden_states=True
        )
        hidden_states = outputs.hidden_states[-1]  # Last layer

        # Base model logits (for position t)
        base_logits = outputs.logits

        # Medusa heads (for positions t+1, t+2, ...)
        medusa_logits = [head(hidden_states) for head in self.medusa_heads]

        return base_logits, medusa_logits


def medusa_tree_decode(
    model: MedusaModel,
    input_ids: torch.Tensor,
    k: int = 5,
    temperature: float = 1.0,
    top_k_candidates: int = 5
) -> torch.Tensor:
    """
    Simplified Medusa decoding with candidate tree.

    The full Medusa implementation uses a candidate tree where
    each head's top-k predictions create branches - the target
    model verifies all branches simultaneously.
    """
    with torch.no_grad():
        base_logits, medusa_logits = model(input_ids)

    # Get top candidates from base model (for current position)
    base_probs = F.softmax(base_logits[0, -1] / temperature, dim=-1)
    base_top = torch.topk(base_probs, top_k_candidates)

    # Get top candidates from each Medusa head
    candidates = [[token.item() for token in base_top.indices]]
    for head_logits in medusa_logits[:k]:
        head_probs = F.softmax(head_logits[0, -1] / temperature, dim=-1)
        top_tokens = torch.topk(head_probs, top_k_candidates).indices
        candidates.append([t.item() for t in top_tokens])

    # In full Medusa: build tree of all candidate combinations,
    # verify with base model in a single batched forward pass.
    # Here we just return the most likely candidate sequence.
    best_sequence = [c[0] for c in candidates]
    return torch.tensor([best_sequence])

Medusa advantages over standard speculative decoding:

  • No separate draft model required
  • Draft heads trained cheaply (freeze base, train only heads)
  • Heads share the base model's rich representations
  • Can use tree-structured verification to evaluate multiple candidate trees

Medusa disadvantage: The heads predict independently (no autoregressive conditioning on each other's predictions), which limits acceptance rates compared to a full draft model.
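To make the candidate tree concrete, the sketch below enumerates the paths formed by combining each position's top candidates and counts the tree nodes the target would have to score in one pass. It is a simplified illustration with made-up token ids; the helper names are hypothetical, and taking the full Cartesian product is an assumption made here for clarity (it grows quickly, which is why practical trees are pruned).

```python
from itertools import product


def build_candidate_paths(per_position_candidates):
    """Enumerate every root-to-leaf path of the full candidate tree.

    per_position_candidates[i] holds the candidate tokens for offset i
    (index 0 from the base LM head, the rest from the extra heads).
    """
    return [list(path) for path in product(*per_position_candidates)]


def count_tree_nodes(per_position_candidates):
    """Number of distinct tree nodes (shared prefixes are counted once)."""
    nodes, width = 0, 1
    for candidates in per_position_candidates:
        width *= len(candidates)  # Nodes at this depth
        nodes += width
    return nodes


# Made-up token ids: 1 candidate, then 2, then 3.
candidates = [[11], [21, 22], [31, 32, 33]]
paths = build_candidate_paths(candidates)
print(len(paths))                    # 6 paths
print(paths[0], paths[-1])           # [11, 21, 31] [11, 22, 33]
print(count_tree_nodes(candidates))  # 9 nodes: 1 + 2 + 6
```

The node count, not the path count, determines how many positions the verification pass processes, since paths that share a prefix share those nodes.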


EAGLE: Feature-Level Drafting

EAGLE (Li et al., 2024) improves on both standard speculative decoding and Medusa by drafting at the feature (hidden state) level rather than the token level:

Instead of predicting future tokens directly, EAGLE predicts future hidden states using a lightweight autoregressive model. These predicted hidden states are then passed through the target model's LM head to get token predictions.

This approach achieves higher acceptance rates (0.85–0.90 vs 0.75–0.85) because hidden states contain more information than one-hot token predictions, allowing better conditioning for future predictions.

Typical EAGLE speedup: 3–4× vs baseline decode, vs 2–3× for standard speculative decoding.


Self-Speculative Decoding

Self-speculative decoding uses the target model itself as the draft model by exiting early from intermediate layers:

  1. Run the target model's forward pass but exit after layer $m$ (e.g., after 50% of layers)
  2. Use these "shallow" outputs as draft token predictions
  3. Complete the full forward pass to verify (or reject) the shallow predictions

This eliminates the need for a separate draft model entirely. It works because early layer predictions are often correct for easy tokens.

import torch
import torch.nn.functional as F
from typing import Tuple


def self_speculative_step(
    model,
    input_ids: torch.Tensor,
    early_exit_layer: int,
    k: int = 5,
    temperature: float = 1.0
) -> Tuple[torch.Tensor, int]:
    """
    Self-speculative decoding: use early exit as the draft model.

    The target model runs a "shallow" forward pass for drafting,
    then a full forward pass for verification.
    Requires model to support early_exit_layer parameter
    (a custom extension, not a standard forward() argument).
    """
    # Draft: early exit at layer early_exit_layer
    draft_tokens = []
    draft_probs = []
    draft_ids = input_ids.clone()

    for _ in range(k):
        with torch.no_grad():
            # Partial forward pass
            outputs = model(
                draft_ids,
                output_hidden_states=True,
                early_exit_layer=early_exit_layer  # Custom parameter
            )
            # Project the exit layer's hidden state to logits (final norm + lm_head)
            exit_hidden = outputs.hidden_states[early_exit_layer]
            shallow_logits = model.lm_head(model.model.norm(exit_hidden))
        probs = F.softmax(shallow_logits[0, -1] / temperature, dim=-1)
        token = torch.multinomial(probs, 1)
        draft_tokens.append(token.item())
        draft_probs.append(probs[token.item()].item())
        draft_ids = torch.cat([draft_ids, token.unsqueeze(0)], dim=1)

    # Verify: full forward pass
    with torch.no_grad():
        full_outputs = model(draft_ids)  # Full pass
    # Verification proceeds as in standard speculative decoding...

    # Simplified: this sketch returns all k drafts WITHOUT applying the
    # acceptance criterion. A real implementation must accept/reject each
    # draft against full_outputs.logits before returning.
    n_accepted = k
    return draft_ids, n_accepted

Production Considerations

When to Use Speculative Decoding

| Scenario | Recommended | Why |
|---|---|---|
| Interactive chat | Yes | Latency-sensitive; 2–3× speedup directly improves UX |
| Code completion | Yes | High acceptance rate (predictable syntax); 3–4× speedup |
| Creative writing | Yes | Good acceptance rate; user notices speed |
| Batch processing | No | Throughput > latency; continuous batching is better |
| Very short outputs | No | Overhead of draft/verify amortizes poorly |
| Very diverse outputs (T=1.5) | Maybe | Lower acceptance rate reduces speedup |

Infrastructure Requirements

import math


def estimate_speculative_decoding_requirements(
    target_params_b: float,
    draft_params_b: float,
    batch_size: int,
    k: int = 5
) -> dict:
    """
    Estimate memory and throughput for speculative decoding deployment.

    Weights-only estimate: KV cache and activations are not included,
    so batch_size does not change the result.
    """
    # Memory requirements (FP16 = 2 bytes per parameter)
    target_gpu_gb = target_params_b * 2
    draft_gpu_gb = draft_params_b * 2
    total_gpu_gb = target_gpu_gb + draft_gpu_gb  # Both must fit simultaneously

    # Speedup estimate per speculative step:
    #   tokens out = accepted drafts + 1 correction/bonus token
    #   cost       = 1 target pass + k draft passes, with the draft cost
    #                approximated by the parameter ratio
    alpha = 0.8  # Typical acceptance rate
    expected_tokens = sum(alpha ** i for i in range(1, k + 1)) + 1
    speedup = expected_tokens / (1 + k * (draft_params_b / target_params_b))

    return {
        "target_gpu_gb": target_gpu_gb,
        "draft_gpu_gb": draft_gpu_gb,
        "total_gpu_gb": total_gpu_gb,
        "num_a100_80gb_needed": math.ceil(total_gpu_gb / 80),
        "expected_speedup": round(speedup, 2),
        "recommendation": (
            "Fits on single node" if total_gpu_gb <= 640  # 8 x 80 GB
            else "Multi-node required"
        )
    }


# LLaMA-3 70B + LLaMA-3 8B draft
result = estimate_speculative_decoding_requirements(
    target_params_b=70,
    draft_params_b=8,
    batch_size=8,
    k=5
)
print("LLaMA-3 70B + LLaMA-3 8B Draft System:")
for k_name, v in result.items():
    print(f"  {k_name}: {v}")
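The estimate above counts model weights only. Both models also keep a KV cache, which grows with batch size and context length. The sketch below shows the usual per-token formula; the layer counts, KV-head counts, and head dimensions are assumptions taken from the published LLaMA-3 configurations and should be checked against the config of the checkpoint you actually deploy.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored per token (the factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(num_layers, num_kv_heads, head_dim, batch_size, context_len,
                bytes_per_value=2):
    """Total KV cache in GB (1 GB = 1e9 bytes) for a full batch at full context."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_value
    )
    return per_token * batch_size * context_len / 1e9


# Assumed architecture values (grouped-query attention with 8 KV heads).
TARGET_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
DRAFT_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for name, cfg in [("target 70B", TARGET_70B), ("draft 8B", DRAFT_8B)]:
    gb = kv_cache_gb(batch_size=8, context_len=4096, **cfg)
    print(f"{name}: {gb:.2f} GB of FP16 KV cache at batch 8, 4096 tokens")
```

Adding these figures to the weight totals gives a more realistic picture of whether a deployment fits, especially at larger batch sizes or longer contexts.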

vLLM Speculative Decoding Setup

from vllm import LLM, SamplingParams


def setup_vllm_speculative(
    target_model: str = "meta-llama/Meta-Llama-3-70B-Instruct",
    draft_model: str = "meta-llama/Meta-Llama-3-8B-Instruct",
    num_speculative_tokens: int = 5
):
    """
    Configure vLLM with speculative decoding.
    vLLM handles PagedAttention for both target and draft models.

    Note: the keyword arguments below follow older vLLM releases. Newer
    releases take a `speculative_config` dict instead, so check the
    documentation for the version you have installed.
    """
    llm = LLM(
        model=target_model,
        speculative_model=draft_model,
        num_speculative_tokens=num_speculative_tokens,
        tensor_parallel_size=4,  # 4 GPUs for target 70B
        # vLLM automatically handles KV cache for both models
    )

    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=512
    )

    return llm, sampling_params

Common Mistakes

:::danger Expecting linear speedup with draft length k
Increasing $k$ beyond the acceptance rate's natural limit gives diminishing returns. At $\alpha = 0.8$ and $k = 5$, the expected tokens per step are 3.69. At $k = 10$, the figure is only 4.57, less than one extra token, but the draft cost doubles. There is an optimal $k$ for each $\alpha$: as a rule of thumb when the draft is cheap ($c \approx 0.1$), $k^* \approx 1/(1-\alpha)$. For $\alpha = 0.8$, optimal $k \approx 5$. For $\alpha = 0.9$, optimal $k \approx 10$. Don't blindly use large $k$.
:::
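One way to choose the draft length is to sweep $k$ under the simple cost model and take the maximum. The sketch below assumes independent acceptances with rate `alpha` and a draft pass costing a fraction `c` of a target pass; `expected_speedup` and `best_draft_length` are illustrative names.

```python
def expected_speedup(alpha, k, c):
    """Speedup under the simple model: tokens per step / cost per step."""
    tokens_per_step = (1 - alpha ** (k + 1)) / (1 - alpha)
    return tokens_per_step / (k * c + 1)


def best_draft_length(alpha, c, max_k=32):
    """Draft length in 1..max_k that maximizes the modeled speedup."""
    return max(range(1, max_k + 1), key=lambda k: expected_speedup(alpha, k, c))


for alpha in (0.6, 0.8, 0.9):
    k_star = best_draft_length(alpha, c=0.1)
    print(
        f"alpha={alpha}: best k={k_star}, "
        f"speedup={expected_speedup(alpha, k_star, 0.1):.2f}x"
    )
```

With `c = 0.1` the sweep lands close to the $1/(1-\alpha)$ rule of thumb, and the curve is flat near its peak, so being off by one or two tokens costs little. A more expensive draft (larger `c`) pushes the best `k` lower.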

:::danger Using mismatched tokenizers between draft and target
Draft and target models must use the exact same tokenizer with the same vocabulary. Different tokenizers produce different token IDs for the same text - the draft model's token 1234 is not the target model's token 1234. Even minor tokenizer differences (different special tokens, different BPE merges) completely break the acceptance criterion. Always verify that draft and target models share identical tokenizer configurations before deploying.
:::
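A simple pre-deployment check is to compare the two vocabularies token by token. The sketch below works on plain `dict` objects mapping token strings to ids; `vocab_mismatches` is a hypothetical helper. With Hugging Face tokenizers such a mapping is typically available from `tokenizer.get_vocab()`, which is an assumption about your tooling rather than something this check depends on.

```python
def vocab_mismatches(target_vocab, draft_vocab, limit=10):
    """Return up to `limit` (token, target_id, draft_id) triples that disagree.

    Both arguments map token string -> integer id. A token missing from one
    vocabulary is reported with id None on that side.
    """
    mismatches = []
    for token in sorted(set(target_vocab) | set(draft_vocab)):
        t_id = target_vocab.get(token)
        d_id = draft_vocab.get(token)
        if t_id != d_id:
            mismatches.append((token, t_id, d_id))
            if len(mismatches) >= limit:
                break
    return mismatches


# Toy vocabularies (made up for illustration).
target_vocab = {"a": 0, "b": 1, "<eos>": 2}
same_vocab = dict(target_vocab)
shuffled_vocab = {"a": 0, "b": 2, "<eos>": 1}

print(vocab_mismatches(target_vocab, same_vocab))      # []
print(vocab_mismatches(target_vocab, shuffled_vocab))  # [('<eos>', 2, 1), ('b', 1, 2)]
```

An empty result is necessary but not sufficient: normalization rules and merge tables can also differ, so it is still worth encoding a few sample strings with both tokenizers and comparing the ids.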

:::warning Speculative decoding hurts throughput in high-batch scenarios
Speculative decoding reduces per-request latency but can hurt throughput. At high batch sizes, continuous batching fills GPUs efficiently - adding speculative decoding overhead per request can reduce the effective number of requests served per second. Speculative decoding is a latency optimization, not a throughput optimization. Use it for interactive, latency-sensitive workloads. For batch processing pipelines, continuous batching alone (without speculative decoding) is typically better.
:::

:::warning Low acceptance rate with high temperature
Speculative decoding acceptance rate drops at high temperature ($T > 1.0$) because the target model's distribution becomes flatter - the draft model's top predictions are less likely to match. At $T = 1.5$, acceptance rate can drop to 0.5–0.6, severely reducing speedup. Consider reducing draft length $k$ at high temperatures or disabling speculative decoding for very creative tasks where high temperature is required.
:::


Interview Questions

Q1: Explain speculative decoding. Why is it lossless?

Speculative decoding uses a small draft model to generate $k$ candidate tokens, then uses the large target model to verify all $k$ in a single parallel forward pass. Each candidate is accepted with probability $\min(1, p_{\text{target}} / p_{\text{draft}})$ - rejection sampling. If rejected, a correction token is sampled from $\text{normalize}(\max(0, p_{\text{target}} - p_{\text{draft}}))$. The acceptance criterion guarantees that the marginal distribution of each output token exactly matches the target model's distribution - it is exactly equivalent to sampling from the target model directly. This mathematical guarantee means you can use speculative decoding in production without any quality regression.

Q2: What is the expected speedup formula and what drives acceptance rate?

Expected speedup $\approx E[\text{tokens per step}] / \text{cost per step}$, where $E[\text{tokens per step}] = 1 + \sum_{i=1}^{k} \alpha^i$ (accepted drafts plus one correction or bonus token) and the cost is $k$ draft passes plus 1 target pass, i.e. $k \cdot c + 1$ in units of target passes. The key variable is $\alpha$, the per-token acceptance rate. High $\alpha$ comes from: (1) draft and target model trained on the same distribution, (2) similar model family and architecture, (3) appropriate size ratio (~10:1 is common), and (4) lower sampling temperature (greedy-leaning distributions have higher acceptance rates). For typical code generation (deterministic syntax), $\alpha$ can reach 0.85–0.90 and speedup is 3–4×.

Q3: What is the difference between standard speculative decoding and Medusa?

Standard speculative decoding uses a separate draft model - a distinct model loaded in parallel. Medusa adds multiple lightweight prediction heads to the target model itself. Each head predicts a future token (head $i$ predicts token at position $t+i$). The heads share the target model's rich hidden states, so they can be smaller than a full draft model. Medusa has lower hardware requirements (no second model) but typically lower acceptance rates than a well-matched separate draft model because the heads predict future tokens independently without autoregressive conditioning on each other.

Q4: When should you NOT use speculative decoding in production?

(1) Batch processing workloads where throughput matters more than latency - continuous batching at high batch sizes is more efficient. (2) Very short outputs (under 20 tokens) - the setup overhead amortizes poorly. (3) Very high temperature ($T > 1.3$) creative tasks - low acceptance rates reduce speedup below break-even. (4) When you lack a well-matched draft model of the same family - mismatched models have low acceptance rates. (5) When GPU memory is tight - you need both target and draft models in memory simultaneously.

Q5: How does self-speculative decoding work and what are its trade-offs vs using a separate draft model?

Self-speculative decoding runs the target model with early exit - stopping after layer $m$ (e.g., 40% of total layers) to generate draft tokens, then completing the full forward pass for verification. Advantages: no extra memory for a separate model, no tokenizer mismatch issues, simpler infrastructure. Disadvantages: lower acceptance rates than a well-matched separate draft model (early-exit predictions are less accurate), and the early-exit forward pass still costs a fraction $m/L$ of the full forward pass (with $L$ the total number of layers), so the effective draft cost is not negligible. Typically provides 1.5–2× speedup vs 2–3× for separate draft models.

Q6: How does vLLM integrate speculative decoding with PagedAttention?

vLLM manages separate PagedAttention KV caches for both the draft and target models. During the draft phase, tokens are generated autoregressively using the draft model's KV cache. During verification, the target model processes the full speculative sequence (prefix + k draft tokens) in one forward pass, using its own KV cache. When draft tokens are rejected, vLLM rewinds the target model's KV cache to the last accepted position and discards the rejected tokens' cached K/V values. The BlockManager allocates blocks for both draft and target models' KV caches from the same GPU memory pool, managed with the same reference-counting and copy-on-write mechanisms as regular continuous batching.


© 2026 EngineersOfAI. All rights reserved.