
Continuous Batching

The Production Scenario

It is 2:00 PM on a Tuesday and your LLM serving system is drowning. You have 200 concurrent users firing requests at your API. Some are writing one-line emails - "Reply to this message saying I'll be 10 minutes late." Others are asking the model to summarize a 10,000-word research paper. Both requests land in the same batch.

You watch GPU utilization on your monitoring dashboard. It spikes to 98% for a few seconds, then drops to 30%, then 98%, then 40%. The pattern is erratic. You are generating roughly 800 tokens per second but your hardware should be capable of 3,000. Your p99 latency is 45 seconds even though your median latency is 4 seconds. Users filing support tickets all have one thing in common: they sent the kind of short request that should have finished in two seconds, but they waited 45 seconds anyway.

The culprit is static batching. When your serving system groups requests together, it treats the batch as an atomic unit. The batch starts together. The batch ends together. Every request in a batch must wait for the longest sequence to finish before the GPU can pick up the next group of waiting requests. Your short email-reply request finished generating 400ms after it started, but it is trapped waiting next to a summarization request still churning through token 3,000 of its 4,000-token output. The GPU slot that could be serving new users is held hostage by a request that has already finished.

Multiply this across 200 concurrent users with wildly varying output lengths, and you have a disaster. The users with short tasks - who should be your happiest customers - have the worst experience. Your throughput number looks reasonable in aggregate, but your tail latency is brutal and your GPU utilization is full of holes.

Continuous batching, introduced in the Orca paper (Yu et al., 2022), solves this by making a scheduling decision at every single decode step. When a sequence finishes, its GPU slot is freed immediately and assigned to a new waiting request. The batch is not static - it is continuously reshaped as sequences complete. The result is GPU utilization that stays near 100% and p99 latencies that collapse toward p50.


Why This Exists: The Static Batching Problem

How Static Batching Works

To understand why continuous batching matters, you need to internalize exactly why static batching wastes compute.

In static batching, the serving system:

  1. Collects a group of incoming requests into a batch
  2. Runs a prefill pass to process all the prompts in parallel (this is efficient per token - every prompt token goes through a single forward pass)
  3. Runs the decode loop: one step per token, generating one new token per sequence per step
  4. When a sequence hits its EOS token or max length, it is "done" - but its tensor slot stays occupied
  5. The batch ends when every sequence is done
  6. Only then does the system pick up the next batch from the queue

Step 4 is the catastrophe. A sequence that finishes at step 50 occupies a tensor slot until the last sequence finishes at step 2,000. For 1,950 steps, you are doing useless work: computing attention over padding tokens, wasting memory bandwidth, blocking a GPU slot that could be generating useful tokens for a new user.

The Length Distribution Problem

Real user requests have a highly skewed length distribution. In a typical production chat application:

| Request type | Typical output length |
| --- | --- |
| Quick factual answer | 10–50 tokens |
| Code snippet | 100–300 tokens |
| Email draft | 200–500 tokens |
| Document summary | 500–1000 tokens |
| Long-form analysis | 1000–4000 tokens |

If you batch 8 requests together and one of them is a long-form analysis, all 8 slots are held until that one finishes. The 7 short requests could have freed their slots at step 50, allowed 7 new requests to start, finished at step 100, freed their slots again, allowed 7 more new requests to start - and so on. Static batching throws away all of that cascading opportunity.
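The waste in that example can be put into numbers. The sketch below is a back-of-the-envelope calculation, not a measurement: `static_batch_utilization` is a hypothetical helper, and the lengths (seven 50-token requests plus one 2,000-token request) are illustrative values chosen to match the scenario above.

```python
from typing import List


def static_batch_utilization(output_lengths: List[int]) -> float:
    """Fraction of slot-steps that produce a useful token in one static batch."""
    batch_steps = max(output_lengths)              # batch runs until the longest sequence ends
    useful = sum(output_lengths)                   # each sequence is useful only for its own length
    capacity = batch_steps * len(output_lengths)   # slots x steps held by the batch
    return useful / capacity


# Illustrative batch: 7 short requests and 1 long one
lengths = [50] * 7 + [2000]
print(f"{static_batch_utilization(lengths):.1%}")  # prints 14.7%
```

Under these assumed lengths, roughly 85% of the slot-steps in the batch generate nothing useful.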

Static Batching - GPU slot utilization over time

Slot 1: [====REQUEST-A=======================][ REQUEST-E ]
Slot 2: [===REQUEST-B================        ][ REQUEST-F ]
Slot 3: [=REQUEST-C=====                     ][ REQUEST-G ]
Slot 4: [====REQUEST-D=======================][ REQUEST-H ]
        ^                                    ^
        Batch starts                         All slots free,
                                             new batch can start

Between the ^ marks, slots 2 and 3 sit idle once their requests finish: wasted GPU.

Continuous Batching - GPU slot utilization over time

Slot 1: [====REQUEST-A=======================REQUEST-E====...
Slot 2: [===REQUEST-B================REQUEST-F=======REQUEST-J..
Slot 3: [=REQUEST-C=====REQUEST-G====REQUEST-K=====REQUEST-N..
Slot 4: [====REQUEST-D========================REQUEST-H====..

Requests finish at different times. Each slot is freed and refilled immediately, so new requests start without waiting for the rest of the batch.

Continuous batching keeps every slot busy at every step.


Historical Context

The key insight in the Orca paper (Yu et al., 2022) was to observe that the decode loop is not monolithic. Each decode step - generating one token per in-flight sequence - is an independent operation. There is no technical reason why the set of sequences processed at step $t$ must be the same set processed at step $t+1$.

Before Orca, the dominant serving frameworks (Triton Inference Server with TensorRT, FasterTransformer) treated a batch as a fixed unit throughout its lifetime. This came from computer vision and NLP classification inference, where all inputs in a batch have the same shape and all complete at the same time. Language model generation is fundamentally different: output lengths are unknown in advance and highly variable.

Orca proposed iteration-level scheduling: at every decode step, the scheduler examines the set of active sequences, removes any that just finished (hit EOS), and inserts new sequences from the waiting queue to fill the freed slots. The batch changes shape dynamically, but the GPU kernel still runs over a batch - it just operates on a different set of sequences at each step.

The Orca paper reported up to 36.9× higher throughput than FasterTransformer at the same latency level for certain workloads. This number is dramatic, but the baseline was particularly bad at handling variable-length outputs. Production gains in practice are typically smaller, in the range of 5–15× over naive static batching, which is still transformative.

vLLM (2023) combined continuous batching with PagedAttention to handle the KV cache memory fragmentation problem that arises when sequences have different lengths and are being inserted and removed dynamically. We cover PagedAttention in the KV Cache lesson - this lesson focuses on the scheduling mechanics.


How Continuous Batching Works

The Scheduling Loop

The core scheduler runs at every decode iteration:

While requests in queue or active:
    1. Check all active sequences for EOS / max-length completion
    2. Remove completed sequences, mark their KV cache slots as free
    3. From the waiting queue, pull new requests that fit in available memory
    4. For each new request: run its prefill to populate KV cache
    5. Run one decode step over the updated set of active sequences
    6. Return new tokens to clients for streaming
    7. Repeat

The key decision point is step 3: which waiting requests to admit and when. This is the scheduling policy, and different policies optimize for different objectives (throughput vs fairness vs latency).

Prefill vs Decode: The Phase Difference

Continuous batching introduces a critical complexity: prefill and decode are not the same operation, and mixing them in a single batch creates tension.

Prefill (processing a prompt):

  • Processes all input tokens in one forward pass using full attention
  • Compute-bound: you are doing $O(n^2)$ attention work for an $n$-token prompt
  • Generates the first output token as a side effect
  • GPU utilization is high, memory bandwidth is not the bottleneck

Decode (generating each new token):

  • Processes exactly one new token per sequence, attends to all KV cache history
  • Memory-bandwidth-bound: reading the KV cache dominates the cost
  • GPU compute is severely underutilized per step
  • Throughput is measured in tokens/second, latency in milliseconds per step
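A rough way to see the compute-bound versus memory-bound split is to estimate arithmetic intensity (FLOPs per byte moved) for a single weight matrix. The sketch below is a simplified model under stated assumptions: one square `d_model x d_model` linear layer, 16-bit weights and activations, and no attention or KV cache traffic. The function name and the sizes used are hypothetical.

```python
def linear_arithmetic_intensity(
    num_tokens: int,
    d_model: int,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 weights and activations
) -> float:
    """FLOPs per byte moved for one d_model x d_model linear layer."""
    flops = 2 * num_tokens * d_model * d_model                # one multiply-add per weight per token
    weight_bytes = bytes_per_value * d_model * d_model        # weights are read once per pass
    act_bytes = bytes_per_value * 2 * num_tokens * d_model    # inputs read + outputs written
    return flops / (weight_bytes + act_bytes)


d_model = 4096  # illustrative hidden size

# Prefill: every prompt token shares one read of the weights
print("prefill, 2048 tokens:", linear_arithmetic_intensity(2048, d_model))

# Decode: one token per sequence, so intensity is roughly the batch size
for batch in (1, 8, 32):
    print(f"decode, batch {batch}:", round(linear_arithmetic_intensity(batch, d_model), 2))
```

In this model, decode intensity stays close to the number of sequences in the batch, while prefill intensity is in the hundreds or thousands. GPUs generally need far more than a handful of FLOPs per byte to keep their compute units busy, which is why batching more sequences into each decode step raises hardware efficiency.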

When a new request arrives and needs prefill, you have a choice:

  1. Pause decoding of existing sequences, run the prefill, then resume decoding
  2. Run prefill as part of the decode step for the new sequence

Option 1 stalls in-flight sequences (a latency spike for existing users). Option 2 mixes two different computational profiles in one batch, which can reduce efficiency but keeps latency lower.

Chunked Prefill

Chunked prefill (proposed in Sarathi, 2023, and developed further in Sarathi-Serve, 2024) provides a better solution: break a long prompt into chunks and prefill one chunk per decode iteration.

Instead of running a 2,000-token prefill all at once (which pauses decoding for hundreds of milliseconds), you process 256 tokens of the new prompt per decode step while continuing to generate tokens for existing requests. This caps the added latency for existing sequences at the cost of processing one chunk per step, instead of a stall for the full prompt.

Without chunked prefill:
  Step 1: Decode [seq1, seq2, seq3]
  Step 2: PREFILL new request - 2000 tokens - seq1/seq2/seq3 wait
  Step 3: Decode [seq1, seq2, seq3, new_req]

With chunked prefill (chunk_size=256):
  Step 1: Decode [seq1, seq2, seq3] + prefill chunk 0-255 of new_req
  Step 2: Decode [seq1, seq2, seq3] + prefill chunk 256-511 of new_req
  ...
  Step 8: Decode [seq1, seq2, seq3] + prefill chunk 1792-1999 of new_req (final, shorter chunk)
  Step 9: Decode [seq1, seq2, seq3, new_req] - new_req now ready to decode

Chunked prefill trades slightly higher time-to-first-token (TTFT) for dramatically better time-between-tokens (TBT) for existing requests.
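The chunk boundaries in the schedule above follow from simple arithmetic. This is a minimal sketch of that splitting step only; `plan_prefill_chunks` is a hypothetical helper, not the API of any serving framework, and a real scheduler would also budget each chunk against the decode tokens in the same step.

```python
from typing import List, Tuple


def plan_prefill_chunks(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt into half-open [start, end) token ranges, one per decode step."""
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


chunks = plan_prefill_chunks(prompt_len=2000, chunk_size=256)
for step, (start, end) in enumerate(chunks, start=1):
    # end is exclusive, so the last token index in the chunk is end - 1
    print(f"Step {step}: prefill tokens {start}-{end - 1}")
```

For a 2,000-token prompt and 256-token chunks this yields 8 chunks, the last one covering only 208 tokens.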


Preemption: Handling Memory Pressure

One challenge with continuous batching is memory management. As you admit more sequences and their KV caches grow, you can run out of GPU memory. This creates a dilemma: you already made a promise to start those requests, but you cannot continue them without memory.

Preemption is the mechanism for handling this gracefully. When memory is under pressure, the scheduler can:

  1. Swap to CPU: Move the KV cache for lower-priority sequences to CPU DRAM over PCIe (roughly 16 GB/s for PCIe 3.0 x16, far slower than GPU memory bandwidth)
  2. Recompute: Evict the KV cache entirely and regenerate it when the sequence resumes (wastes compute, but avoids slow CPU-GPU transfer for short sequences)
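To reason about the two options, it helps to estimate how much KV cache a sequence holds and how long each path would take. The sketch below is a rough calculator, not a benchmark. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumed configuration for an 8B-class model with grouped-query attention, and both the link bandwidth and the prefill throughput are placeholder inputs you would replace with measured values.

```python
def kv_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # assumption: fp16/bf16 KV cache
) -> int:
    """Bytes of KV cache stored per token (factor 2 covers keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def swap_seconds(num_tokens: int, bytes_per_token: int, link_bytes_per_s: float) -> float:
    """One-way transfer time at an assumed sustained link bandwidth."""
    return num_tokens * bytes_per_token / link_bytes_per_s


def recompute_seconds(num_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the KV cache by re-running prefill at an assumed throughput."""
    return num_tokens / prefill_tokens_per_s


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

tokens = 4000
# Both rates below are placeholders, not measurements.
print(f"swap out (one way): {swap_seconds(tokens, per_token, 16e9) * 1e3:.1f} ms")
print(f"recompute:          {recompute_seconds(tokens, 10_000) * 1e3:.1f} ms")
```

Which option is cheaper depends entirely on the measured inputs. Swapping also has to move the cache back in before the sequence can resume, and transfers made up of many small blocks typically achieve less than the peak link bandwidth.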

Preemption priority policies:

  • FCFS (First Come First Served): Preempt the most recently admitted sequence - preserve older, further-along sequences
  • LJF (Longest Job First): Preempt the sequence with the most remaining tokens to generate - maximize immediate throughput
  • Priority queues: Admin-configured priorities per user or request type

from dataclasses import dataclass, field
from typing import List, Optional
from enum import Enum
import heapq


class SequenceStatus(Enum):
    WAITING = "waiting"
    PREFILLING = "prefilling"
    DECODING = "decoding"
    PREEMPTED = "preempted"
    FINISHED = "finished"


@dataclass
class Sequence:
    """Represents one in-flight LLM request."""
    seq_id: int
    prompt_tokens: List[int]
    output_tokens: List[int] = field(default_factory=list)
    status: SequenceStatus = SequenceStatus.WAITING
    priority: int = 0
    kv_cache_blocks: List[int] = field(default_factory=list)

    @property
    def num_tokens(self) -> int:
        return len(self.prompt_tokens) + len(self.output_tokens)

    @property
    def is_finished(self) -> bool:
        return self.status == SequenceStatus.FINISHED

    def __lt__(self, other):
        # Higher priority number = higher priority.
        # Ties fall back to seq_id so equal-priority requests stay FCFS.
        if self.priority != other.priority:
            return self.priority > other.priority
        return self.seq_id < other.seq_id


class ContinuousBatchingScheduler:
    """
    Simplified continuous batching scheduler.

    Real schedulers (vLLM, TGI) are significantly more complex,
    handling PagedAttention block allocation, chunked prefill,
    tensor parallel coordination, and multi-GPU topologies.
    """

    def __init__(
        self,
        max_batch_size: int = 32,
        max_tokens_in_batch: int = 4096,
        preemption_mode: str = "recompute",
    ):
        self.max_batch_size = max_batch_size
        self.max_tokens_in_batch = max_tokens_in_batch
        self.preemption_mode = preemption_mode

        self.waiting: List[Sequence] = []    # Not yet started (heap ordered by Sequence.__lt__)
        self.running: List[Sequence] = []    # Currently being decoded
        self.preempted: List[Sequence] = []  # Swapped out, will resume

        self._step = 0

    def add_request(self, seq: Sequence) -> None:
        """Add a new request to the waiting queue."""
        heapq.heappush(self.waiting, seq)

    def _count_running_tokens(self) -> int:
        """Total tokens across all running sequences (prompt + output so far)."""
        return sum(s.num_tokens for s in self.running)

    def _admit_new_sequences(self) -> List[Sequence]:
        """Pull sequences from waiting queue into running set."""
        newly_admitted = []
        # Track the budget locally so sequences admitted in this call count too
        tokens_in_batch = self._count_running_tokens()

        while self.waiting:
            if len(self.running) + len(newly_admitted) >= self.max_batch_size:
                break

            candidate = self.waiting[0]
            if tokens_in_batch + candidate.num_tokens > self.max_tokens_in_batch:
                break

            heapq.heappop(self.waiting)
            candidate.status = SequenceStatus.PREFILLING
            newly_admitted.append(candidate)
            tokens_in_batch += candidate.num_tokens

        return newly_admitted

    def _preempt_if_needed(self) -> int:
        """
        Preempt lowest-priority running sequences while memory is tight.
        Returns number of sequences preempted.
        """
        # In a real system: check actual GPU memory usage
        # Here: simplified check on token count
        n_preempted = 0

        # Keep at least one sequence running so the scheduler always makes progress
        while (
            len(self.running) > 1
            and self._count_running_tokens() > self.max_tokens_in_batch
        ):
            # Lowest priority first; among equals, the most recently admitted
            victim = min(reversed(self.running), key=lambda s: s.priority)
            self.running.remove(victim)

            if self.preemption_mode == "swap":
                victim.status = SequenceStatus.PREEMPTED
                self.preempted.append(victim)
            else:  # recompute
                # Drop the KV cache but keep the generated tokens: on resume,
                # prompt + output so far are prefilled again to rebuild the cache
                victim.kv_cache_blocks.clear()
                victim.status = SequenceStatus.WAITING
                heapq.heappush(self.waiting, victim)

            n_preempted += 1

        return n_preempted

    def schedule_step(self) -> dict:
        """
        Run one scheduling iteration.

        Returns info about what happened this step.
        """
        self._step += 1

        # 1. Retire sequences that finished on the previous step
        completed = [s for s in self.running if s.is_finished]
        for seq in completed:
            self.running.remove(seq)

        # 2. Resume preempted sequences while slots and token budget allow
        resumed = []
        for seq in list(self.preempted):
            if len(self.running) >= self.max_batch_size:
                break
            if self._count_running_tokens() + seq.num_tokens > self.max_tokens_in_batch:
                break
            self.preempted.remove(seq)
            seq.status = SequenceStatus.DECODING
            self.running.append(seq)
            resumed.append(seq)

        # 3. Admit new sequences from waiting queue
        newly_admitted = self._admit_new_sequences()

        # 4. Add newly admitted to running set
        #    (a real system runs prefill here; this sketch goes straight to decoding)
        for seq in newly_admitted:
            self.running.append(seq)
            seq.status = SequenceStatus.DECODING

        # 5. Check if we need to preempt anything
        n_preempted = self._preempt_if_needed()

        # 6. Run one decode step
        # (In reality: GPU kernel runs forward pass here)
        for seq in self.running:
            seq.output_tokens.append(0)  # placeholder for generated token
            # Check if EOS would be generated (simplified)
            if len(seq.output_tokens) >= 100:  # max output length
                seq.status = SequenceStatus.FINISHED

        return {
            "step": self._step,
            "running": len(self.running),
            "waiting": len(self.waiting),
            "completed_this_step": len(completed),
            "admitted_this_step": len(newly_admitted),
            "resumed_this_step": len(resumed),
            "preempted_this_step": n_preempted,
        }

Simulating a Continuous Batching Scheduler

import random
import statistics
from typing import Dict, List

random.seed(42)


def simulate_static_batching(
    requests: List[Dict],
    batch_size: int = 8,
) -> Dict:
    """
    Simulate static batching: process requests in fixed-size batches.
    Each request in a batch waits for the longest one to finish.
    """
    total_steps = 0
    latencies = []
    batch_utilization = []

    for i in range(0, len(requests), batch_size):
        batch = requests[i : i + batch_size]
        max_output_len = max(r["output_tokens"] for r in batch)

        # Each slot is either active (generating) or idle (waiting for longest)
        for step in range(max_output_len):
            active_in_step = sum(
                1 for r in batch if step < r["output_tokens"]
            )
            batch_utilization.append(active_in_step / batch_size)
            total_steps += 1

        # Latency for each request: wait until batch finishes
        # (measured from batch start; queue wait is excluded in both simulations)
        for _ in batch:
            latencies.append(max_output_len)

    return {
        "total_steps": total_steps,
        "throughput_rps": len(requests) / total_steps,
        "mean_latency": statistics.mean(latencies),
        "p99_latency": sorted(latencies)[int(len(latencies) * 0.99)],
        "mean_utilization": statistics.mean(batch_utilization),
    }


def simulate_continuous_batching(
    requests: List[Dict],
    max_batch_size: int = 8,
) -> Dict:
    """
    Simulate continuous batching: replace finished sequences immediately.
    """
    # Requests are served in arrival order
    queue = list(requests)
    active = []
    latencies = []
    step = 0
    batch_utilization = []
    requests_completed = 0

    # Pre-populate the active set
    while queue and len(active) < max_batch_size:
        req = dict(queue.pop(0))
        req["tokens_generated"] = 0
        req["start_step"] = step
        active.append(req)

    while active or queue:
        # Decode step: generate one token per active sequence
        finished_this_step = []
        for req in active:
            req["tokens_generated"] += 1
            if req["tokens_generated"] >= req["output_tokens"]:
                finished_this_step.append(req)

        # Record utilization
        batch_utilization.append(len(active) / max_batch_size)

        # Remove finished sequences
        for req in finished_this_step:
            active.remove(req)
            latency = step - req["start_step"] + 1
            latencies.append(latency)
            requests_completed += 1

        # Fill freed slots with new requests
        while queue and len(active) < max_batch_size:
            new_req = dict(queue.pop(0))
            new_req["tokens_generated"] = 0
            # The first token is generated on the next step, so start counting there
            new_req["start_step"] = step + 1
            active.append(new_req)

        step += 1

    return {
        "total_steps": step,
        "throughput_rps": requests_completed / step,
        "mean_latency": statistics.mean(latencies),
        "p99_latency": sorted(latencies)[int(len(latencies) * 0.99)],
        "mean_utilization": statistics.mean(batch_utilization),
    }


# Generate synthetic requests with skewed output lengths
def generate_requests(n: int) -> List[Dict]:
    """Skewed distribution: most requests are short, some are very long."""
    requests = []
    for i in range(n):
        # 70% short (10-100 tokens), 20% medium (100-500), 10% long (500-2000)
        r = random.random()
        if r < 0.7:
            output_len = random.randint(10, 100)
        elif r < 0.9:
            output_len = random.randint(100, 500)
        else:
            output_len = random.randint(500, 2000)
        requests.append({"id": i, "output_tokens": output_len})
    return requests


requests = generate_requests(100)

static_results = simulate_static_batching(requests, batch_size=8)
continuous_results = simulate_continuous_batching(requests, max_batch_size=8)

print("=== Static Batching ===")
print(f" Total steps: {static_results['total_steps']}")
print(f" Throughput: {static_results['throughput_rps']:.3f} req/step")
print(f" Mean latency: {static_results['mean_latency']:.1f} steps")
print(f" P99 latency: {static_results['p99_latency']:.0f} steps")
print(f" Mean utilization: {static_results['mean_utilization']:.1%}")

print()
print("=== Continuous Batching ===")
print(f" Total steps: {continuous_results['total_steps']}")
print(f" Throughput: {continuous_results['throughput_rps']:.3f} req/step")
print(f" Mean latency: {continuous_results['mean_latency']:.1f} steps")
print(f" P99 latency: {continuous_results['p99_latency']:.0f} steps")
print(f" Mean utilization: {continuous_results['mean_utilization']:.1%}")

throughput_ratio = continuous_results["throughput_rps"] / static_results["throughput_rps"]
latency_ratio = static_results["p99_latency"] / continuous_results["p99_latency"]
print(f"\nThroughput improvement: {throughput_ratio:.1f}x")
print(f"P99 latency improvement: {latency_ratio:.1f}x")

Throughput vs Latency Trade-off

Continuous batching improves throughput, but it introduces a fundamental tension with latency.

Higher batch size = higher throughput, higher per-request latency.

When you run 32 sequences simultaneously, each decode step takes longer than running 8 sequences (more attention computation, more memory bandwidth). Each individual request therefore generates tokens slightly more slowly, even though total tokens per second is higher.
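A toy cost model makes the tension concrete. The sketch below assumes each decode step costs a fixed amount (reading the weights) plus a small per-sequence amount (reading that sequence's KV cache). Both constants are hypothetical placeholders rather than measurements, and the linear form is itself a simplification.

```python
def decode_step_ms(batch_size: int, fixed_ms: float = 20.0, per_seq_ms: float = 0.5) -> float:
    """Assumed step time: fixed weight-read cost plus a per-sequence KV-read cost."""
    return fixed_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size: int) -> float:
    """Aggregate throughput: one token per sequence per step."""
    return batch_size / decode_step_ms(batch_size) * 1000.0


for batch_size in (1, 8, 32, 64):
    tbt = decode_step_ms(batch_size)   # each user sees one token per step
    tput = tokens_per_second(batch_size)
    print(f"batch={batch_size:>2}  TBT={tbt:5.1f} ms  throughput={tput:7.1f} tok/s")
```

Under these assumed constants, moving from 8 to 32 sequences raises the per-token delay from 24 ms to 36 ms while aggregate throughput rises from about 333 to about 889 tokens per second: each user streams somewhat slower, and the system as a whole serves far more tokens.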

The right trade-off depends on your application:

| Application | Priority | Recommended batch size |
| --- | --- | --- |
| Interactive chat | Low latency (fast TBT) | Smaller (8–16) |
| Document summarization API | High throughput | Larger (32–64) |
| Batch processing pipeline | Maximum throughput | Maximum GPU memory allows |
| Real-time voice assistant | Ultra-low latency | 1–4 |

Key metrics to optimize:

  • TTFT (Time To First Token): How long until the first token is streamed back - perceived as "responsiveness"
  • TBT (Time Between Tokens): How fast tokens stream after the first - perceived as "speed"
  • Throughput (tokens/sec): Total tokens generated per second - relevant for cost

Continuous batching primarily improves throughput and reduces TTFT for queued requests. TBT may increase slightly with larger batch sizes.


PagedAttention and Continuous Batching Together

Continuous batching introduces a memory management problem: KV cache blocks must be dynamically allocated and freed as sequences start and finish. With static batching, you can pre-allocate KV cache of fixed size for each slot. With continuous batching, you need something more flexible.

PagedAttention (vLLM, 2023) solves this by managing KV cache like virtual memory:

  • KV cache is divided into fixed-size pages (blocks), typically 16 tokens each
  • Each page can be allocated to any sequence
  • A block table maps each sequence's logical KV positions to physical GPU memory pages
  • Pages from completed sequences are immediately reclaimed and given to new sequences

The combination is powerful:

  • Continuous batching keeps GPU compute fully utilized by always having sequences to decode
  • PagedAttention keeps GPU memory efficiently utilized by eliminating external fragmentation and pre-allocation waste

Without PagedAttention (static KV allocation):

GPU memory: [====SEQ-A====|====SEQ-B====|====SEQ-C====|====SEQ-D====]
             2000 tok max  2000 tok max  2000 tok max  2000 tok max

SEQ-A uses 50 tokens, wastes 1950 token slots
SEQ-C finishes at token 100, leaves 1900-token hole

With PagedAttention (paged KV allocation):

GPU memory: [PAGE][PAGE][PAGE][PAGE][PAGE][PAGE][PAGE][PAGE]
            Seq-A Seq-A Seq-B Seq-D Seq-E free  Seq-B free

Pages assigned on demand, freed immediately on completion
No external fragmentation, no wasted reservation; at most one partially filled page per sequence
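The block-table bookkeeping can be sketched in a few lines. This is a toy allocator that tracks page indices only, with no actual tensors. `PagedKVAllocator` and its methods are hypothetical names for illustration and do not reflect vLLM's internal API.

```python
from typing import Dict, List, Tuple


class PagedKVAllocator:
    """Toy page allocator: maps each sequence to a list of physical page indices."""

    def __init__(self, num_pages: int, page_size: int = 16) -> None:
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.block_tables: Dict[int, List[int]] = {}  # seq_id -> physical pages
        self.seq_lens: Dict[int, int] = {}            # seq_id -> tokens stored

    def append_tokens(self, seq_id: int, n: int) -> bool:
        """Reserve room for n more tokens; returns False if memory is exhausted."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n
        pages_total = -(-new_len // self.page_size)   # ceiling division
        pages_needed = pages_total - len(table)
        if pages_needed > len(self.free_pages):
            return False                              # caller would preempt or queue
        for _ in range(pages_needed):
            table.append(self.free_pages.pop())       # any free page will do
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len
        return True

    def free(self, seq_id: int) -> None:
        """Return all pages of a finished sequence to the pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def physical_slot(self, seq_id: int, pos: int) -> Tuple[int, int]:
        """Translate a logical token position to (physical page, offset in page)."""
        return self.block_tables[seq_id][pos // self.page_size], pos % self.page_size


alloc = PagedKVAllocator(num_pages=4, page_size=16)
print(alloc.append_tokens(seq_id=0, n=20))  # needs 2 pages
print(alloc.append_tokens(seq_id=1, n=40))  # needs 3 pages, only 2 are free
alloc.free(0)                               # sequence 0 finishes
print(alloc.append_tokens(seq_id=1, n=40))  # now fits, reusing the freed pages
```

The only per-sequence waste in this scheme is the unused tail of the last page, which is why small page sizes keep memory utilization high.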

Production Engineering Notes

Batch Size Tuning

There is no universal optimal batch size. Profile your specific workload:

# Profile throughput vs latency at different batch sizes
import concurrent.futures
import time
from statistics import mean, quantiles

import requests as http_requests  # third-party: pip install requests


def benchmark_batch_size(
    server_url: str,
    batch_size: int,
    n_requests: int = 200,
) -> dict:
    """
    Send n_requests at specified concurrency and measure throughput + latency.
    Use with vLLM, TGI, or any OpenAI-compatible server.

    Note: batch_size here is the client-side concurrency; the server decides
    how many of those requests actually share a batch.
    """
    prompt = "Summarize the following: " + "word " * 100  # ~100 token prompt
    latencies = []
    start = time.perf_counter()

    def single_request() -> float:
        t0 = time.perf_counter()
        response = http_requests.post(
            f"{server_url}/v1/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "prompt": prompt,
                "max_tokens": 200,
                "temperature": 0.0,
            },
            timeout=120,
        )
        response.raise_for_status()  # do not count error responses as fast successes
        return time.perf_counter() - t0

    with concurrent.futures.ThreadPoolExecutor(max_workers=batch_size) as executor:
        futures = [executor.submit(single_request) for _ in range(n_requests)]
        for f in concurrent.futures.as_completed(futures):
            latencies.append(f.result())

    total_time = time.perf_counter() - start
    qs = quantiles(latencies, n=100)  # 99 cut points: qs[49] is p50, qs[98] is p99

    return {
        "batch_size": batch_size,
        "throughput_rps": n_requests / total_time,
        "mean_latency_s": mean(latencies),
        "p50_latency_s": qs[49],
        "p99_latency_s": qs[98],
    }

Queue Management

In production, manage separate queues for different request types:

  • Priority queue: paid tier vs free tier, or SLA-bound vs best-effort
  • Size-based routing: route requests with long prompts to instances with more memory
  • Timeout-based eviction: drop requests that have waited too long rather than letting queue grow unboundedly

# Priority queue example for multi-tier serving
import heapq
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrioritizedRequest:
    priority: int  # 0 = highest priority
    arrival_time: float
    request_id: str
    prompt: str
    max_tokens: int

    def __lt__(self, other):
        # Tie-break by arrival time (FCFS within same priority tier)
        if self.priority == other.priority:
            return self.arrival_time < other.arrival_time
        return self.priority < other.priority


class PriorityScheduler:
    def __init__(self):
        self._heap = []
        self._counter = 0

    def add(
        self,
        prompt: str,
        max_tokens: int,
        tier: str = "standard",  # "premium" | "standard" | "batch"
    ) -> str:
        priority = {"premium": 0, "standard": 1, "batch": 2}[tier]
        req_id = f"req_{self._counter}"
        self._counter += 1

        req = PrioritizedRequest(
            priority=priority,
            arrival_time=time.time(),
            request_id=req_id,
            prompt=prompt,
            max_tokens=max_tokens,
        )
        heapq.heappush(self._heap, req)
        return req_id

    def next(self) -> Optional[PrioritizedRequest]:
        """Pop the highest-priority (then oldest) request, or None if empty."""
        if self._heap:
            return heapq.heappop(self._heap)
        return None

    @property
    def queue_depth(self) -> int:
        return len(self._heap)

:::danger Common Mistake: Ignoring TTFT Under Load

Many teams optimize for throughput and forget TTFT. Under high load with continuous batching, a new request may sit in the queue for seconds before its prefill slot opens. This is invisible in throughput metrics but brutal in user experience.

Always monitor TTFT separately from TBT and total latency. Set alerts on TTFT p95 and p99 - these will tell you when your queue is backing up before your throughput metrics show anything wrong.

Solution: Set a maximum queue depth and shed load (return 429) rather than letting TTFT grow to 30+ seconds. Inform clients with a Retry-After header.

:::
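A bounded admission queue is one way to implement that load shedding. The sketch below is illustrative only: `AdmissionQueue`, the 202 status for accepted requests, and the way `Retry-After` is estimated from an assumed average service time are all assumptions, not the behavior of any particular server.

```python
import math
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class AdmissionQueue:
    """Bounded FIFO queue that sheds load instead of growing without limit."""

    def __init__(self, max_depth: int, est_service_s: float) -> None:
        self.max_depth = max_depth
        self.est_service_s = est_service_s  # assumed average seconds to drain one request
        self._queue: Deque[str] = deque()

    def try_enqueue(self, request_id: str) -> Tuple[int, Dict[str, str]]:
        """Return (HTTP status, headers): 202 if queued, 429 if shed."""
        if len(self._queue) >= self.max_depth:
            # Rough hint: time to drain the current backlog, at least 1 second
            retry_after = max(1, math.ceil(len(self._queue) * self.est_service_s))
            return 429, {"Retry-After": str(retry_after)}
        self._queue.append(request_id)
        return 202, {}

    def pop(self) -> Optional[str]:
        """Hand the oldest queued request to the scheduler, if any."""
        return self._queue.popleft() if self._queue else None


queue = AdmissionQueue(max_depth=2, est_service_s=0.5)
for rid in ("req_0", "req_1", "req_2"):
    print(rid, queue.try_enqueue(rid))
```

With a depth limit of 2, the third request is rejected with a `Retry-After` hint instead of waiting behind a backlog it cannot see.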

:::warning Chunked Prefill Trade-offs

Chunked prefill is not free. Breaking a 2,000-token prefill into 8 chunks of 256 tokens adds scheduling overhead and increases time-to-first-token compared to running the prefill in one shot (at zero concurrent requests). The benefit only materializes when there are active decoding sequences that would otherwise be paused.

At low concurrency, disable chunked prefill or use large chunk sizes. At high concurrency, smaller chunks improve fairness and p99 latency for existing users.

:::

:::warning Preemption Creates Recompute Cost

When a sequence is preempted and uses recompute mode (KV cache evicted, sequence restarted), the entire prefill cost is paid again. For a 4,000-token prompt, this can be significant.

Monitor preemption rate in production. If it is above ~1%, you are either admitting too many sequences too aggressively or your KV cache budget is too small. Increase GPU memory, reduce max sequence length, or tune the admission threshold.

:::


Interview Questions

Q1: What is the core problem that continuous batching solves, and why didn't it exist in earlier inference systems?

Static batching treats a batch as an atomic unit - all sequences start together and all must finish before the batch completes. This wastes GPU slots when short sequences finish early, forcing them to idle while long sequences continue. The wasted capacity is especially harmful with LLMs because output length variance is extreme (10 tokens to 4,000 tokens in the same workload).

Earlier systems (Triton, FasterTransformer) came from computer vision and classification where all batch members have identical output sizes and complete simultaneously. They had no mechanism to dynamically reshape the batch mid-execution. The Orca paper (2022) reframed LLM serving as an iteration-level scheduling problem rather than a batch-level problem, enabling continuous replacement of completed sequences.

Q2: Explain the difference between prefill and decode phases and why mixing them in continuous batching is non-trivial.

Prefill processes all input tokens in one forward pass using full self-attention - it is compute-bound with $O(n^2)$ attention work. Decode generates one token per step, reading the entire KV cache - it is memory-bandwidth-bound. These two operations have different compute profiles and occupy different amounts of time per token.

When a new request arrives mid-batch, its prefill interrupts ongoing decoding. Chunked prefill addresses this by breaking the prefill into small pieces processed one per decode step, so existing sequences experience only a small per-step overhead rather than a full pause. But this requires careful scheduling to avoid wasting compute or creating head-of-line blocking for the new request's own TTFT.

Q3: How does PagedAttention enable continuous batching to work efficiently at the memory level?

Continuous batching requires dynamically allocating and freeing KV cache as sequences arrive and depart. With static allocation, each sequence reserves a fixed KV cache block (e.g., for 2,048 tokens) regardless of actual usage - this causes internal fragmentation (most sequences use less than max) and external fragmentation (freed blocks may not be contiguous for the next request).

PagedAttention manages KV cache like virtual memory: fixed-size pages (typically 16 tokens) are allocated on demand from a page pool. A block table maps each sequence's logical positions to physical pages. When a sequence finishes, its pages are returned to the pool immediately. The next sequence gets those pages regardless of their physical location. This eliminates external fragmentation, limits internal fragmentation to the last partially filled page of each sequence, and allows the allocator to keep GPU memory utilization near 100%.

Q4: What are the key metrics to monitor for a continuous batching inference server, and what does each tell you?

  • TTFT (Time To First Token): Latency from request arrival to first token streamed back. A rising TTFT indicates queue backup or slow prefill.
  • TBT (Time Between Tokens): Latency between consecutive tokens for a streaming response. Rising TBT indicates batch size is too large for the target latency SLA.
  • Throughput (tokens/second): Total output tokens generated per second across all requests. Primary measure of hardware efficiency.
  • Queue depth: Number of requests waiting for a batch slot. A growing queue means admission rate exceeds completion rate - need to scale or shed load.
  • Preemption rate: Percentage of decode steps that trigger preemption. High rates indicate memory pressure - increase GPU memory or reduce max concurrent sequences.
  • GPU utilization: Should stay above 90% with well-tuned continuous batching. Drops below 70% indicate batch size too small or queue starvation.

Q5: A user complains that their 10-token request took 40 seconds to get a response. Your throughput metrics look fine. What happened and how do you fix it?

With continuous batching, TTFT includes queue wait time. The 40 seconds almost certainly means this short request sat in the queue behind many long requests. Your throughput metric looks fine because the server is busy - it is just busy with long requests that arrived earlier.

Solutions:

  1. Priority queuing: Route short requests (estimated by prompt length or request tier) to a higher-priority queue.
  2. Separate pools: Dedicate some batch slots exclusively to requests estimated to be short - prevents long requests from monopolizing all slots.
  3. Request routing: Run two server instances - one sized for long requests, one for short - and route based on estimated output length.
  4. Admission control: Set a maximum queue wait time. Return 429 after N seconds rather than letting requests wait indefinitely.
  5. Monitoring: Add TTFT histogram alerts. A healthy system should have p99 TTFT within 2× of p50 - if the ratio is 20×, you have a priority problem.
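The tail-ratio check in the last item is easy to compute from raw TTFT samples. This is a minimal sketch using only the standard library; `ttft_tail_ratio` is a hypothetical helper, and the sample data is synthetic.

```python
from statistics import quantiles
from typing import List


def ttft_tail_ratio(ttfts_s: List[float]) -> float:
    """Ratio of p99 to p50 time-to-first-token, from raw samples in seconds."""
    qs = quantiles(ttfts_s, n=100)  # 99 cut points: qs[49] is p50, qs[98] is p99
    return qs[98] / qs[49]


# Synthetic samples: a tight distribution vs. one with a starved request
healthy = [1.0 + 0.005 * i for i in range(100)]
starved = [1.0] * 99 + [40.0]

for name, samples in (("healthy", healthy), ("starved", starved)):
    ratio = ttft_tail_ratio(samples)
    status = "ALERT" if ratio > 2.0 else "ok"  # threshold taken from the 2x guideline above
    print(f"{name}: p99/p50 = {ratio:.1f}x -> {status}")
```

In practice the samples would come from a sliding window of recent requests, so the alert reflects current queue conditions rather than a long-run average.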

Q6: How does continuous batching interact with streaming responses? Does the client receive tokens faster or slower?

With streaming, the client receives each token as it is generated. Continuous batching does not change the token generation rate for a single sequence - each decode step still generates one token per sequence. What changes is the server-side throughput (more total tokens/second across all sequences).

A single user's perceived streaming speed (TBT) may actually be slightly slower with large batch sizes because each decode step is slightly more expensive (more KV cache to read and more tokens to process per step). The benefit of continuous batching accrues to the system's overall capacity, not to an individual request's TBT.

For latency-sensitive applications, tune batch size to keep TBT within SLA. For throughput-maximized batch workloads (offline summarization, data processing), maximize batch size.


© 2026 EngineersOfAI. All rights reserved.