Low-Latency Inference Patterns
The Production Scenario
Google Search has a hard constraint that most engineers outside the company do not appreciate: the entire search pipeline - query understanding, document retrieval, ranking, snippet generation, ad auction - must complete in under 200ms total. The ML ranking model that scores hundreds of documents against the user's query is allocated roughly 50ms of that budget. At 99,000 queries per second (Google's approximate rate), that is about 8.5 billion ranking requests per day (99,000 x 86,400 seconds), each scoring hundreds of documents and each completing in under 50ms.
Meta's News Feed ranking has a tighter constraint. The feed must appear responsive as a user scrolls - that means new content must be ranked before the user reaches it, typically within 30ms of the scroll event triggering a load. The ranking model operates on a batch of 500 candidate posts with tens of features each, and it must score and rank all 500 within that 30ms window.
These are not exceptional cases. Ad systems, fraud detection, recommendation feeds, and autocomplete all operate at latency requirements that feel physically impossible when you first encounter them - but are achieved routinely in production using the specific engineering patterns this lesson covers.
The patterns are a hierarchy: start with what your hardware can do (the ceiling), optimize the model for that hardware (remove waste), optimize the data path (remove serialization and copying), and finally, pre-compute what you can so that inference at serving time is just a lookup.
Why This Exists - The Latency Ceiling of Naive Deployment
A PyTorch model in eager execution mode with FP32 weights, served through FastAPI with JSON serialization, is as far from the latency floor as you can get. Every layer of that stack adds overhead:
- JSON serialization: 0.5-2ms for a moderately-sized feature vector
- Python interpreter overhead: 0.1-0.5ms per request (GIL, object allocation)
- FP32 computation: 2-8x slower than INT8 on modern hardware
- Sequential execution: One operation at a time, not fused
- CUDA kernel launch overhead: 0.01-0.05ms (10-50 microseconds) per GPU kernel (accumulates with many small ops)
A carefully optimized stack removes each of these overheads. The result can be 10-20x faster than the naive implementation - which is the difference between a 40ms latency and a 2ms latency for the same model and the same hardware.
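To make the overhead arithmetic concrete, the sketch below sums per-request costs for a naive and an optimized stack. Every millisecond value is an assumption chosen for illustration (picked from within the ranges listed above where a range exists), and `fp32_compute_ms` and the two speedup factors are hypothetical; none of these are measurements.

```python
# Illustrative latency budget. All numbers are assumptions, not measurements.

def total_ms(components: dict[str, float]) -> float:
    """Sum the per-request cost of every stage in the stack."""
    return sum(components.values())

fp32_compute_ms = 8.0  # hypothetical FP32 model compute time

naive = {
    "json_serialization": 2.0,   # upper end of the 0.5-2ms range
    "python_overhead": 0.5,      # upper end of the 0.1-0.5ms range
    "kernel_launches": 5.0,      # e.g. 100 unfused ops at 50 microseconds each
    "compute": fp32_compute_ms,
}

optimized = {
    "binary_serialization": 0.1,
    "python_overhead": 0.1,
    "kernel_launches": 0.25,     # fused operators, far fewer launches
    # Assumed 4x from INT8 and a further 2x from compilation
    "compute": fp32_compute_ms / (4.0 * 2.0),
}

speedup = total_ms(naive) / total_ms(optimized)
print(f"naive:     {total_ms(naive):.2f} ms")
print(f"optimized: {total_ms(optimized):.2f} ms")
print(f"speedup:   {speedup:.1f}x")
```

Under these assumptions the fixed overheads, not the model math, account for roughly half of the naive budget, which is why the order of optimizations matters.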
Historical Context
Low-latency ML inference has been a focus at internet companies since at least 2012, when deep learning began outperforming traditional models in production. NVIDIA's TensorRT (2016) was the first widely available tool to automate inference optimization for GPU - it applies operator fusion, precision calibration, and kernel auto-tuning to minimize inference latency. The concept of quantization for inference existed in the signal processing literature for decades but was systematically applied to neural networks by Jacob et al. (Google) in their quantization-aware training paper (2018).
The modern inference optimization stack - quantization to INT8 or INT4, kernel compilation with torch.compile or TensorRT, operator fusion, continuous batching - represents the cumulative learning of a decade of production inference engineering at Google, Meta, NVIDIA, and the major cloud providers.
The Latency Stack
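Understanding latency requires knowing where time is actually spent on the path from request to response: serialization and deserialization, Python and framework overhead, CPU-to-GPU transfer, kernel launches, and the model computation itself.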
Model computation is the largest single component and the most controllable through optimization. Every other component is overhead around the model.
Hardware Choices for Low Latency
The hardware you choose sets the ceiling on achievable latency:
| Hardware | Best Latency (small model) | Best for |
|---|---|---|
| CPU (modern server) | 0.5-5ms | Very small models, high request rate without batching |
| GPU A10G | 1-10ms | Medium models with batching |
| GPU H100 | 0.2-2ms | Large models, highest throughput |
| FPGA (Xilinx, Intel) | 0.01-0.5ms | Ultra-low latency, fixed model, specialized |
| Apple Neural Engine | 0.5-3ms | On-device iOS inference |
CPU vs GPU trade-off for latency: For a single request (batch size 1), modern CPUs can often match or beat GPUs due to GPU kernel launch overhead (tens of microseconds per kernel, which dominates when the model is small). GPUs win when you can batch many requests together. For p99 latency requirements under 5ms with no batching, a CPU implementation with ONNX Runtime may be faster than a naive GPU implementation.
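The crossover described above can be sketched with a deliberately simple cost model: GPU latency is a fixed launch cost per kernel plus compute, while CPU latency grows with batch size. The kernel count, launch cost, and compute times below are hypothetical values chosen for illustration, and the linear CPU scaling is a pessimistic assumption.

```python
# Toy cost model for the CPU vs GPU crossover. All inputs are assumptions.

def gpu_latency_ms(n_kernels: int, launch_us: float, compute_ms: float) -> float:
    """Launch overhead is paid once per kernel, regardless of batch size."""
    return n_kernels * launch_us / 1000.0 + compute_ms

def cpu_latency_ms(per_example_ms: float, batch_size: int) -> float:
    """Assume CPU time grows linearly with batch size (pessimistic)."""
    return per_example_ms * batch_size

# Batch size 1: launch overhead dominates the GPU path.
gpu_single = gpu_latency_ms(n_kernels=60, launch_us=20.0, compute_ms=0.1)
cpu_single = cpu_latency_ms(per_example_ms=0.8, batch_size=1)

# Batch size 64: the same launch cost is amortized over 64 examples.
gpu_batched = gpu_latency_ms(n_kernels=60, launch_us=20.0, compute_ms=0.6)
cpu_batched = cpu_latency_ms(per_example_ms=0.8, batch_size=64)

print(f"batch 1:  CPU {cpu_single:.2f} ms vs GPU {gpu_single:.2f} ms")
print(f"batch 64: CPU {cpu_batched:.2f} ms vs GPU {gpu_batched:.2f} ms")
```

With these inputs the CPU wins at batch size 1 and the GPU wins at batch size 64; the real crossover point depends on the model and hardware and has to be measured.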
Model Optimization for Latency
Quantization
Quantization reduces model weight precision from FP32 (32-bit) to INT8 (8-bit) or INT4 (4-bit). This provides two benefits: a 4x (INT8) or 8x (INT4) reduction in weight storage, so more of the model fits in cache and memory, and a typical 2-4x speedup in computation, because INT8 SIMD operations process more values per instruction than FP32.
# quantization_for_latency.py
import time

import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantWrapper,
    convert,
    get_default_qconfig,
    prepare,
    quantize_dynamic,
)


class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


def apply_dynamic_quantization(model: nn.Module) -> nn.Module:
    """
    Dynamic quantization: weights are stored as INT8 ahead of time;
    activations stay FP32 between layers and are quantized on the fly
    inside each quantized op. Simplest approach, no calibration needed.
    Best for: LSTM, embedding-heavy models, CPU serving.
    """
    quantized = quantize_dynamic(
        model,
        qconfig_spec={nn.Linear},  # Quantize Linear layers
        dtype=torch.qint8,
    )
    return quantized


def apply_static_quantization(
    model: nn.Module,
    calibration_data: torch.Tensor,
) -> nn.Module:
    """
    Static quantization: both weights and activations quantized to INT8.
    Requires calibration dataset to measure activation ranges.
    Best for: CNNs and MLPs served on x86 CPUs. Higher speedup than dynamic.
    """
    model.eval()
    # Eager-mode static quantization needs quantize/dequantize stubs at the
    # model boundary; QuantWrapper adds them around the unmodified model.
    wrapped = QuantWrapper(model)
    wrapped.qconfig = get_default_qconfig("fbgemm")  # For x86 CPU
    # Prepare: insert observers to measure activation ranges
    prepared = prepare(wrapped)
    # Calibrate: run representative data through the model
    with torch.no_grad():
        prepared(calibration_data)
    # Convert: replace FP32 operations with INT8 equivalents
    quantized = convert(prepared)
    return quantized


def _mean_latency_ms(model: nn.Module, x: torch.Tensor, n_runs: int, warmup: int = 10) -> float:
    """Average wall-clock time per forward pass, after a short warmup."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
    return (time.perf_counter() - start) / n_runs * 1000


def benchmark_quantization(
    model_fp32: nn.Module,
    model_int8: nn.Module,
    batch_size: int = 32,
    feature_dim: int = 512,
    n_runs: int = 1000,
):
    """Compare latency: FP32 vs INT8 on CPU."""
    x = torch.randn(batch_size, feature_dim)
    fp32_ms = _mean_latency_ms(model_fp32, x, n_runs)
    int8_ms = _mean_latency_ms(model_int8, x, n_runs)
    print(f"FP32: {fp32_ms:.2f}ms per batch ({batch_size} examples)")
    print(f"INT8: {int8_ms:.2f}ms per batch ({batch_size} examples)")
    print(f"Speedup: {fp32_ms / int8_ms:.1f}x")
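To show what "quantizing to INT8" does to the numbers themselves, here is a minimal pure-Python sketch of per-tensor affine quantization. It is a simplified illustration of the general idea, not the exact scheme PyTorch's observers use; the helper names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
# Minimal sketch of per-tensor affine INT8 quantization (illustrative only).

def quantize_int8(values: list[float]) -> tuple[list[int], float, int]:
    """Map floats to integers in [-128, 127] using a scale and zero point."""
    lo = min(min(values), 0.0)  # the representable range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point

def dequantize_int8(quantized: list[int], scale: float, zero_point: int) -> list[float]:
    """Recover approximate float values from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"scale={scale:.6f} zero_point={zero_point}")
print(f"quantized={q}")
print(f"max round-trip error={max_error:.6f}")
```

Each value now needs one byte instead of four, and the round-trip error is bounded by the scale, which is the source of the small accuracy loss that has to be measured per task.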
TorchScript and torch.compile
# model_compilation.py
import torch


def compile_model_for_latency(
    model: torch.nn.Module,
    example_input: torch.Tensor,
    method: str = "torchscript",
):
    """
    Compile a model for lower inference latency.

    Methods:
    - "torchscript": JIT tracing, runs without the Python model code, good for deployment
    - "compile": torch.compile with Inductor backend, best for CUDA
    - "onnx": Export to ONNX for cross-platform deployment

    Returns a callable module for "torchscript" and "compile", and an
    onnxruntime.InferenceSession for "onnx".
    """
    model.eval()
    if method == "torchscript":
        # Trace the model with an example input
        # Best for models with fixed control flow (tracing records one path)
        with torch.no_grad():
            traced = torch.jit.trace(model, example_input)
        # Freeze (if needed) and apply inference-only optimizations
        traced = torch.jit.optimize_for_inference(traced)
        return traced
    elif method == "compile":
        # torch.compile with Triton/Inductor backend (PyTorch 2.0+)
        # Fuses operators, generates optimized CUDA kernels
        compiled = torch.compile(
            model,
            backend="inductor",
            mode="reduce-overhead",  # Minimize CUDA kernel launch overhead
        )
        # Warmup to trigger compilation
        with torch.no_grad():
            for _ in range(5):
                compiled(example_input)
        return compiled
    elif method == "onnx":
        import onnxruntime as ort

        onnx_path = "/tmp/model_optimized.onnx"
        torch.onnx.export(
            model,
            example_input,
            onnx_path,
            input_names=["input"],
            output_names=["output"],
            # Mark the batch dimension dynamic on both input and output
            dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
            opset_version=17,
            do_constant_folding=True,
        )
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        session = ort.InferenceSession(
            onnx_path,
            sess_options=sess_options,
            # Falls back to CPU when the CUDA provider is unavailable
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
        return session
    else:
        raise ValueError(f"Unknown method: {method}")
Batching Strategies for Latency
The tension in batching: larger batches increase throughput (GPU utilization) but increase p99 latency (requests wait for batch formation). For latency-critical workloads, use adaptive batching:
# adaptive_batcher.py
import asyncio
import time
from dataclasses import dataclass, field

import numpy as np
import torch


@dataclass
class InferenceRequest:
    request_id: str
    features: np.ndarray
    future: asyncio.Future
    arrival_time: float = field(default_factory=time.perf_counter)


class AdaptiveBatcher:
    """
    Dynamic batcher that adapts batch formation timeout based on current load.
    At low load: short timeout to maintain low latency
    At high load: longer timeout to form larger batches (better GPU utilization)
    """

    def __init__(
        self,
        model: torch.nn.Module,
        max_batch_size: int = 64,
        min_wait_ms: float = 1.0,  # Minimum: 1ms wait at low load
        max_wait_ms: float = 20.0,  # Maximum: 20ms wait at high load
        load_threshold_rps: float = 100.0,
    ):
        self.model = model
        self.max_batch_size = max_batch_size
        self.min_wait = min_wait_ms / 1000
        self.max_wait = max_wait_ms / 1000
        self.load_threshold = load_threshold_rps
        self.pending: list[InferenceRequest] = []
        self.lock = asyncio.Lock()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device).eval()
        # Load tracking
        self.request_times: list[float] = []
        self.window_seconds = 5.0
        self._flush_task = None

    def start(self) -> None:
        """Start the timeout flush loop; call once from inside a running event loop.

        Without it, batches smaller than max_batch_size are never processed.
        """
        self._flush_task = asyncio.create_task(self._flush_loop())

    def _estimate_rps(self) -> float:
        """Estimate requests per second from recent request history."""
        now = time.perf_counter()
        self.request_times = [t for t in self.request_times if now - t < self.window_seconds]
        return len(self.request_times) / self.window_seconds

    def _adaptive_wait(self) -> float:
        """Compute wait time (seconds) based on current load."""
        rps = self._estimate_rps()
        load_fraction = min(1.0, rps / self.load_threshold)
        return self.min_wait + (self.max_wait - self.min_wait) * load_fraction

    async def predict(self, features: np.ndarray, request_id: str) -> np.ndarray:
        """Submit a request for batched inference."""
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        req = InferenceRequest(
            request_id=request_id,
            features=features,
            future=future,
        )
        async with self.lock:
            self.request_times.append(time.perf_counter())
            self.pending.append(req)
            if len(self.pending) >= self.max_batch_size:
                # The task runs after this lock is released, so it can re-acquire it
                asyncio.create_task(self._process_batch())
        return await future

    async def _flush_loop(self):
        """Background loop that flushes on timeout."""
        while True:
            wait = self._adaptive_wait()
            await asyncio.sleep(wait)
            async with self.lock:
                if self.pending:
                    oldest_wait = time.perf_counter() - self.pending[0].arrival_time
                    if oldest_wait >= self.min_wait:
                        asyncio.create_task(self._process_batch())

    async def _process_batch(self):
        """Process up to max_batch_size pending requests as a single batch."""
        async with self.lock:
            if not self.pending:
                return
            batch = self.pending[: self.max_batch_size]
            self.pending = self.pending[self.max_batch_size :]
        batch_latency_start = time.perf_counter()
        try:
            features_np = np.stack([r.features for r in batch])
            # Note: this forward pass blocks the event loop while it runs
            with torch.no_grad():
                x = torch.from_numpy(features_np).float().to(self.device)
                output = self.model(x)
                results = output.cpu().numpy()
            # Computed for export to a metrics system (not used further here)
            batch_latency_ms = (time.perf_counter() - batch_latency_start) * 1000
            for i, req in enumerate(batch):
                queue_wait_ms = (batch_latency_start - req.arrival_time) * 1000
                if not req.future.done():
                    req.future.set_result(results[i])
        except Exception as e:
            # Fail every request in the batch so no caller waits forever
            for req in batch:
                if not req.future.done():
                    req.future.set_exception(e)
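The load-dependent timeout used by the batcher is a linear interpolation between the minimum and maximum wait, capped at the load threshold. The standalone sketch below reproduces that formula with the same default values so the behavior can be checked without a model; the function name `adaptive_wait_ms` is introduced here for illustration.

```python
# Standalone sketch of the batcher's wait-time interpolation.

def adaptive_wait_ms(
    rps: float,
    min_wait_ms: float = 1.0,
    max_wait_ms: float = 20.0,
    load_threshold_rps: float = 100.0,
) -> float:
    """Interpolate linearly from min to max wait as load approaches the threshold."""
    load_fraction = min(1.0, rps / load_threshold_rps)
    return min_wait_ms + (max_wait_ms - min_wait_ms) * load_fraction

for rps in (0, 25, 50, 100, 400):
    print(f"{rps:>4} rps -> wait up to {adaptive_wait_ms(rps):.2f} ms")
```

Because the wait is capped at `max_wait_ms`, that parameter is effectively the queueing-delay budget you are giving up from your latency SLA at peak load.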
Pre-Computation and Caching
The most powerful latency optimization: do not run the model at all. Pre-compute predictions and serve from cache.
# prediction_cache.py
import hashlib
import time
from typing import Optional

import numpy as np
import redis


class PredictionCache:
    """
    Two-level prediction cache for ML inference.
    L1: In-process dict (microseconds)
    L2: Redis (1-2ms)
    Appropriate when: many requests share the same input
    (same user, same item, same query). Surprisingly common in
    search, recommendation, and fraud detection.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        l1_size: int = 10_000,
        l2_ttl_seconds: int = 30,
    ):
        self.redis = redis_client
        self.l2_ttl = l2_ttl_seconds
        # L1: plain dict with oldest-first eviction
        # (use collections.OrderedDict for true LRU in real code)
        self.l1_cache: dict = {}
        self.l1_size = l1_size

    @staticmethod
    def _cache_key(features: np.ndarray) -> str:
        """Stable cache key from feature values (the array shape is not part of the key)."""
        return hashlib.sha256(features.astype(np.float32).tobytes()).hexdigest()[:16]

    def get(self, features: np.ndarray) -> Optional[np.ndarray]:
        """Check L1 then L2 cache."""
        key = self._cache_key(features)
        # L1 check - microseconds
        if key in self.l1_cache:
            value, expiry = self.l1_cache[key]
            if time.time() < expiry:
                return value
            del self.l1_cache[key]
        # L2 check - 1-2ms Redis
        raw = self.redis.get(f"pred:{key}")
        if raw is not None:
            # frombuffer returns a read-only, flattened (1-D) view; copy it
            # so callers get a writable array
            result = np.frombuffer(raw, dtype=np.float32).copy()
            # Promote to L1
            self._set_l1(key, result)
            return result
        return None

    def set(self, features: np.ndarray, prediction: np.ndarray):
        """Store in both L1 and L2."""
        key = self._cache_key(features)
        self._set_l1(key, prediction)
        self.redis.setex(
            f"pred:{key}",
            self.l2_ttl,
            prediction.astype(np.float32).tobytes(),
        )

    def _set_l1(self, key: str, value: np.ndarray, ttl_seconds: float = 5.0):
        if key not in self.l1_cache and len(self.l1_cache) >= self.l1_size:
            # Evict the oldest-inserted entry (dicts keep insertion order).
            # This is FIFO, a cheap approximation of LRU.
            evict_key = next(iter(self.l1_cache))
            del self.l1_cache[evict_key]
        self.l1_cache[key] = (value, time.time() + ttl_seconds)
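The L1 layer above evicts in insertion order rather than by recency of use. If true LRU behavior matters, one option is a small cache built on `collections.OrderedDict`. The sketch below is a minimal illustration under that assumption; the class name `LRUCacheWithTTL` and the injectable `clock` parameter (added to make expiry testable) are hypothetical and not part of the cache above.

```python
# Minimal LRU cache with per-entry TTL, built on OrderedDict (illustrative).
import time
from collections import OrderedDict

class LRUCacheWithTTL:
    def __init__(self, max_size: int, ttl_seconds: float, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data: OrderedDict = OrderedDict()  # key -> (value, expiry)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self.clock() >= expiry:
            del self._data[key]  # expired: drop and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
        self._data[key] = (value, self.clock() + self.ttl)

cache = LRUCacheWithTTL(max_size=2, ttl_seconds=5.0)
cache.set("user:1", 0.91)
cache.set("user:2", 0.12)
cache.get("user:1")        # touch user:1 so user:2 becomes the eviction candidate
cache.set("user:3", 0.55)  # evicts user:2
print(cache.get("user:2"))  # None
```

Reads refresh an entry's position but not its expiry, so a hot key still ages out after the TTL and cannot serve a stale prediction indefinitely.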
Memory Layout Optimization
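GPU performance is heavily influenced by memory access patterns. Tensors with poor layout cause non-coalesced memory accesses, which in the worst case can reduce effective memory throughput by 10-30x, or force a hidden copy into contiguous memory on every call: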
# memory_layout.py
import time

import torch


def benchmark_memory_layout(batch_size: int = 64, seq_len: int = 128, hidden: int = 768):
    """
    Show the latency difference between optimal and suboptimal tensor layouts.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(hidden, hidden).to(device)

    def sync():
        # CUDA kernels run asynchronously; wait for them before reading the clock
        if device.type == "cuda":
            torch.cuda.synchronize()

    # Suboptimal: non-contiguous tensor from a slice or transpose
    base = torch.randn(seq_len, batch_size, hidden, device=device)
    non_contiguous = base.transpose(0, 1)  # [batch, seq, hidden] - but non-contiguous
    assert not non_contiguous.is_contiguous()
    # Optimal: contiguous tensor in the right layout
    contiguous = non_contiguous.contiguous()  # Makes a contiguous copy
    n = 200
    with torch.no_grad():
        # Warmup
        for _ in range(10):
            model(contiguous[:, 0, :])
        # Non-contiguous benchmark: reshape cannot return a view here,
        # so it copies the data on every iteration
        sync()
        t0 = time.perf_counter()
        for _ in range(n):
            x = non_contiguous.reshape(batch_size * seq_len, hidden)
            model(x)
        sync()
        non_contig_ms = (time.perf_counter() - t0) / n * 1000
        # Contiguous benchmark: reshape is a free view
        sync()
        t0 = time.perf_counter()
        for _ in range(n):
            x = contiguous.reshape(batch_size * seq_len, hidden)
            model(x)
        sync()
        contig_ms = (time.perf_counter() - t0) / n * 1000
    print(f"Non-contiguous: {non_contig_ms:.2f}ms")
    print(f"Contiguous: {contig_ms:.2f}ms")
    print(f"Speedup: {non_contig_ms / contig_ms:.1f}x")
CUDA Pinned Memory and Async Transfers
CPU-to-GPU memory transfer is a significant latency source. Pinned memory (page-locked) enables asynchronous transfers that overlap with GPU computation:
# cuda_pinned_memory.py
import time

import numpy as np
import torch


class PinnedMemoryBuffer:
    """
    Pre-allocated pinned (page-locked) staging buffer for asynchronous
    CPU-to-GPU transfers.
    Eliminates memory allocation overhead on the critical path.
    """

    def __init__(self, max_batch_size: int, feature_dim: int):
        # Pre-allocate pinned memory once at startup
        self.pinned_buffer = torch.zeros(
            max_batch_size,
            feature_dim,
            dtype=torch.float32,
            pin_memory=True,  # Lock in RAM - enables async DMA transfer
        )
        self.device = torch.device("cuda")

    def copy_to_gpu_async(
        self,
        features: np.ndarray,
        stream: torch.cuda.Stream,
    ) -> torch.Tensor:
        """
        Copy numpy features to GPU using async DMA transfer.
        The CPU can continue doing other work while the transfer happens.
        The buffer is reused, so wait for the previous transfer on `stream`
        to finish before calling this again with different data.
        """
        batch_size = features.shape[0]
        # Copy numpy array into pre-allocated pinned buffer (CPU-CPU copy)
        self.pinned_buffer[:batch_size].numpy()[:] = features
        # Async transfer: DMA copies while CPU continues
        with torch.cuda.stream(stream):
            gpu_tensor = self.pinned_buffer[:batch_size].to(
                self.device,
                non_blocking=True,  # Async copy
            )
        return gpu_tensor


def benchmark_transfer_strategies(batch_size: int = 32, feature_dim: int = 512):
    if not torch.cuda.is_available():
        print("CUDA not available, skipping benchmark")
        return
    device = torch.device("cuda")
    features = np.random.randn(batch_size, feature_dim).astype(np.float32)
    n = 500
    # Strategy 1: Standard transfer from pageable memory
    # (allocates new GPU memory and stages the copy each time)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        x = torch.from_numpy(features).to(device)
    torch.cuda.synchronize()
    standard_ms = (time.perf_counter() - t0) / n * 1000
    # Strategy 2: Pinned memory transfer
    buffer = PinnedMemoryBuffer(batch_size, feature_dim)
    stream = torch.cuda.Stream()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        # The same data is written every iteration, so reusing the buffer
        # without a host-side sync is harmless in this benchmark
        x = buffer.copy_to_gpu_async(features, stream)
        # Make the default stream wait for the copy before consuming x
        torch.cuda.current_stream().wait_stream(stream)
    torch.cuda.synchronize()
    pinned_ms = (time.perf_counter() - t0) / n * 1000
    print(f"Standard transfer: {standard_ms:.3f}ms")
    print(f"Pinned memory: {pinned_ms:.3f}ms")
    print(f"Speedup: {standard_ms / pinned_ms:.1f}x")
Real Production Targets
| System | Latency Requirement | Technique |
|---|---|---|
| Google Search ranking | p99 under 50ms | Distilled models, TensorFlow XLA compilation |
| Meta News Feed | p99 under 30ms | INT8 quantization, FBGEMM, custom CUDA kernels |
| Meta Ads (Advantage+ ranking) | p99 under 10ms | INT8, model distillation, pre-computed embeddings |
| Stripe fraud detection | p99 under 100ms | CPU inference, no batching, XGBoost |
| TikTok recommendation | p99 under 30ms | GPU with dynamic batching, model parallelism |
| Apple Face ID | under 1ms | Apple Neural Engine, CoreML, on-device INT8 |
Production Engineering Notes
Profile before optimizing: Use PyTorch's built-in profiler to identify where time is actually spent before applying any optimization. Engineers consistently misjudge where the bottleneck is.
import torch.profiler

# Assumes `model` and `features` already exist and live on the same device
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    with torch.no_grad():
        model(features)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
CUDA kernel launch overhead compounds: Each PyTorch operation that runs on GPU has a ~10-50 microsecond kernel launch overhead. A model with 100 small operations can spend 1-5ms in kernel launches alone. Fusion (combining multiple operations into one kernel) removes most of this overhead. torch.compile does this automatically for most models.
Latency vs throughput is a trade-off you must make explicitly: The settings that minimize p99 latency (small max_batch_size, short wait time) are opposite to the settings that maximize throughput (large batch, long wait). For latency-critical systems, accept lower GPU utilization to maintain latency targets. Define your SLA first, then optimize for throughput within that constraint.
:::warning The p99 Trap - Averages Hide Tail Latency
Most optimization efforts focus on average latency. But SLAs are typically defined in terms of p99 or p99.9. A model that runs in 5ms average but 200ms p99 (due to occasional GPU kernel recompilation, GC pauses, or network jitter) fails its SLA. Measure and optimize the tail. Use histogram metrics (Prometheus histogram) not averages.
:::
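The gap between mean and tail is easy to see on synthetic data. The sketch below uses a nearest-rank percentile and a made-up latency sample in which 2% of requests hit a slow path; the sample values and the helper name `percentile` are assumptions for illustration.

```python
# Mean vs tail latency on a synthetic sample (illustrative numbers).
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]

# 980 fast requests at 5ms and 20 slow requests at 200ms
latencies_ms = [5.0] * 980 + [200.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.1f}ms p99={p99_ms:.1f}ms")
```

Here the mean stays under 10ms while p99 sits at 200ms, so a dashboard that only tracks averages would show a healthy service that is in fact failing a p99 SLA.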
:::danger GPU Warm-Up Latency
The first inference request after a GPU model is loaded triggers JIT compilation, kernel warm-up, and CUDA graph initialization. This first request takes 10-100x longer than subsequent ones. Without explicit warm-up (running 10-100 dummy requests at startup), the first real user request hits this cold-start latency. Always warm up models during server initialization, before marking the instance as ready in health checks.
:::
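One way to wire warm-up into readiness is to keep a ready flag that only flips after the dummy requests complete. The sketch below is framework-agnostic and uses a plain callable in place of a real model; the class name `WarmModelServer`, its methods, and the 200/503 status convention are assumptions for illustration.

```python
# Sketch: gate the health check on warm-up completion (illustrative).
import time
from typing import Any, Callable

class WarmModelServer:
    def __init__(self, predict_fn: Callable[[Any], Any], example_input: Any, warmup_runs: int = 20):
        self.predict_fn = predict_fn
        self.example_input = example_input
        self.warmup_runs = warmup_runs
        self.ready = False

    def warm_up(self) -> list[float]:
        """Run dummy requests and return per-run latency in milliseconds."""
        timings_ms = []
        for _ in range(self.warmup_runs):
            start = time.perf_counter()
            self.predict_fn(self.example_input)
            timings_ms.append((time.perf_counter() - start) * 1000)
        self.ready = True  # only now should the load balancer send traffic
        return timings_ms

    def health_status(self) -> int:
        """HTTP-style status for a readiness probe."""
        return 200 if self.ready else 503

server = WarmModelServer(predict_fn=lambda x: [v * 2 for v in x], example_input=[1.0, 2.0])
print(server.health_status())  # 503 before warm-up
timings = server.warm_up()
print(server.health_status())  # 200 after warm-up
```

Logging the returned timings is a cheap way to confirm that the first few runs really are slower and that latency has settled before the instance goes live.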
Interview Q&A
Q: How would you get a model serving endpoint from 40ms p99 to under 5ms p99?
Profile first to identify the bottleneck. Likely steps in order of impact: (1) Switch serialization from REST+JSON to gRPC+Protobuf - saves 1-5ms. (2) Apply INT8 quantization - cuts computation time by 2-4x. (3) Compile with torch.compile or TorchScript + optimize_for_inference - up to a further 2x. (4) Use CUDA pinned memory for transfers - saves 0.5-1ms. (5) Implement dynamic batching with a tight wait cap to amortize kernel launch overhead under load. (6) Use prediction caching if the input space has sufficient repetition. At each step, measure the actual improvement and verify accuracy is maintained.
Q: Why does INT8 quantization reduce latency and what is the accuracy trade-off?
INT8 quantization replaces 32-bit floating point weights and activations with 8-bit integers. Modern CPUs and GPUs have SIMD and tensor core instructions that process 4x more INT8 values per clock cycle than FP32, leading to 2-4x throughput improvement. The accuracy trade-off is typically 0.5-1% on classification tasks (measured by top-1 accuracy or AUC) with post-training quantization, or less than 0.1% with quantization-aware training. The accuracy loss is higher for models that rely on fine-grained weight differences (attention mechanisms with small softmax values) and lower for embedding-heavy models. Always measure accuracy on your specific task with your quantized model before deployment.
Q: What is the difference between latency and throughput in ML serving, and how do you optimize for each?
Latency is the time for one request to complete. Throughput is the number of requests per unit time. They are often in tension. Maximizing throughput: use large batches (higher GPU utilization), longer wait times for batch formation, and as many concurrent model instances as GPU memory allows. Minimizing latency: use small or no batches (no wait time), prioritize in-flight requests over new ones (preemptive scheduling), and ensure GPU memory is not oversubscribed (which forces evictions and reloads). For latency-critical systems, set a maximum batch formation wait time (e.g., 5ms) and never exceed it, even if that means smaller batches and lower GPU utilization.
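The tension can be shown with a simple cost model in which each batch pays a fixed cost plus a per-item cost, and a request may wait up to the batch formation timeout before its batch runs. The cost constants below are hypothetical and chosen only to make the trend visible.

```python
# Toy model of the latency/throughput trade-off. Constants are assumptions.

def batch_time_ms(batch_size: int, fixed_ms: float = 2.0, per_item_ms: float = 0.05) -> float:
    """Time to run one batch: fixed overhead plus a per-item cost."""
    return fixed_ms + per_item_ms * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests per second if batches run back to back."""
    return batch_size / batch_time_ms(batch_size) * 1000.0

def worst_case_latency_ms(batch_size: int, max_wait_ms: float) -> float:
    """A request can wait the full timeout, then the batch has to run."""
    return max_wait_ms + batch_time_ms(batch_size)

for batch_size, max_wait_ms in [(1, 0.0), (8, 1.0), (64, 5.0)]:
    print(
        f"batch={batch_size:>2} wait<={max_wait_ms:.0f}ms  "
        f"throughput={throughput_rps(batch_size):>8.0f} rps  "
        f"worst-case latency={worst_case_latency_ms(batch_size, max_wait_ms):.2f} ms"
    )
```

In this model, moving from batch size 1 to 64 raises throughput by more than an order of magnitude while the worst-case latency grows roughly fivefold, which is the trade the SLA has to arbitrate.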
Q: How does pre-computation reduce inference latency and when is it applicable?
Pre-computation runs the model before the request arrives and stores the result. At request time, serving is a key-value lookup (1-2ms) instead of model inference (10-50ms). It is applicable when: inputs are from a known finite set (user IDs, item IDs, query templates), freshness requirements allow results to be computed in advance (predictions that are valid for 30+ minutes), and the total pre-computation space is manageable (cannot pre-compute for every possible input combination). Common applications: user embedding pre-computation in recommendation, item score pre-computation for top-K retrieval, query-expansion dictionaries in search. Not applicable for: fraud detection (depends on the specific transaction), real-time bidding (depends on the specific page + user context at the moment of the bid).
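A minimal sketch of the pattern: score a known, finite set of item IDs offline, then serve from the table and fall back to the model only on a miss. The scoring function is a stand-in for a real model, and all names here (`score_item`, `precompute`, `serve`) are hypothetical.

```python
# Sketch of pre-computation with a model fallback (illustrative).

model_calls = {"count": 0}

def score_item(item_id: int) -> float:
    """Stand-in for an expensive model call (hypothetical scoring rule)."""
    model_calls["count"] += 1
    return (item_id * 37 % 100) / 100.0

def precompute(item_ids) -> dict[int, float]:
    """Offline job: run the model once per known item and store the result."""
    return {item_id: score_item(item_id) for item_id in item_ids}

def serve(item_id: int, table: dict[int, float]) -> float:
    """Online path: table lookup first, model only for unseen items."""
    score = table.get(item_id)
    if score is None:
        score = score_item(item_id)
    return score

table = precompute(range(1000))
calls_after_precompute = model_calls["count"]

hot = serve(42, table)     # served from the table, no model call
cold = serve(5000, table)  # unseen item, falls back to the model

print(f"offline model calls: {calls_after_precompute}")
print(f"online model calls:  {model_calls['count'] - calls_after_precompute}")
```

The fallback path keeps the system correct for inputs outside the pre-computed set, at the cost of those requests paying full inference latency.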
Q: What is CUDA kernel launch overhead and how does operator fusion reduce it?
Every PyTorch operation that executes on the GPU submits a CUDA kernel - a program that runs on GPU cores. Submitting a kernel from the CPU takes 10-50 microseconds. A model with 100 sequential operations (attention heads, layer norms, activations) submits 100 kernels, costing 1-5ms in launch overhead alone. Operator fusion combines multiple operations into a single kernel. Instead of three kernels (for example bias add, activation, and residual add), one fused kernel handles all three and reads and writes the data once. torch.compile's Inductor backend does this automatically: it analyzes the computation graph and generates fused kernels that are often competitive with hand-written CUDA. For transformer models, kernel fusion alone can provide a 2-3x latency improvement, depending on the model and batch size.
