Latency vs Throughput Trade-offs in ML Systems
Latency and throughput are not independent dials you can tune separately. They are coupled through a fundamental law of queueing theory, and ignoring that coupling is how teams build systems that fail exactly when they need them most.
The Production Moment
The inference team had spent six weeks optimizing their model serving stack. They had profiled every layer, implemented dynamic batching, and reduced the p50 latency from 80ms to 12ms. The engineering lead presented the results at the all-hands. Impressive graphs. Applause.
Three weeks later, the product launched. Traffic was 40× higher than the pre-launch estimate. The system, which had been serving 500 requests per second in testing with p99 latency of 45ms, was now being asked to serve 20,000 requests per second. Within 20 minutes of launch, p99 latency had climbed to 8,000ms. Users were getting timeouts. The on-call engineer was getting paged every 90 seconds.
What went wrong? The team had optimized for the median case (p50 latency) without understanding how queueing theory governs the relationship between load and latency. As utilization approaches 100%, latency doesn't just increase - it grows without bound, roughly as 1/(1 - utilization). They had optimized a system that looked healthy at low utilization and collapsed once load approached capacity.
This lesson is about the physics of serving infrastructure: why latency and throughput are coupled, how to reason about their relationship, and how production engineers use batching, caching, and capacity planning to manage the trade-off intelligently.
Why This Trade-off Exists
The fundamental tension is simple to state: getting work done faster (low latency) and getting more work done per unit time (high throughput) require different things from your infrastructure.
Low latency demands: processing requests immediately, not waiting to batch them; keeping resources underutilized to ensure availability; avoiding queues that add waiting time.
High throughput demands: batching requests together to amortize per-request overhead; saturating expensive resources (GPUs) to get maximum utilization; accepting some queuing delay to pack requests efficiently.
These demands are in direct conflict. The optimal point between them depends on your use case, and understanding where that point is - and why it moves as load increases - is the core skill of ML serving engineering.
Little's Law: The Foundation
Before batching strategies, KV caches, or dynamic scheduling - there is one equation you must understand. It is Little's Law (John Little, 1961, MIT), and it governs every queueing system from bank tellers to GPU servers.
$$
L = \lambda W
$$
Where:
- $L$ = average number of requests in the system (in queue + being processed)
- $\lambda$ = average arrival rate (requests per second)
- $W$ = average time a request spends in the system (latency)
This is a mathematical identity - it holds for any stable queueing system regardless of arrival distribution, service time distribution, or queue discipline. It is one of the most powerful and underused tools in systems engineering.
What it tells you:
If you know two of the three quantities, you can compute the third. More importantly, it reveals hidden relationships:
```python
def little_law_analysis(arrival_rate_qps: float, avg_latency_ms: float) -> dict:
    """
    Apply Little's Law to understand system state.
    L = lambda * W
    """
    # Average latency in seconds
    avg_latency_s = avg_latency_ms / 1000

    # Average number of requests in the system
    avg_requests_in_system = arrival_rate_qps * avg_latency_s

    # If we know max concurrent requests the system can handle (e.g., GPU batch size limit)
    # we can find the max stable throughput
    max_concurrent = 32  # e.g., max batch size
    max_stable_qps = max_concurrent / avg_latency_s

    return {
        "avg_requests_in_system": avg_requests_in_system,
        "max_stable_qps": max_stable_qps,
        "utilization_pct": (arrival_rate_qps / max_stable_qps) * 100
    }

# Example: ML serving system
# Avg latency: 50ms, arrival rate: 400 QPS
result = little_law_analysis(400, 50)
print(f"Average requests in system: {result['avg_requests_in_system']:.0f}")  # 20
print(f"Max stable QPS: {result['max_stable_qps']:.0f}")                      # 640 QPS
print(f"Current utilization: {result['utilization_pct']:.1f}%")               # 62.5%
```
The Utilization Death Curve
The most important application of Little's Law is understanding what happens as utilization ($\rho = \lambda / \mu$, where $\mu$ is the service rate) approaches 1 (100%).
For an M/M/1 queue (Poisson arrivals, exponential service times - a reasonable approximation for ML serving):
$$
W = \frac{1}{\mu - \lambda} = \frac{1/\mu}{1 - \rho}
$$
As $\rho \to 1$, latency $W \to \infty$.
```python
def mm1_mean_latency(service_rate: float, utilization: float) -> float:
    """
    M/M/1 queue mean sojourn time (time in system including waiting).
    service_rate: mu (requests processed per second)
    utilization: rho = lambda/mu (fraction of capacity used)
    """
    if utilization >= 1.0:
        return float('inf')
    mean_service_time = 1.0 / service_rate
    return mean_service_time / (1 - utilization)

service_rate = 100  # 100 requests/second capacity
utilizations = [0.1, 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]

print("Utilization | Mean Latency | Latency Ratio vs 10%")
print("-" * 55)
baseline = mm1_mean_latency(service_rate, 0.1)
for u in utilizations:
    w = mm1_mean_latency(service_rate, u)
    ratio = w / baseline
    print(f" {u*100:.0f}% | {w*1000:.1f}ms | {ratio:.1f}x")

# Output:
# Utilization | Mean Latency | Latency Ratio vs 10%
# -------------------------------------------------------
#  10% | 11.1ms | 1.0x
#  30% | 14.3ms | 1.3x
#  50% | 20.0ms | 1.8x
#  70% | 33.3ms | 3.0x
#  80% | 50.0ms | 4.5x
#  90% | 100.0ms | 9.0x
#  95% | 200.0ms | 18.0x
#  99% | 1000.0ms | 90.0x
```
The table tells the story: at 90% utilization, latency is 9× higher than at 10% utilization. At 99%, it's 90× higher. This is the production failure mode: the team in the opening scenario had tested at 500 QPS against a capacity of ~550 QPS (91% utilization), already on the steep part of the curve. At 20,000 QPS the offered load was roughly 36× capacity, so the queue grew without bound and requests timed out.
Production rule: Never run ML serving infrastructure above 70% average utilization. Reserve 30% for traffic spikes and to keep latency on the flat part of the curve.
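The 70% figure is a rule of thumb. Under the same M/M/1 assumptions, the utilization ceiling for a given mean-latency target can be solved directly from $W = (1/\mu)/(1-\rho)$. The sketch below does that; the function names and the example numbers (10 ms service time, a handful of latency targets) are illustrative assumptions, and real serving traffic only approximately follows M/M/1.

```python
def max_utilization_for_latency(service_time_ms: float, target_mean_latency_ms: float) -> float:
    """
    Invert the M/M/1 sojourn-time formula W = S / (1 - rho) for rho.
    Returns the highest utilization at which mean latency stays <= target.
    """
    if target_mean_latency_ms <= service_time_ms:
        # Even an idle server cannot beat its own service time
        return 0.0
    return 1.0 - service_time_ms / target_mean_latency_ms


def max_qps_for_latency(service_time_ms: float, target_mean_latency_ms: float) -> float:
    """Arrival rate (requests/second) that corresponds to that utilization cap."""
    service_rate_qps = 1000.0 / service_time_ms  # mu
    return service_rate_qps * max_utilization_for_latency(service_time_ms, target_mean_latency_ms)


# Hypothetical server: 10 ms mean service time (mu = 100 requests/second)
for target_ms in [15, 20, 40, 100]:
    rho = max_utilization_for_latency(10, target_ms)
    qps = max_qps_for_latency(10, target_ms)
    print(f"target {target_ms:>3} ms -> max utilization {rho:.0%}, max load {qps:.0f} QPS")
```

A 10 ms service with a 20 ms mean-latency target can only be loaded to 50%. The tolerated utilization rises as the target loosens, but this bounds the mean only; a p99 target needs more headroom still.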
Tail Latency: Why p50 Lies
The p50 (median) latency is the metric that lies. A system can have excellent p50 latency and catastrophic p99 latency simultaneously, and the p99 is what determines user experience for 1% of your traffic - which at 10M requests/day is 100,000 requests seeing terrible performance.
Why does tail latency happen?
- Hardware variance: CPU scheduling jitter, memory GC pauses, network packet retransmissions
- Resource contention: multiple threads competing for L3 cache, memory bus, PCIe bandwidth
- Long-tail request complexity: some requests are harder than others (longer documents, more complex queries)
- Head-of-line blocking: one slow request blocks others in the same batch
```python
import numpy as np

def analyze_latency_distribution(latencies_ms: list[float]) -> dict:
    """Analyze latency distribution for serving system diagnosis."""
    arr = np.array(latencies_ms)
    p50 = float(np.percentile(arr, 50))
    p99 = float(np.percentile(arr, 99))
    return {
        "p50_ms": p50,
        "p90_ms": float(np.percentile(arr, 90)),
        "p95_ms": float(np.percentile(arr, 95)),
        "p99_ms": p99,
        "p999_ms": float(np.percentile(arr, 99.9)),
        "mean_ms": float(np.mean(arr)),
        "tail_ratio": p99 / p50,
        "recommendation": "investigate tail" if p99 > 3 * p50 else "healthy"
    }

# Simulate a bimodal latency distribution (normal requests + occasional GC pauses).
# The slow mode is 2% of traffic so that p99 falls inside it. With exactly 1% slow
# requests, p99 sits on the boundary between the two modes and looks deceptively healthy.
np.random.seed(42)
normal_latencies = np.random.normal(loc=15, scale=3, size=9800)      # 98% of requests
gc_pause_latencies = np.random.normal(loc=200, scale=50, size=200)   # 2% hit GC pause
all_latencies = np.concatenate([normal_latencies, gc_pause_latencies])

result = analyze_latency_distribution(all_latencies.tolist())
for k, v in result.items():
    if isinstance(v, float):
        print(f" {k}: {v:.1f}")
    else:
        print(f" {k}: {v}")

# Approximate output (exact digits depend on the random draw):
#   p50_ms:     ~15   (looks great!)
#   p99_ms:     ~200  (terrible for the slowest 1% of requests)
#   tail_ratio: ~13   (p99/p50 > 3 signals a problem)
```
The Fan-out Problem: Many ML serving systems involve parallel requests - fetch features from multiple services, query multiple model replicas. The end-to-end latency is governed by the slowest component. This is why tail latency in subsystems amplifies at the system level.
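If a system makes 10 parallel requests, each with p99 = 10ms, the probability that at least one exceeds 10ms is $1 - 0.99^{10} \approx 9.6\%$. In other words, a single service's p99 becomes roughly the system's p90. To hold a system-level p99 of 10ms, each service would have to meet 10ms at about its own p99.9.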
```python
def effective_pXX_with_fanout(
    num_parallel_services: int,
    target_percentile: float = 0.99
) -> float:
    """
    When making N parallel calls, system latency is the max of N draws.
    If each call independently meets its deadline with probability p,
    all N meet it with probability p^N. Solving p^N = target_percentile
    gives the per-service percentile required (returned as a fraction).
    """
    return target_percentile ** (1.0 / num_parallel_services)

# To achieve p99 system latency, what single-service percentile must we target?
for n in [1, 2, 5, 10, 20]:
    pct = effective_pXX_with_fanout(n) * 100
    print(f" {n} parallel calls: each service must meet p{pct:.2f} to achieve p99 system")

# 1 parallel calls: each service must meet p99.00
# 2 parallel calls: each service must meet p99.50
# 5 parallel calls: each service must meet p99.80
# 10 parallel calls: each service must meet p99.90
# 20 parallel calls: each service must meet p99.95
```
This is why microservice architectures with many parallel calls have notoriously bad tail latency - each added service dependency adds to the effective "system p99" burden on individual services.
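The analytic result can be sanity-checked with a small simulation. The sketch below assumes each call's latency is an independent draw, represents each call by its percentile rank (a uniform random number), and counts how often the slowest of N calls lands beyond the single-service p99. The function names, trial count, and seed are arbitrary choices made for illustration.

```python
import random


def prob_any_call_exceeds(num_calls: int, percentile: float = 0.99) -> float:
    """Analytic probability that at least one of N independent calls is slower than its own `percentile`."""
    return 1.0 - percentile ** num_calls


def simulate_fanout(num_calls: int, percentile: float = 0.99,
                    trials: int = 20_000, seed: int = 0) -> float:
    """
    Monte Carlo estimate of the same probability.
    Each call is represented by its percentile rank in [0, 1); a rank above
    `percentile` means that call landed in its own tail.
    """
    rng = random.Random(seed)
    slow_requests = 0
    for _ in range(trials):
        slowest_rank = max(rng.random() for _ in range(num_calls))
        if slowest_rank > percentile:
            slow_requests += 1
    return slow_requests / trials


for n in [1, 10, 50]:
    print(f"{n:>2} calls: analytic {prob_any_call_exceeds(n):.3f}, "
          f"simulated {simulate_fanout(n):.3f}")
```

Correlated slowdowns (a shared network hiccup, for example) break the independence assumption, so real systems can deviate from these figures in either direction.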
Batching: The Core Trade-off
Batching is the primary lever for improving GPU throughput at the cost of increased latency. Understanding it deeply is essential for ML serving.
Why Batching Helps
GPU execution is highly parallelized. A GPU with thousands of CUDA cores can process a batch of 64 images almost as fast as it processes a single image, because the matrix multiplications in neural network forward passes are data-parallel.
For a transformer model, the forward pass time scales roughly as:
$$
T(B) \approx T(1)\,(1 + \alpha \ln B)
$$
where $B$ is the batch size and $\alpha$ is a small constant (roughly 0.1–0.3 depending on model size and hardware). With $\alpha = 0.2$, batching 32 requests takes only ~1.7× as long as processing a single request, but serves 32× more users per GPU call.
```python
import numpy as np

def compute_batching_efficiency(
    single_request_latency_ms: float,
    batch_sizes: list[int],
    # Empirical parameter: how much does latency grow with batch size?
    # Transformer: alpha ~ 0.2 (sublinear growth)
    alpha: float = 0.2
) -> dict:
    results = {}
    for batch_size in batch_sizes:
        # T(B) = T(1) * (1 + alpha * ln B); every request in the batch sees this latency
        batch_latency = single_request_latency_ms * (1 + alpha * np.log(batch_size))
        requests_per_second = batch_size / (batch_latency / 1000)
        throughput_ratio = requests_per_second / (1 / (single_request_latency_ms / 1000))
        results[batch_size] = {
            "batch_latency_ms": round(batch_latency, 1),
            "requests_per_second": round(requests_per_second, 0),
            "throughput_improvement": round(throughput_ratio, 1)
        }
    return results

single_latency = 20.0  # 20ms for single request
batching_result = compute_batching_efficiency(single_latency, [1, 4, 8, 16, 32, 64, 128])

print(f"{'Batch':<8} {'Latency':<12} {'RPS':<12} {'Throughput x'}")
print("-" * 45)
for batch_size, metrics in batching_result.items():
    print(f"{batch_size:<8} {metrics['batch_latency_ms']:<12} "
          f"{metrics['requests_per_second']:<12.0f} {metrics['throughput_improvement']:.1f}x")

# Batch    Latency      RPS          Throughput x
# ---------------------------------------------
# 1        20.0         50           1.0x
# 4        25.5         157          3.1x
# 8        28.3         283          5.7x
# 16       31.1         515          10.3x
# 32       33.9         945          18.9x
# 64       36.6         1747         34.9x
# 128      39.4         3248         65.0x
```
The key insight: going from batch size 1 to 32, latency increases by about 1.7× (20ms to 33.9ms) but throughput increases by 18.9×. That is an extraordinary efficiency gain.
Static vs Dynamic Batching
Static batching: wait for a fixed batch size before processing. Simple, predictable, but adds maximum queuing latency.
Dynamic batching: process whatever requests have arrived within a time window, up to a maximum batch size. Balances latency with throughput.
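A back-of-the-envelope comparison shows why the time window matters at low traffic. The sketch below assumes evenly spaced arrivals (a simplification, since real arrivals are bursty) and estimates how long the first request in a batch waits for the batch to fill. The function names and example rates are illustrative, not taken from any serving framework.

```python
def static_fill_wait_ms(arrival_rate_qps: float, batch_size: int) -> float:
    """Time the first request waits for (batch_size - 1) more arrivals at a steady rate."""
    return (batch_size - 1) / arrival_rate_qps * 1000.0


def dynamic_fill_wait_ms(arrival_rate_qps: float, batch_size: int, max_wait_ms: float) -> float:
    """Same wait, but the batch is dispatched once max_wait_ms has elapsed."""
    return min(static_fill_wait_ms(arrival_rate_qps, batch_size), max_wait_ms)


for qps in [100, 1_000, 10_000]:
    static_ms = static_fill_wait_ms(qps, batch_size=32)
    dynamic_ms = dynamic_fill_wait_ms(qps, batch_size=32, max_wait_ms=10.0)
    print(f"{qps:>6} QPS: static waits {static_ms:6.1f} ms, dynamic waits {dynamic_ms:5.1f} ms")
```

At 100 QPS a static batch of 32 would hold the first request for about 310 ms, while a 10 ms window caps the wait. At 10,000 QPS the batch fills in about 3 ms and the two strategies behave the same.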
```python
import asyncio
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any

@dataclass
class Request:
    id: str
    data: Any
    future: asyncio.Future
    arrival_time: datetime = field(default_factory=datetime.now)

class DynamicBatcher:
    """
    Dynamic batching with configurable max batch size and max wait time.
    This is the pattern implemented by Triton Inference Server's dynamic batcher
    (vLLM uses continuous batching, covered in the next section).
    """
    def __init__(
        self,
        model_fn,
        max_batch_size: int = 32,
        max_wait_ms: float = 10.0  # max time to wait for more requests
    ):
        self.model_fn = model_fn
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: deque[Request] = deque()
        self._processing_task = None

    async def predict(self, data: Any) -> Any:
        """Add request to batch queue and wait for result."""
        # Create the future on the running loop so it is bound to the right event loop
        future = asyncio.get_running_loop().create_future()
        request = Request(id=str(id(data)), data=data, future=future)
        self.queue.append(request)
        # Start background processing if not running
        if self._processing_task is None or self._processing_task.done():
            self._processing_task = asyncio.create_task(self._process_loop())
        # Wait for this request's result
        return await request.future

    async def _process_loop(self):
        """Continuously process batches from the queue."""
        while self.queue:
            # Collect requests for this batch
            deadline = datetime.now() + timedelta(milliseconds=self.max_wait_ms)
            batch = []
            while len(batch) < self.max_batch_size:
                if self.queue:
                    batch.append(self.queue.popleft())
                elif datetime.now() >= deadline:
                    break
                else:
                    # Wait briefly for more requests to arrive
                    await asyncio.sleep(0.001)  # 1ms polling
            if not batch:
                break
            # Process the batch
            inputs = [req.data for req in batch]
            try:
                # Run the blocking model call in a worker thread so the event loop
                # can keep accepting requests while the batch is being processed
                results = await asyncio.to_thread(self.model_fn, inputs)
                for req, result in zip(batch, results):
                    if not req.future.done():
                        req.future.set_result(result)
            except Exception as e:
                for req in batch:
                    if not req.future.done():
                        req.future.set_exception(e)
```
Continuous Batching (vLLM Pattern)
For LLM inference, static and dynamic batching have a critical limitation: requests in a batch must all finish together before new requests can join. If one request generates a very long response, short requests in the same batch are blocked.
Continuous batching (Orca, 2022; vLLM, 2023) solves this by allowing requests to join and leave the batch at each token generation step.
The result: GPU utilization can rise from roughly 50–60% (static batching) to 80–90% (continuous batching). For large-scale LLM serving, this can be the difference between a 2× and a 5× throughput improvement over naive serving.
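The scheduling difference can be illustrated with a toy model. The sketch below counts decode steps only, with one token per active request per step, and ignores prefill, memory limits, and preemption, so it is not how vLLM or Orca actually schedule. All function names and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled before the next step."""
    waiting = deque(output_lengths)
    running: list[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that just finished
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


def slot_utilization(output_lengths: list[int], batch_size: int, steps: int) -> float:
    """Fraction of (step, slot) pairs that produced a useful token."""
    return sum(output_lengths) / (steps * batch_size)


lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # two long responses among short ones
for name, fn in [("static", static_batching_steps), ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size=4)
    print(f"{name:<10} {steps:>4} steps, slot utilization {slot_utilization(lengths, 4, steps):.0%}")
```

With these lengths, static batching needs 200 decode steps because each batch is held hostage by its 100-token request, while continuous batching finishes in 110 steps by overlapping the two long requests and backfilling freed slots.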
Caching Strategies
Caching is the other primary lever for improving latency/throughput. The key is understanding what to cache and at what level.
Feature Caching
The most impactful cache in ML serving is usually the feature cache - storing pre-computed feature vectors for users and items.
```python
import json

class FeatureCache:
    """
    Multi-level cache for ML features.
    L1: in-process memory (fastest, smallest)
    L2: Redis cluster (fast, medium size)
    L3: Compute from scratch (slow, unlimited)
    """
    def __init__(self, redis_client, l1_max_size: int = 10_000, ttl_seconds: int = 3600):
        self.redis = redis_client
        # In-process cache with FIFO eviction (dicts preserve insertion order).
        # Entries here never expire; a production L1 would need a TTL and true LRU.
        self.l1_cache: dict = {}
        self.l1_max_size = l1_max_size
        self.ttl = ttl_seconds

    def get_user_features(self, user_id: str) -> dict | None:
        """Look up features in L1, then L2; return None on a full miss."""
        cache_key = f"user_features:{user_id}"
        # L1: in-process
        if cache_key in self.l1_cache:
            return self.l1_cache[cache_key]
        # L2: Redis
        redis_val = self.redis.get(cache_key)
        if redis_val is not None:
            features = json.loads(redis_val)
            self._l1_set(cache_key, features)
            return features
        # L3: Cache miss - compute features
        return None  # caller must compute and backfill

    def set_user_features(self, user_id: str, features: dict):
        cache_key = f"user_features:{user_id}"
        serialized = json.dumps(features)
        self.redis.setex(cache_key, self.ttl, serialized)
        self._l1_set(cache_key, features)

    def _l1_set(self, key: str, value):
        # Only evict when inserting a new key into a full cache
        if key not in self.l1_cache and len(self.l1_cache) >= self.l1_max_size:
            # Simple FIFO eviction: remove the oldest inserted entry
            oldest_key = next(iter(self.l1_cache))
            del self.l1_cache[oldest_key]
        self.l1_cache[key] = value
```
KV Cache for Transformer Inference
The KV (key-value) cache is an ML-specific optimization that eliminates redundant computation in autoregressive transformer inference.
In a transformer, each attention layer computes keys and values for all previous tokens. During autoregressive generation, these are the same for all previous positions - no need to recompute them. The KV cache stores them.
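```python
import torch
import torch.nn as nn

class AttentionWithKVCache(nn.Module):
    """
    Simplified illustration of KV cache in transformer attention.
    In production (vLLM, TRT-LLM), this is managed by the serving framework.
    """
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        # [batch, seq, d_model] -> [batch, heads, seq, head_dim]
        b, s, _ = t.shape
        return t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(
        self,
        x: torch.Tensor,               # [batch, seq_len, d_model]
        kv_cache: tuple | None = None  # (keys, values) from previous steps
    ) -> tuple[torch.Tensor, tuple]:
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_proj(x))
        v = self._split_heads(self.v_proj(x))

        # PREFILL phase (first forward pass): kv_cache is None, x holds the whole prompt
        # DECODE phase (token generation): x holds only the new token(s); extend the cache
        if kv_cache is not None:
            k_prev, v_prev = kv_cache
            k = torch.cat([k_prev, k], dim=2)  # append new keys along the sequence axis
            v = torch.cat([v_prev, v], dim=2)  # append new values

        # Scaled dot-product attention per head
        scale = self.head_dim ** -0.5
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # [batch, heads, q_len, kv_len]

        # Causal mask: a query may only attend to keys at or before its own position
        q_len, kv_len = q.shape[2], k.shape[2]
        mask = torch.ones(q_len, kv_len, device=x.device).tril(diagonal=kv_len - q_len).bool()
        scores = scores.masked_fill(~mask, float("-inf"))

        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)  # [batch, heads, q_len, head_dim]
        out = out.transpose(1, 2).reshape(x.shape[0], q_len, self.d_model)
        out = self.out_proj(out)
        # Return new cache for next token
        return out, (k, v)
```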
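KV cache memory cost:
$$
\text{KV cache bytes per request} = 2 \times n_{\text{layers}} \times d_{\text{model}} \times \text{seq\_len} \times \text{bytes per element}
$$
For a 7B LLaMA model (32 layers, d_model=4096, fp16):
$$
2 \times 32 \times 4096 \times 2\ \text{bytes} = 524{,}288\ \text{bytes} \approx 0.5\ \text{MB per token}
$$
At seq_len=4,096: ~2 GB per request. Serving 40 concurrent requests = 80 GB - enough to fill an entire A100 80GB before counting the model weights. This is why GPU memory management for LLMs is a critical engineering challenge, and why vLLM's PagedAttention (Kwon et al., 2023) - which manages KV cache memory like virtual memory pages - was such a significant contribution.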
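The arithmetic behind these figures can be sketched as a small calculator. It assumes standard multi-head attention, where every layer stores one key and one value vector of width d_model per token. The function name and parameters are illustrative, and models that use grouped-query attention store fewer KV heads and need proportionally less.

```python
def kv_cache_bytes(num_layers: int, d_model: int, seq_len: int,
                   bytes_per_element: int = 2, batch_size: int = 1) -> int:
    """
    Bytes of KV cache for standard multi-head attention:
    2 tensors (K and V) per layer, each of shape [seq_len, d_model].
    bytes_per_element=2 corresponds to fp16.
    """
    return 2 * num_layers * d_model * seq_len * bytes_per_element * batch_size


GIB = 1024 ** 3

# 7B LLaMA-style configuration from the text: 32 layers, d_model=4096, fp16
per_token = kv_cache_bytes(32, 4096, seq_len=1)
per_request = kv_cache_bytes(32, 4096, seq_len=4096)
forty_requests = kv_cache_bytes(32, 4096, seq_len=4096, batch_size=40)

print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"per request: {per_request / GIB:.0f} GiB at 4,096 tokens")
print(f"40 requests: {forty_requests / GIB:.0f} GiB")
```

Plugging in a shorter context or a lower-precision cache shows how quickly the concurrency budget changes, which is the lever most KV cache optimizations pull on.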
Semantic Caching
For LLM serving, you can cache entire responses for semantically similar queries:
```python
import numpy as np

class SemanticCache:
    """
    Cache LLM responses based on semantic similarity of queries.
    Uses embedding-based similarity rather than exact match.
    """
    def __init__(self, embedding_model, similarity_threshold: float = 0.95):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.cache: list[dict] = []  # {embedding, query, response}

    def get(self, query: str) -> str | None:
        """Return cached response if a similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        for entry in self.cache:
            similarity = self._cosine_similarity(query_embedding, entry['embedding'])
            if similarity >= self.threshold:
                return entry['response']  # Cache hit!
        return None  # Cache miss

    def set(self, query: str, response: str):
        embedding = self.embedding_model.encode(query)
        self.cache.append({
            'embedding': embedding,
            'query': query,
            'response': response
        })

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```
GPTCache (2023) reported that semantic caching can reduce LLM serving costs by 30–60% for production workloads with repeated query patterns.
Async vs Sync Serving
The architecture of how requests are received and processed has a significant impact on the latency/throughput trade-off.
Synchronous serving is simpler but wasteful: while the GPU processes one request, all other arriving requests queue up and the CPU is idle. Asynchronous serving keeps the GPU busy while accepting new requests, maximizing utilization.
Production ML serving frameworks (Triton Inference Server, TorchServe, Ray Serve, vLLM) all implement asynchronous serving with dynamic batching.
Production Engineering: The Serving Decision Matrix
```python
def recommend_serving_architecture(
    avg_latency_requirement_ms: float,
    peak_qps: float,
    model_size_gb: float,
    output_variable: bool = False,  # True for generative models
) -> str:
    """
    Simplified serving architecture recommender.
    Real decisions require deeper analysis, but this gives a starting point.
    """
    recommendations = []

    # Latency tier
    if avg_latency_requirement_ms < 10:
        recommendations.append("Ultra-low latency: model quantization required (INT8/INT4)")
        recommendations.append("No batch waiting - serve immediately on arrival")
        recommendations.append("Keep model in L3 cache if possible (small models only)")
    elif avg_latency_requirement_ms < 100:
        recommendations.append("Low latency: dynamic batching with 5-10ms max wait")
        recommendations.append("FP16 inference, target GPU utilization 60-70%")
    else:
        recommendations.append("Throughput-optimized: static batching acceptable")
        recommendations.append("INT8 quantization for 2x throughput improvement")

    # Scale tier
    if peak_qps > 10_000:
        recommendations.append("Scale: need horizontal scaling with load balancing")
        recommendations.append("Use model parallelism if model_size_gb > GPU memory")
    elif peak_qps > 1_000:
        recommendations.append("Scale: multiple GPU replicas, connection pooling")

    # Variable output (generative)
    if output_variable:
        recommendations.append("LLM serving: use continuous batching (vLLM/TRT-LLM)")
        recommendations.append("PagedAttention for KV cache memory management")

    return "\n".join(f" - {r}" for r in recommendations)

print("=== RAG chatbot (200ms, 500 QPS, 7B model, generative) ===")
print(recommend_serving_architecture(200, 500, 14, True))
```
Common Mistakes
:::danger Designing at Average, Operating at Peak
Every serving system will eventually receive 5–10× its average load. If you design your GPU utilization target at 80% average, a 2× traffic spike sends you to 160% - impossible, meaning you shed traffic or time out. Design GPU utilization to never exceed 50–60% at average load, leaving headroom for peaks.
:::

:::danger Ignoring Head-of-Line Blocking
When batching requests with highly variable processing time (variable-length documents, variable response lengths), one long request can delay all others in the batch. Solutions: request hedging (send the same request to two replicas, use the faster one), time limits per request, or sequence length bucketing (batch requests of similar length together).
:::

:::warning Optimizing p50 While Ignoring p99
p50 improvements that don't improve p99 are often meaningless for user experience. A user who experiences the 99th percentile doesn't care that the median is fast. Always monitor and optimize p99 (and p999 for high-volume systems), not just the mean or median.
:::

:::warning Treating Latency and Throughput as Independent
The most common architectural mistake: someone says "optimize for low latency" and another says "we need to increase throughput," and they implement changes in opposite directions simultaneously. These are coupled trade-offs that require a coordinated strategy. Set the operating point (latency SLO) first, then maximize throughput within that constraint.
:::
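The "design for peak" advice can be turned into a simple sizing calculation. The sketch below is a first-order estimate that assumes load is spread evenly across identical replicas. The function names are illustrative, and the example reuses the opening scenario's figures (about 550 QPS of capacity, 20,000 QPS at launch), treating the tested deployment as a single replica.

```python
import math


def replicas_needed(peak_qps: float, per_replica_capacity_qps: float,
                    target_utilization: float = 0.7) -> int:
    """Replicas required so that each one stays at or below target_utilization at peak load."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_qps_per_replica = per_replica_capacity_qps * target_utilization
    return math.ceil(peak_qps / usable_qps_per_replica)


def utilization_at(load_qps: float, replicas: int, per_replica_capacity_qps: float) -> float:
    """Average utilization across replicas for a given offered load."""
    return load_qps / (replicas * per_replica_capacity_qps)


replicas = replicas_needed(peak_qps=20_000, per_replica_capacity_qps=550)
print(f"replicas for launch peak: {replicas}")
print(f"utilization at peak:      {utilization_at(20_000, replicas, 550):.0%}")
print(f"utilization at 500 QPS:   {utilization_at(500, replicas, 550):.1%}")
```

Sizing for the launch peak at a 70% cap calls for 52 such replicas, which sit nearly idle at the 500 QPS test load. That idle capacity is the cost of headroom, and it is why autoscaling and load shedding usually accompany a utilization target.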
Interview Q&A
Q1: Explain Little's Law and how it applies to ML serving infrastructure.
Little's Law states $L = \lambda W$: the average number of requests in a system equals the arrival rate times the average time spent in the system. For ML serving, if you know your system handles 500 requests per second with a mean latency of 40ms, there are on average $L = 500 \times 0.040 = 20$ requests in flight at any time. If your GPU batch size limit is 32, you have headroom. If it's 16, you're over capacity and latency will explode.
The practical implication: as utilization approaches 100%, queueing theory (M/M/1 queue model) predicts latency grows as $1/(1-\rho)$, where $\rho$ is utilization. At 90% utilization, latency is 10× the service time. This is why production systems must maintain substantial utilization headroom - not because GPU time is cheap, but because the tail latency behavior at high utilization is catastrophically non-linear.
Q2: What is the difference between static batching, dynamic batching, and continuous batching? When would you use each?
Static batching: wait for exactly N requests before processing. Simple to implement. Problem: if arrivals are irregular, you either wait too long (high latency) or process with many empty slots (wasted GPU compute).
Dynamic batching: process whatever requests arrive within a time window T, up to maximum batch size N. Better latency/throughput balance. Triton Inference Server and most production frameworks implement this. Good for fixed-output models (classification, embedding, ranking).
Continuous batching (also called iteration-level scheduling): specific to autoregressive LLM inference. Rather than waiting for an entire batch to finish, new requests are added to the batch at each token generation step, replacing completed requests. Developed in the Orca paper (Yu et al., 2022) and popularized by vLLM. Achieves 2–23× higher throughput than static batching for variable-length LLM outputs. Use whenever serving autoregressive language models.
Q3: What is the KV cache in transformer serving, and why does it matter for production?
In autoregressive transformer inference, generating each new token requires attention over all previous tokens. Without caching, this would require recomputing the attention keys and values for all prior tokens at every step - $O(n^2)$ compute per sequence of $n$ tokens.
The KV cache stores the keys and values computed for all previous tokens in GPU memory, so each new token only needs to compute its own K, V, and then attend over the cached K, V pairs. This reduces per-token attention computation from $O(n^2)$ to $O(n)$.
The production challenge: KV cache memory grows with sequence length and concurrent requests. For a 7B model (32 layers, d_model=4096, fp16), each token in the KV cache occupies $2 \times 32 \times 4096 \times 2 = 524{,}288$ bytes (about 0.5 MB). A 4,096-token context takes ~2 GB per request. Serving 40 concurrent users requires 80 GB just for KV cache. This is why vLLM's PagedAttention - managing KV cache in non-contiguous pages rather than one large contiguous allocation - was transformative for LLM serving throughput (2–4× improvement in token throughput).
Q4: How would you debug a production ML serving system where p99 latency has suddenly increased from 50ms to 500ms?
Systematic diagnosis from top to bottom:
First check infrastructure: Is this one server or all servers? If one server, it may be a noisy neighbor (CPU/memory contention), overheating GPU, or network issues. Check GPU utilization, CPU utilization, memory pressure, and network errors on the affected host.
Second check traffic patterns: Has QPS increased? Plot QPS vs. p99 latency over the same time window. If they moved together, it is a capacity problem - the system hit the non-linear utilization cliff. Check current utilization vs. capacity using Little's Law.
Third check data characteristics: Has request size (input token length, image resolution, payload size) changed? Longer inputs can increase model inference time by 2–5×. Plot input size distribution before and after the degradation.
Fourth check model and feature pipeline: Was a new model deployed recently? Check model version rollout timing against latency degradation. Is the feature store responding slowly? Feature fetch latency directly adds to serving latency.
Fifth check GC and JVM: For Java-based serving (TensorFlow Serving is C++, but some infrastructure is Java), GC pauses cause periodic latency spikes. JVM heap size and GC policy matter.
Q5: What is semantic caching for LLMs and when is it worth implementing?
Semantic caching stores LLM responses keyed by embedding vectors of the queries rather than exact text matches. When a new query arrives, you embed it and check for cached responses with cosine similarity above a threshold (typically 0.92–0.98).
Worth implementing when: (1) your workload has significant query repetition - customer support chatbots, FAQ bots, search augmentation - where 20–40% of queries are semantically similar to prior queries; (2) LLM inference costs are high (GPT-4 level models at scale); (3) response latency is a concern and you have a Redis or vector DB already deployed for features.
Not worth implementing when: (1) most queries are unique (creative writing, personalized recommendations); (2) response freshness matters (the system must incorporate new information); (3) the additional embedding call latency exceeds the inference cost savings; (4) the caching threshold is wrong - too low caches incorrect responses (semantic drift), too high achieves near-zero hit rates.
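The trade-off in point (3) can be made concrete with a break-even estimate. The sketch below assumes every query pays a fixed lookup cost (embedding plus similarity search) and only cache misses pay for the LLM call. The function names and the millisecond figures are hypothetical, and the same arithmetic applies if the units are dollars instead of milliseconds.

```python
def expected_latency_ms(hit_rate: float, lookup_ms: float, llm_ms: float) -> float:
    """Every query pays the lookup; only the (1 - hit_rate) fraction of misses pays the LLM call."""
    return lookup_ms + (1.0 - hit_rate) * llm_ms


def break_even_hit_rate(lookup_ms: float, llm_ms: float) -> float:
    """Hit rate above which the cache lowers average latency: lookup + (1 - h) * llm < llm."""
    return lookup_ms / llm_ms


# Hypothetical numbers: 1,000 ms LLM call, fast (20 ms) versus slow (100 ms) lookup
for lookup_ms in [20.0, 100.0]:
    h_star = break_even_hit_rate(lookup_ms, llm_ms=1000.0)
    at_30 = expected_latency_ms(0.30, lookup_ms, llm_ms=1000.0)
    print(f"lookup {lookup_ms:>5.0f} ms: break-even hit rate {h_star:.0%}, "
          f"mean latency at 30% hits {at_30:.0f} ms")
```

The slower the lookup relative to the model call, the higher the hit rate needed before the cache pays for itself, which is why small, fast embedding models are preferred for this job.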
GPTCache (2023) reported 30–60% cost reduction on production LLM workloads with appropriate thresholds. The embedding model used for cache lookup should be fast (an embedding call that takes around 100ms would negate much of the savings) - models like text-embedding-3-small or E5-small are good choices.
Summary
Latency and throughput are coupled through queueing theory. Little's Law governs the relationship between arrival rate, in-flight request count, and latency. As utilization approaches 100%, latency diverges - which is why the team in the opening scenario failed.
The tools for managing this trade-off: dynamic and continuous batching (trade latency for throughput), caching at multiple levels (feature cache, KV cache, semantic cache), and capacity planning with explicit utilization targets.
The operating principle: set your latency SLO first, then maximize throughput within that constraint. Never run production ML serving above 70% average utilization. Monitor p99 and p999, not just p50. And always design for peak, not average.
:::tip Key Takeaway
The latency/throughput trade-off is not a dial to tune - it is a constraint curve defined by queueing theory. You cannot have both minimal latency and maximum throughput simultaneously. The goal of production ML serving engineering is to find the operating point on this curve that meets your latency SLO while maximizing throughput, and to have enough capacity headroom that traffic spikes don't send you into the non-linear latency explosion zone.
:::
