Low-Latency Inference Patterns

The Production Scenario

Google Search has a hard constraint that most engineers outside the company do not appreciate: the entire search pipeline - query understanding, document retrieval, ranking, snippet generation, ad auction - must complete in under 200ms total. The ML ranking model that scores hundreds of documents against the user's query is allocated roughly 50ms of that budget. At 99,000 queries per second (Google's approximate rate), that is roughly 8.5 billion queries per day (99,000 x 86,400 seconds), each of which needs hundreds of documents scored in under 50ms.

Meta's News Feed ranking has a tighter constraint. The feed must appear responsive as a user scrolls - that means new content must be ranked before the user reaches it, typically within 30ms of the scroll event triggering a load. The ranking model operates on a batch of 500 candidate posts with tens of features each, and it must score and rank all 500 within that 30ms window.

These are not exceptional cases. Ad systems, fraud detection, recommendation feeds, and autocomplete all operate at latency requirements that feel physically impossible when you first encounter them - but are achieved routinely in production using the specific engineering patterns this lesson covers.

The patterns are a hierarchy: start with what your hardware can do (the ceiling), optimize the model for that hardware (remove waste), optimize the data path (remove serialization and copying), and finally, pre-compute what you can so that inference at serving time is just a lookup.

Why This Exists - The Latency Ceiling of Naive Deployment

A PyTorch model in eager execution mode with FP32 weights, served through FastAPI with JSON serialization, is as far from the latency floor as you can get. Every layer of that stack adds overhead:

JSON serialization: 0.5-2ms for a moderately-sized feature vector
Python interpreter overhead: 0.1-0.5ms per request (GIL, object allocation)
FP32 computation: 2-8x slower than INT8 on modern hardware
Sequential execution: One operation at a time, not fused
CUDA kernel launch overhead: roughly 0.01-0.05ms (10-50 microseconds) per GPU kernel (accumulates with many small ops)

A carefully optimized stack removes each of these overheads. The result can be 10-20x faster than the naive implementation - which is the difference between a 40ms latency and a 2ms latency for the same model and the same hardware.
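To make the serialization line item concrete, the sketch below compares a JSON round trip with a packed float32 round trip for one feature vector, using only the standard library. The 512-value vector and the function names are assumptions for illustration, not part of any serving framework. Timings depend on the machine, so the snippet prints them instead of claiming specific numbers.

```python
import json
import struct
import time


def encode_json(features: list[float]) -> bytes:
    return json.dumps({"features": features}).encode("utf-8")


def decode_json(payload: bytes) -> list[float]:
    return json.loads(payload.decode("utf-8"))["features"]


def encode_binary(features: list[float]) -> bytes:
    # Little-endian float32: 4 bytes per value, no field names, no text parsing
    return struct.pack(f"<{len(features)}f", *features)


def decode_binary(payload: bytes) -> list[float]:
    return list(struct.unpack(f"<{len(payload) // 4}f", payload))


def time_roundtrip_ms(encode, decode, features, n_runs: int = 1000) -> float:
    """Average encode + decode time per request in milliseconds."""
    start = time.perf_counter()
    for _ in range(n_runs):
        decode(encode(features))
    return (time.perf_counter() - start) / n_runs * 1000


# Hypothetical feature vector; multiples of 1/8 are exactly representable in float32
features = [i / 8 for i in range(512)]

json_ms = time_roundtrip_ms(encode_json, decode_json, features)
binary_ms = time_roundtrip_ms(encode_binary, decode_binary, features)

print(f"JSON:   {len(encode_json(features))} bytes, {json_ms:.4f} ms per round trip")
print(f"Binary: {len(encode_binary(features))} bytes, {binary_ms:.4f} ms per round trip")
```

A production system would more likely use Protobuf or a similar schema-based format than raw `struct` packing. The point here is only that a fixed binary layout avoids text formatting and parsing on the request path.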

Historical Context

Low-latency ML inference has been a focus at internet companies since at least 2012, when deep learning began outperforming traditional models in production. NVIDIA's TensorRT (2016) was the first widely available tool to automate inference optimization for GPU - it applies operator fusion, precision calibration, and kernel auto-tuning to minimize inference latency. The concept of quantization for inference existed in the signal processing literature for decades but was systematically applied to neural networks by Jacob et al. (Google Brain) in their quantization-aware training paper (2018).

The modern inference optimization stack - quantization to INT8 or INT4, kernel compilation with torch.compile or TensorRT, operator fusion, continuous batching - represents the cumulative learning of a decade of production inference engineering at Google, Meta, NVIDIA, and the major cloud providers.

The Latency Stack

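Understanding latency requires knowing where time is actually spent along the request path: request deserialization, Python runtime overhead, host-to-device transfer, kernel launches, the model computation itself, and response serialization.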

Model computation is the largest single component and the most controllable through optimization. Every other component is overhead around the model.
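One way to see where time goes is to wrap each stage of a request in a timer and sort the totals. The sketch below is a minimal standard-library version. The `StageTimer` class and the stage names are hypothetical, and the "model" is a stand-in sum, not a real forward pass.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of a request."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised
            self.totals_ms[name] += (time.perf_counter() - start) * 1000

    def report(self) -> list[tuple[str, float]]:
        # Largest stage first, so the bottleneck is at the top
        return sorted(self.totals_ms.items(), key=lambda kv: kv[1], reverse=True)


timer = StageTimer()

with timer.stage("deserialize"):
    features = [float(i) for i in range(512)]

with timer.stage("model"):
    score = sum(f * 0.001 for f in features)  # stand-in for the real forward pass

with timer.stage("serialize"):
    response = f"{score:.4f}"

for name, ms in timer.report():
    print(f"{name:12s} {ms:.4f} ms")
```

In a real service the same per-stage totals would usually be exported as histogram metrics, not printed, so that tail latency per stage is visible.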

Hardware Choices for Low Latency

The hardware you choose sets the ceiling on achievable latency:

Hardware	Best Latency (small model)	Best for
CPU (modern server)	0.5-5ms	Very small models, high request rate without batching
GPU A10G	1-10ms	Medium models with batching
GPU H100	0.2-2ms	Large models, highest throughput
FPGA (Xilinx, Intel)	0.01-0.5ms	Ultra-low latency, fixed model, specialized
Apple Neural Engine	0.5-3ms	On-device iOS inference

CPU vs GPU trade-off for latency: For a single request (batch size 1), modern CPUs can often match or beat GPUs, because per-kernel launch overhead (roughly 0.01-0.05ms per kernel) and the host-to-device copy dominate when the model is small. GPUs win when you can batch many requests together. For p99 latency requirements under 5ms with no batching, a CPU implementation with ONNX Runtime may be faster than a naive GPU implementation.

Model Optimization for Latency

Quantization

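Quantization reduces the precision of model weights (and, in static schemes, activations) from FP32 (32-bit) to INT8 (8-bit) or INT4 (4-bit). This provides two benefits: a 4x (INT8) to 8x (INT4) reduction in weight size, so more of the model fits in cache and memory, and typically a 2-4x speedup in computation, because INT8 SIMD and tensor core instructions process more values per cycle than FP32.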
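The arithmetic behind INT8 quantization can be shown without any framework. The sketch below uses symmetric per-tensor quantization: one scale for the whole tensor, with codes clamped to [-127, 127]. This is one common scheme chosen for illustration. The function names and the sample weights are assumptions, and real toolkits also support per-channel scales and zero points.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    return [c * scale for c in codes]


# Hypothetical weights for illustration
weights = [0.82, -0.31, 0.05, -1.27, 0.64]

codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights) + 4  # one byte per code plus one float32 scale

print(f"codes:      {codes}")
print(f"scale:      {scale:.6f}")
print(f"max error:  {max_error:.6f} (bounded by scale / 2)")
print(f"storage:    {fp32_bytes} bytes FP32 vs {int8_bytes} bytes INT8")
```

The rounding error per weight is at most half the scale, so a tensor with a few large outliers gets a coarse scale and loses resolution on the small values. That is why calibration and per-channel scales matter in practice.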

# quantization_for_latency.py
import torch
import torch.nn as nn
from torch.quantization import quantize_dynamic


class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


def apply_dynamic_quantization(model: nn.Module) -> nn.Module:
    """
    Dynamic quantization: weights are quantized to INT8 ahead of time.
    Activations stay in FP32 between layers and are quantized on the fly
    at runtime, so no calibration dataset is needed. Simplest approach.
    Best for: LSTM, Linear-heavy models, CPU serving.
    """
    quantized = quantize_dynamic(
        model,
        qconfig_spec={nn.Linear},  # Quantize Linear layers
        dtype=torch.qint8,
    )
    return quantized


def apply_static_quantization(
    model: nn.Module,
    calibration_data: torch.Tensor,
) -> nn.Module:
    """
    Static quantization: both weights and activations quantized to INT8.
    Requires calibration dataset to measure activation ranges.
    Best for: CNNs, MLPs, CPU serving. Higher speedup than dynamic.
    """
    model.eval()

    # Eager-mode static quantization needs QuantStub/DeQuantStub at the model
    # boundaries. QuantWrapper adds them, so float inputs are quantized on
    # entry and outputs are dequantized on exit.
    wrapped = torch.quantization.QuantWrapper(model)
    wrapped.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # For x86 CPU

    # Prepare: insert observers to measure activation ranges
    prepared = torch.quantization.prepare(wrapped)

    # Calibrate: run representative data through the model
    with torch.no_grad():
        prepared(calibration_data)

    # Convert: replace FP32 operations with INT8 equivalents
    quantized = torch.quantization.convert(prepared)

    return quantized


def benchmark_quantization(
    model_fp32: nn.Module,
    model_int8: nn.Module,
    batch_size: int = 32,
    feature_dim: int = 512,
    n_runs: int = 1000,
):
    """Compare latency: FP32 vs INT8 on CPU."""
    import time

    x = torch.randn(batch_size, feature_dim)

    # FP32 benchmark
    model_fp32.eval()
    with torch.no_grad():
        for _ in range(10):  # warmup
            model_fp32(x)

    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_runs):
            model_fp32(x)
    fp32_ms = (time.perf_counter() - start) / n_runs * 1000

    # INT8 benchmark
    model_int8.eval()
    with torch.no_grad():
        for _ in range(10):
            model_int8(x)

    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(n_runs):
            model_int8(x)
    int8_ms = (time.perf_counter() - start) / n_runs * 1000

    print(f"FP32: {fp32_ms:.2f}ms per batch ({batch_size} examples)")
    print(f"INT8: {int8_ms:.2f}ms per batch ({batch_size} examples)")
    print(f"Speedup: {fp32_ms / int8_ms:.1f}x")

TorchScript and torch.compile

# model_compilation.py
import torch
import time


def compile_model_for_latency(
    model: torch.nn.Module,
    example_input: torch.Tensor,
    method: str = "torchscript",
) -> torch.nn.Module:
    """
    Compile a model for lower inference latency.

    Methods:
    - "torchscript": JIT compilation, works everywhere, good for deployment
    - "compile": torch.compile with Inductor backend, best for CUDA
    - "onnx": Export to ONNX for cross-platform deployment
    """
    model.eval()

    if method == "torchscript":
        # Trace the model with an example input
        # Best for models with fixed control flow
        with torch.no_grad():
            scripted = torch.jit.trace(model, example_input)
            # Optimize the scripted model
            scripted = torch.jit.optimize_for_inference(scripted)
        return scripted

    elif method == "compile":
        # torch.compile with Triton/Inductor backend (PyTorch 2.0+)
        # Fuses operators, generates optimized CUDA kernels
        compiled = torch.compile(
            model,
            backend="inductor",
            mode="reduce-overhead",  # Minimize CUDA kernel launch overhead
        )
        # Warmup to trigger compilation
        with torch.no_grad():
            for _ in range(5):
                compiled(example_input)
        return compiled

    elif method == "onnx":
        import onnx
        import onnxruntime as ort

        onnx_path = "/tmp/model_optimized.onnx"
        torch.onnx.export(
            model,
            example_input,
            onnx_path,
            input_names=["input"],
            output_names=["output"],
            dynamic_axes={"input": {0: "batch_size"}},
            opset_version=17,
            do_constant_folding=True,
        )

        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        session = ort.InferenceSession(
            onnx_path,
            sess_options=sess_options,
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
        return session
    else:
        raise ValueError(f"Unknown method: {method}")

Batching Strategies for Latency

The tension in batching: larger batches increase throughput (GPU utilization) but increase p99 latency (requests wait for batch formation). For latency-critical workloads, use adaptive batching:

# adaptive_batcher.py
import asyncio
import time
import numpy as np
from dataclasses import dataclass, field
from typing import Optional
import torch


@dataclass
class InferenceRequest:
    request_id: str
    features: np.ndarray
    future: asyncio.Future
    arrival_time: float = field(default_factory=time.perf_counter)


class AdaptiveBatcher:
    """
    Dynamic batcher that adapts batch formation timeout based on current load.

    At low load: short timeout to maintain low latency
    At high load: longer timeout to form larger batches (better GPU utilization)
    """

    def __init__(
        self,
        model: torch.nn.Module,
        max_batch_size: int = 64,
        min_wait_ms: float = 1.0,    # Minimum: 1ms wait at low load
        max_wait_ms: float = 20.0,   # Maximum: 20ms wait at high load
        load_threshold_rps: float = 100.0,
    ):
        self.model = model
        self.max_batch_size = max_batch_size
        self.min_wait = min_wait_ms / 1000
        self.max_wait = max_wait_ms / 1000
        self.load_threshold = load_threshold_rps
        self.pending: list[InferenceRequest] = []
        self.lock = asyncio.Lock()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device).eval()

        # Load tracking
        self.request_times: list[float] = []
        self.window_seconds = 5.0

        # Background flush task. Started lazily on the first request because
        # it needs a running event loop.
        self._flush_task: Optional[asyncio.Task] = None

    def _estimate_rps(self) -> float:
        """Estimate requests per second from recent request history."""
        now = time.perf_counter()
        self.request_times = [t for t in self.request_times if now - t < self.window_seconds]
        return len(self.request_times) / self.window_seconds

    def _adaptive_wait(self) -> float:
        """Compute wait time based on current load."""
        rps = self._estimate_rps()
        load_fraction = min(1.0, rps / self.load_threshold)
        return self.min_wait + (self.max_wait - self.min_wait) * load_fraction

    async def predict(self, features: np.ndarray, request_id: str) -> np.ndarray:
        """Submit a request for batched inference."""
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        req = InferenceRequest(
            request_id=request_id,
            features=features,
            future=future,
        )

        if self._flush_task is None:
            # Without the flush loop, a partial batch would wait forever at low load
            self._flush_task = asyncio.create_task(self._flush_loop())

        async with self.lock:
            self.request_times.append(time.perf_counter())
            self.pending.append(req)

            if len(self.pending) >= self.max_batch_size:
                asyncio.create_task(self._process_batch())

        return await future

    async def _flush_loop(self):
        """Background loop that flushes on timeout."""
        while True:
            wait = self._adaptive_wait()
            await asyncio.sleep(wait)
            async with self.lock:
                if self.pending:
                    oldest_wait = time.perf_counter() - self.pending[0].arrival_time
                    if oldest_wait >= self.min_wait:
                        asyncio.create_task(self._process_batch())

    async def _process_batch(self):
        """Process all pending requests as a single GPU batch."""
        async with self.lock:
            if not self.pending:
                return
            batch = self.pending[:self.max_batch_size]
            self.pending = self.pending[self.max_batch_size:]

        features_np = np.stack([r.features for r in batch])
        batch_latency_start = time.perf_counter()

        try:
            with torch.no_grad():
                x = torch.from_numpy(features_np).float().to(self.device)
                output = self.model(x)
                results = output.cpu().numpy()

            batch_latency_ms = (time.perf_counter() - batch_latency_start) * 1000

            for i, req in enumerate(batch):
                queue_wait_ms = (batch_latency_start - req.arrival_time) * 1000
                if not req.future.done():
                    req.future.set_result(results[i])
        except Exception as e:
            for req in batch:
                if not req.future.done():
                    req.future.set_exception(e)

Pre-Computation and Caching

The most powerful latency optimization: do not run the model at all. Pre-compute predictions and serve from cache.

# prediction_cache.py
import hashlib
import numpy as np
import redis
import time
from typing import Optional


class PredictionCache:
    """
    Two-level prediction cache for ML inference.

    L1: In-process LRU (microseconds)
    L2: Redis (1-2ms)

    Appropriate when: many requests share the same input
    (same user, same item, same query). Surprisingly common in
    search, recommendation, and fraud detection.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        l1_size: int = 10_000,
        l2_ttl_seconds: int = 30,
    ):
        self.redis = redis_client
        self.l2_ttl = l2_ttl_seconds
        # L1: plain dict with FIFO eviction (use a real LRU, e.g. built on collections.OrderedDict, in production code)
        self.l1_cache: dict = {}
        self.l1_size = l1_size

    @staticmethod
    def _cache_key(features: np.ndarray) -> str:
        """Stable cache key from feature vector."""
        return hashlib.sha256(features.astype(np.float32).tobytes()).hexdigest()[:16]

    def get(self, features: np.ndarray) -> Optional[np.ndarray]:
        """Check L1 then L2 cache."""
        key = self._cache_key(features)

        # L1 check - microseconds
        if key in self.l1_cache:
            value, expiry = self.l1_cache[key]
            if time.time() < expiry:
                return value
            del self.l1_cache[key]

        # L2 check - 1-2ms Redis
        raw = self.redis.get(f"pred:{key}")
        if raw is not None:
            # frombuffer returns a read-only 1-D view; copy so callers can modify it
            result = np.frombuffer(raw, dtype=np.float32).copy()
            # Promote to L1
            self._set_l1(key, result)
            return result

        return None

    def set(self, features: np.ndarray, prediction: np.ndarray):
        """Store in both L1 and L2."""
        key = self._cache_key(features)
        self._set_l1(key, prediction)
        self.redis.setex(
            f"pred:{key}",
            self.l2_ttl,
            prediction.astype(np.float32).tobytes(),
        )

    def _set_l1(self, key: str, value: np.ndarray, ttl_seconds: float = 5.0):
        if len(self.l1_cache) >= self.l1_size:
            # Evict the oldest-inserted entry (FIFO: dicts preserve insertion order)
            evict_key = next(iter(self.l1_cache))
            del self.l1_cache[evict_key]
        self.l1_cache[key] = (value, time.time() + ttl_seconds)
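The L1 eviction above removes the oldest-inserted key, which is FIFO, not LRU. A true LRU with per-entry TTL can be sketched with `collections.OrderedDict`. The class name `LRUCacheWithTTL` and its parameters are hypothetical, and the sketch is single-threaded: a multi-threaded server would need a lock around `get` and `set`.

```python
import time
from collections import OrderedDict
from typing import Any, Optional


class LRUCacheWithTTL:
    """In-process cache: least-recently-used eviction plus per-entry expiry."""

    def __init__(self, max_size: int, ttl_seconds: float):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._entries = OrderedDict()  # key -> (value, expiry timestamp)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.monotonic() >= expiry:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (value, time.monotonic() + self.ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used


cache = LRUCacheWithTTL(max_size=2, ttl_seconds=5.0)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # touch "a", so "b" becomes the least recently used
cache.set("c", 3)  # over capacity: "b" is evicted, "a" survives

print(sorted(k for k in ("a", "b", "c") if cache.get(k) is not None))
```

`time.monotonic()` is used for expiry so that wall-clock adjustments cannot revive or prematurely expire entries.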

Memory Layout Optimization

GPU performance is heavily influenced by memory access patterns. Tensors with a poor layout cause non-coalesced memory accesses or hidden copies, which can cut throughput severalfold and, in bad cases, by 10-30x:

# memory_layout.py
import torch
import time


def benchmark_memory_layout(batch_size: int = 64, seq_len: int = 128, hidden: int = 768):
    """
    Show the latency difference between optimal and suboptimal tensor layouts.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(hidden, hidden).to(device)

    # Suboptimal: non-contiguous tensor from a slice or transpose
    base = torch.randn(seq_len, batch_size, hidden, device=device)
    non_contiguous = base.transpose(0, 1)  # [batch, seq, hidden] - but non-contiguous
    assert not non_contiguous.is_contiguous()

    # Optimal: contiguous tensor in the right layout
    contiguous = non_contiguous.contiguous()  # Makes a contiguous copy

    def sync():
        # Only synchronize when actually running on a GPU. Calling
        # torch.cuda.synchronize() on a CPU-only machine raises an error.
        if device.type == "cuda":
            torch.cuda.synchronize()

    n = 200
    with torch.no_grad():
        # Warmup
        for _ in range(10):
            model(contiguous[:, 0, :])

        # Non-contiguous benchmark: reshape() cannot return a view here,
        # so every iteration pays for a hidden copy before the matmul
        sync()
        t0 = time.perf_counter()
        for _ in range(n):
            x = non_contiguous.reshape(batch_size * seq_len, hidden)
            model(x)
        sync()
        non_contig_ms = (time.perf_counter() - t0) / n * 1000

        # Contiguous benchmark: reshape() is a free view
        sync()
        t0 = time.perf_counter()
        for _ in range(n):
            x = contiguous.reshape(batch_size * seq_len, hidden)
            model(x)
        sync()
        contig_ms = (time.perf_counter() - t0) / n * 1000

    print(f"Non-contiguous: {non_contig_ms:.2f}ms")
    print(f"Contiguous:     {contig_ms:.2f}ms")
    print(f"Speedup: {non_contig_ms / contig_ms:.1f}x")
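What "non-contiguous" means can be shown with plain stride arithmetic. The sketch below is a simplified model of how a tensor maps indices to memory offsets. The helper names are hypothetical, and the contiguity check is stricter than PyTorch's own, which ignores dimensions of size 1.

```python
def row_major_strides(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Element strides of a contiguous (row-major) tensor with this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose(shape, strides, d0: int, d1: int):
    """Transposing swaps shape and stride entries. No data moves."""
    shape, strides = list(shape), list(strides)
    shape[d0], shape[d1] = shape[d1], shape[d0]
    strides[d0], strides[d1] = strides[d1], strides[d0]
    return tuple(shape), tuple(strides)


def is_contiguous(shape, strides) -> bool:
    # Simplified check: strides must equal the row-major strides for the shape
    return tuple(strides) == row_major_strides(tuple(shape))


base_shape = (128, 64, 768)                    # [seq, batch, hidden]
base_strides = row_major_strides(base_shape)
t_shape, t_strides = transpose(base_shape, base_strides, 0, 1)

print(f"base:       shape={base_shape}, strides={base_strides}")
print(f"transposed: shape={t_shape}, strides={t_strides}")
print(f"transposed is contiguous: {is_contiguous(t_shape, t_strides)}")
```

After the transpose, stepping along the second dimension jumps 49,152 elements in memory instead of 768, so neighbouring logical rows are no longer neighbours in memory. Flattening such a tensor to two dimensions requires a copy, which is the cost the benchmark above measures.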

CUDA Pinned Memory and Async Transfers

CPU-to-GPU memory transfer is a significant latency source. Pinned memory (page-locked) enables asynchronous transfers that overlap with GPU computation:

# cuda_pinned_memory.py
import torch
import numpy as np
import time


class PinnedMemoryBuffer:
    """
    Pre-allocated pinned (page-locked) buffer for asynchronous CPU-GPU transfers.
    The data is still copied to the GPU, but via DMA without an extra staging
    copy, and no memory is allocated on the critical path.
    """

    def __init__(self, max_batch_size: int, feature_dim: int):
        # Pre-allocate pinned memory once at startup
        self.pinned_buffer = torch.zeros(
            max_batch_size,
            feature_dim,
            dtype=torch.float32,
            pin_memory=True,  # Lock in RAM - enables async DMA transfer
        )
        self.device = torch.device("cuda")

    def copy_to_gpu_async(
        self,
        features: np.ndarray,
        stream: torch.cuda.Stream,
    ) -> torch.Tensor:
        """
        Copy numpy features to GPU using async DMA transfer.
        The CPU can continue doing other work while the transfer happens.
        """
        batch_size = features.shape[0]

        # Copy numpy array into pre-allocated pinned buffer (CPU-CPU copy)
        self.pinned_buffer[:batch_size].numpy()[:] = features

        # Async transfer: DMA copies while CPU continues
        with torch.cuda.stream(stream):
            gpu_tensor = self.pinned_buffer[:batch_size].to(
                self.device,
                non_blocking=True,  # Async copy
            )

        return gpu_tensor


def benchmark_transfer_strategies(batch_size: int = 32, feature_dim: int = 512):
    if not torch.cuda.is_available():
        print("CUDA not available, skipping benchmark")
        return

    device = torch.device("cuda")
    features = np.random.randn(batch_size, feature_dim).astype(np.float32)
    n = 500

    # Strategy 1: Standard transfer (allocates new memory each time)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        x = torch.from_numpy(features).to(device)
        torch.cuda.synchronize()
    standard_ms = (time.perf_counter() - t0) / n * 1000

    # Strategy 2: Pinned memory transfer
    buffer = PinnedMemoryBuffer(batch_size, feature_dim)
    stream = torch.cuda.Stream()

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        x = buffer.copy_to_gpu_async(features, stream)
        torch.cuda.current_stream().wait_stream(stream)
        torch.cuda.synchronize()
    pinned_ms = (time.perf_counter() - t0) / n * 1000

    print(f"Standard transfer: {standard_ms:.3f}ms")
    print(f"Pinned memory: {pinned_ms:.3f}ms")
    print(f"Speedup: {standard_ms / pinned_ms:.1f}x")

Real Production Targets

System	Latency Requirement	Technique
Google Search ranking	p99 under 50ms	Distilled models, TensorFlow XLA compilation
Meta News Feed	p99 under 30ms	INT8 quantization, FBGEMM, custom CUDA kernels
Meta Ads (Advantage+ ranking)	p99 under 10ms	INT8, model distillation, pre-computed embeddings
Stripe fraud detection	p99 under 100ms	CPU inference, no batching, XGBoost
TikTok recommendation	p99 under 30ms	GPU with dynamic batching, model parallelism
Apple Face ID	under 1ms	Apple Neural Engine, CoreML, on-device INT8

Production Engineering Notes

Profile before optimizing: Use PyTorch's built-in profiler to identify where time is actually spent before applying any optimization. Engineers consistently misjudge where the bottleneck is.

import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
) as prof:
    with torch.no_grad():
        model(features)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

CUDA kernel launch overhead compounds: Each PyTorch operation that runs on GPU has a ~10-50 microsecond kernel launch overhead. A model with 100 small operations can therefore spend 1-5ms in kernel launches alone. Fusion (combining multiple operations into one kernel) removes most of this overhead. torch.compile does this automatically for most models.
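The launch-overhead estimate is simple multiplication, sketched below. The per-launch costs of 10 and 50 microseconds come from the range quoted in the text. The kernel count of 20 after fusion is a hypothetical figure for illustration, not a measured result.

```python
def launch_overhead_ms(n_kernels: int, launch_us: float) -> float:
    """Total launch overhead in ms for n_kernels sequential kernel launches."""
    return n_kernels * launch_us / 1000


# 100 unfused ops vs a hypothetical 20 kernels after fusion
for label, n_kernels in (("unfused", 100), ("fused (assumed)", 20)):
    low = launch_overhead_ms(n_kernels, launch_us=10)
    high = launch_overhead_ms(n_kernels, launch_us=50)
    print(f"{label:16s} {n_kernels:4d} kernels: {low:.1f}-{high:.1f} ms launch overhead")
```

The overhead scales with the number of kernels, not with how much work each kernel does, which is why small models with many tiny ops are hit hardest.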

Latency vs throughput is a trade-off you must make explicitly: The settings that minimize p99 latency (small max_batch_size, short wait time) are opposite to the settings that maximize throughput (large batch, long wait). For latency-critical systems, accept lower GPU utilization to maintain latency targets. Define your SLA first, then optimize for throughput within that constraint.
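The trade-off can be made visible with a small simulation of batch formation. The sketch below is a deliberately simplified model. A batch is flushed when it is full or when its oldest request has waited `max_wait_ms`, each batch takes a fixed `service_ms`, and requests that arrive while the server is busy do not join the in-flight batch. The function name, the arrival pattern, and the service times are all assumptions for illustration.

```python
def simulate_batching(
    arrivals_ms: list[float],
    max_batch: int,
    max_wait_ms: float,
    service_ms: float,
) -> list[float]:
    """Return per-request latency (queue wait + batch compute) in ms."""
    latencies = []
    server_free_at = 0.0
    i = 0
    while i < len(arrivals_ms):
        deadline = arrivals_ms[i] + max_wait_ms
        # Collect requests that arrive before the oldest request's deadline
        j = i
        while j < len(arrivals_ms) and j - i < max_batch and arrivals_ms[j] <= deadline:
            j += 1
        batch = arrivals_ms[i:j]
        # Flush immediately when full, otherwise at the deadline
        flush_at = batch[-1] if len(batch) == max_batch else deadline
        start = max(flush_at, server_free_at)
        done = start + service_ms
        server_free_at = done
        latencies.extend(done - t for t in batch)
        i = j
    return latencies


# Hypothetical load: one request per millisecond
arrivals = [float(k) for k in range(64)]

# Latency-oriented policy: small batches, short wait
small = simulate_batching(arrivals, max_batch=4, max_wait_ms=2.0, service_ms=2.0)
# Throughput-oriented policy: large batches, long wait
large = simulate_batching(arrivals, max_batch=32, max_wait_ms=20.0, service_ms=4.0)

print(f"small batches: worst latency {max(small):.1f} ms")
print(f"large batches: worst latency {max(large):.1f} ms")
```

With these inputs the worst-case latency is dominated by the wait setting, not by compute. That is the reason to fix the maximum wait from the SLA first and only then tune batch size.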

:::warning The p99 Trap - Averages Hide Tail Latency
Most optimization efforts focus on average latency. But SLAs are typically defined in terms of p99 or p99.9. A model that runs in 5ms average but 200ms p99 (due to occasional GPU kernel recompilation, GC pauses, or network jitter) fails its SLA. Measure and optimize the tail. Use histogram metrics (Prometheus histogram) not averages.
:::
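A short calculation shows how an average hides the tail. The latency samples below are hypothetical: 980 requests at 5ms and 20 slow requests at 200ms. The `percentile` helper uses the nearest-rank definition, which is one of several common percentile definitions.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies: 98% fast requests, 2% slow ones (e.g. GC pauses)
latencies_ms = [5.0] * 980 + [200.0] * 20

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean: {mean_ms:.1f} ms")
print(f"p50:  {p50_ms:.1f} ms")
print(f"p99:  {p99_ms:.1f} ms")
```

The mean and median both look healthy while one request in fifty takes 200ms, which is why dashboards and alerts should be built on percentiles from histograms, not on averages.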

:::danger GPU Warm-Up Latency
The first inference request after a GPU model is loaded triggers JIT compilation, kernel warm-up, and CUDA graph initialization. This first request takes 10-100x longer than subsequent ones. Without explicit warm-up (running 10-100 dummy requests at startup), the first real user request hits this cold-start latency. Always warm up models during server initialization, before marking the instance as ready in health checks.
:::
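The warm-up rule can be sketched as a readiness flag that flips only after the dummy requests have run. The `ModelServer` class, the status codes returned by `health_check`, and the toy model are hypothetical. In a real service the model call would be the compiled GPU model and the health check would be an HTTP or gRPC endpoint.

```python
import time
from typing import Callable


class ModelServer:
    """Keeps the instance out of rotation until warm-up has finished."""

    def __init__(self, model: Callable[[list[float]], float], feature_dim: int):
        self.model = model
        self.feature_dim = feature_dim
        self.ready = False

    def warm_up(self, n_requests: int = 20) -> float:
        """Run dummy requests so compilation and caching happen before real traffic."""
        dummy = [0.0] * self.feature_dim
        start = time.perf_counter()
        for _ in range(n_requests):
            self.model(dummy)
        self.ready = True  # only now may the load balancer send traffic
        return (time.perf_counter() - start) * 1000

    def health_check(self) -> int:
        # 503 tells the load balancer not to route traffic to this instance yet
        return 200 if self.ready else 503


def toy_model(features: list[float]) -> float:
    return sum(features)  # stand-in for a real forward pass


server = ModelServer(toy_model, feature_dim=512)
status_before = server.health_check()
warmup_ms = server.warm_up()
status_after = server.health_check()

print(f"before warm-up: {status_before}, after warm-up: {status_after} ({warmup_ms:.2f} ms)")
```

Warm-up inputs should match the shapes and batch sizes seen in production, since compiled models may recompile for shapes they have not seen.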

Interview Q&A

Q: How would you get a model serving endpoint from 40ms p99 to under 5ms p99?

Profile first to identify the bottleneck. Likely steps in order of impact: (1) Switch serialization from REST+JSON to gRPC+Protobuf - saves 1-5ms. (2) Apply INT8 quantization - typically cuts computation time by 2-4x. (3) Compile with torch.compile or TorchScript + optimize_for_inference - can cut the remaining compute time by up to a further 2x. (4) Use CUDA pinned memory for transfers - saves 0.5-1ms. (5) Implement dynamic batching with a short maximum wait to amortize kernel launch overhead across requests. (6) Use prediction caching if the input space has sufficient repetition. At each step, measure the actual improvement and verify accuracy is maintained.

Q: Why does INT8 quantization reduce latency and what is the accuracy trade-off?

INT8 quantization replaces 32-bit floating point weights and activations with 8-bit integers. Modern CPUs and GPUs have SIMD and tensor core instructions that process 4x more INT8 values per clock cycle than FP32, leading to 2-4x throughput improvement. The accuracy trade-off is typically 0.5-1% on classification tasks (measured by top-1 accuracy or AUC) with post-training quantization, or less than 0.1% with quantization-aware training. The accuracy loss is higher for models that rely on fine-grained weight differences (attention mechanisms with small softmax values) and lower for embedding-heavy models. Always measure accuracy on your specific task with your quantized model before deployment.

Q: What is the difference between latency and throughput in ML serving, and how do you optimize for each?

Latency is the time for one request to complete. Throughput is the number of requests completed per unit time. They are often in tension. Maximizing throughput: use large batches (higher GPU utilization), longer wait times for batch formation, and as many concurrent model instances as GPU memory allows. Minimizing latency: use small or no batches (no wait time), prioritize in-flight requests over new ones (preemptive scheduling), and ensure GPU memory is not oversubscribed (which forces evictions and reloads). For latency-critical systems, set a maximum batch formation wait time (e.g., 5ms) and never exceed it, even if that means smaller batches and lower GPU utilization.

Q: How does pre-computation reduce inference latency and when is it applicable?

Pre-computation runs the model before the request arrives and stores the result. At request time, serving is a key-value lookup (1-2ms) instead of model inference (10-50ms). It is applicable when: inputs are from a known finite set (user IDs, item IDs, query templates), freshness requirements allow results to be computed in advance (predictions that are valid for 30+ minutes), and the total pre-computation space is manageable (cannot pre-compute for every possible input combination). Common applications: user embedding pre-computation in recommendation, item score pre-computation for top-K retrieval, query-expansion dictionaries in search. Not applicable for: fraud detection (depends on the specific transaction), real-time bidding (depends on the specific page + user context at the moment of the bid).
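The pattern can be sketched as an offline job that fills a table and a serving path that reads it, falling back to live inference on a miss. Everything named here is hypothetical: `expensive_model` stands in for a real model, the item IDs are made up, and a plain dict stands in for the key-value store.

```python
from typing import Callable


def expensive_model(item_id: int) -> float:
    """Stand-in for real model inference (hypothetical scoring rule)."""
    return (item_id * 37 % 100) / 100


def precompute_scores(item_ids: list[int], model: Callable[[int], float]) -> dict[int, float]:
    """Offline job: run the model for every known item ahead of time."""
    return {item_id: model(item_id) for item_id in item_ids}


class PrecomputedScorer:
    """Serving path: table lookup first, live inference only on a miss."""

    def __init__(self, table: dict[int, float], fallback: Callable[[int], float]):
        self.table = table
        self.fallback = fallback
        self.fallback_calls = 0

    def score(self, item_id: int) -> float:
        cached = self.table.get(item_id)
        if cached is not None:
            return cached
        self.fallback_calls += 1  # miss: pay the full inference cost
        return self.fallback(item_id)


table = precompute_scores(list(range(1000)), expensive_model)
scorer = PrecomputedScorer(table, fallback=expensive_model)

known = scorer.score(3)       # served from the precomputed table
unknown = scorer.score(5000)  # not precomputed, falls back to the model

print(f"known={known}, unknown={unknown}, fallback calls={scorer.fallback_calls}")
```

Tracking the fallback rate is worth doing in production: if it climbs, the precomputed set no longer matches the traffic and the latency benefit is eroding.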

Q: What is CUDA kernel launch overhead and how does operator fusion reduce it?

Every PyTorch operation that executes on the GPU submits a CUDA kernel - a program that runs on GPU cores. Submitting a kernel from the CPU takes 10-50 microseconds. A model with 100 sequential operations (attention heads, layer norms, activations) submits 100 kernels, costing 1-5ms in launch overhead alone. Operator fusion combines multiple operations into a single kernel. Instead of three kernels (layer norm, matmul, activation), one fused kernel handles all three, which also avoids writing intermediate results back to GPU memory between steps. torch.compile's Inductor backend does this automatically: it analyzes the computation graph and generates fused kernels that can approach hand-written CUDA for common patterns. For transformer models, kernel fusion alone can provide a 2-3x latency improvement.
