What is low latency ML?

Engineering for ultra-low latency inference - NUMA awareness, CPU affinity, memory pre-allocation, lock-free data structures, cache line optimization, zero-copy inference, CUDA streams, and kernel profiling.

Low-Latency Optimization

From 8ms to 0.6ms: A Systematic Hunt

The trading algorithm team has a problem that money cannot easily solve. Their XGBoost model predicts short-term price movements and triggers buy/sell signals. The model runs in 8ms. Their competition's model runs in under 1ms. In high-frequency trading, being 7ms slower means you never get filled on the good trades - the opportunity is gone by the time your signal fires.

The ML engineer assigned to the optimization project starts systematically. First: profile. She samples the inference process with perf (perf record, then perf report) to see where the time goes. The output is damning: 32% of time is in kernel transitions (system calls), 18% is in memory allocation/deallocation, 12% is in mutex locks in the thread pool, 9% is in NUMA remote memory access. The XGBoost model itself - the actual tree traversals - accounts for only 29% of total time.

She attacks each root cause independently:

Replace per-request memory allocation with pre-allocated fixed buffers: saves 2.1ms
Pin the inference thread to a specific CPU core (no context switches, no NUMA hops): saves 1.8ms
Replace the mutex-locked feature map with a lock-free ring buffer: saves 0.9ms
Pre-populate the CPU L1/L2 cache with model weights at startup: saves 0.6ms
Eliminate all logging in the hot path: saves 0.3ms
Tune XGBoost prediction for CPU vector instructions (AVX2): saves 0.7ms

Total: 6.4ms savings. Inference: 1.6ms. With FPGA acceleration for the final tree traversals: 0.6ms.

The model did not change. The algorithm did not change. The world changed - the latency wall moved from 8ms to 0.6ms through pure systems engineering.

Why This Exists - The Long Tail of Latency

At 10ms latency, most inefficiencies are invisible. At 1ms latency, every microsecond matters. The sources of latency that are acceptable overhead at 10ms become the dominant cost at sub-millisecond targets:

Memory allocation: malloc() takes 0.1-2μs per call, depending on heap state
System calls: read() and write() each take 0.1-1μs; clock_gettime() is usually cheaper on Linux because the vDSO serves it without entering the kernel
Cache misses: accessing a variable not in L1 cache costs 4-200ns (L2 to DRAM)
NUMA remote access: accessing memory on a different CPU socket costs 60-100ns additional
Thread scheduling: a thread waking from sleep has 10-50μs of kernel scheduling overhead
Mutex contention: acquiring a contested mutex costs 1-50μs plus cache line bouncing
Context switches: the OS preempting your thread and running another can add 10-100μs

At 10ms inference, these add maybe 0.5-1ms of overhead - 5-10%, negligible. At 0.6ms inference, the same sources add 0.5ms - 83% overhead. Low-latency engineering is the systematic identification and elimination of every source of variable-cost overhead.
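The arithmetic behind those percentages can be sketched in a few lines. The function name `overhead_share` is made up for illustration, and the fixed 0.5 ms overhead is the figure quoted in the paragraph above, not a measurement.

```python
def overhead_share(overhead_ms: float, inference_ms: float) -> float:
    """Fixed overhead expressed as a fraction of the inference time."""
    return overhead_ms / inference_ms

# The same 0.5 ms of allocation, syscall and scheduling overhead
# looks very different at the two latency targets from the text.
for inference_ms in (10.0, 0.6):
    share = overhead_share(0.5, inference_ms)
    print(f"{inference_ms:>5.1f} ms inference: {share:.0%} overhead")
```

The overhead is the same in absolute terms; only the denominator shrinks, which is why it goes from a rounding error to the dominant cost.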

Historical Context

Low-latency software engineering has its roots in high-frequency trading (HFT), which emerged as a discipline around 2005-2010. Companies like Virtu Financial, Citadel, and Jane Street spent hundreds of millions building infrastructure to shave microseconds off trade execution. The techniques they developed - NUMA-aware threading, lock-free data structures, CPU pinning, kernel bypass networking - migrated from HFT to real-time gaming, then to low-latency ML inference.

The Intel VTune profiler (1990s, widely used from 2005) and Linux perf tool (2009) enabled engineers to identify bottlenecks at the CPU cycle level. Without these tools, sub-millisecond optimization is guesswork.

NVIDIA's CUDA streams (available since the early CUDA releases, around 2007-2008) enabled overlapping data transfer and computation on the GPU - a critical optimization for inference pipelines where input preparation and GPU computation can be pipelined.

The C10k problem (1999, Kegel) drew attention to the fact that OS-level abstractions (threads, blocking I/O) were insufficient for high-concurrency low-latency services. Solutions - epoll, io_uring, zero-copy networking - gradually migrated from network servers to ML inference pipelines.

The Latency Stack

NUMA Awareness and CPU Affinity

Modern servers have multiple CPU sockets. Each socket has its own local RAM. Accessing memory on a different socket (NUMA remote) costs 60-100ns more than local access. For inference code running on socket 0, model weights stored in socket 1's RAM cost extra.

# numa_setup.py - NUMA-aware process configuration
import os
import ctypes
import subprocess
from typing import Optional

def get_cpu_info() -> dict:
    """Get NUMA topology of the current machine."""
    result = subprocess.run(
        ["numactl", "--hardware"],
        capture_output=True, text=True
    )
    return {"topology": result.stdout}

def pin_process_to_numa_node(node: int = 0):
    """
    Pin the current process to a NUMA node.
    All memory allocations and CPU scheduling stay on this node.
    Eliminates NUMA remote access overhead.
    """
    import numa  # pip install numa

    # Bind memory allocations to this node
    numa.set_membind({node})

    # Get CPUs on this node
    cpus = numa.node_to_cpus(node)

    # Set CPU affinity (os.sched_setaffinity takes the CPU ids directly,
    # so no bitmask is needed). Call this before spawning worker threads
    # so they inherit the mask.
    os.sched_setaffinity(0, cpus)

    print(f"Process pinned to NUMA node {node}: CPUs {list(cpus)}")
    return cpus


def pin_thread_to_cpu(cpu_id: int):
    """
    Pin current thread to a specific CPU core.
    Prevents OS from migrating thread, eliminating cache warm-up delays.
    """
    os.sched_setaffinity(0, {cpu_id})
    print(f"Thread pinned to CPU {cpu_id}")


# Usage: pin inference thread to an isolated core
# Configure kernel with isolcpus=4,5,6,7 to prevent OS from using these cores
# Then pin inference threads to cores 4-7 for deterministic low latency

def setup_isolated_inference_thread(cpu_id: int):
    """
    Set up an inference thread on an isolated CPU core.
    Maximum isolation: no OS scheduling interference.
    """
    import threading
    import queue

    request_queue = queue.Queue(maxsize=100)
    result_store = {}

    def inference_loop():
        # Pin to isolated CPU
        pin_thread_to_cpu(cpu_id)

        # Set real-time scheduling priority (requires CAP_SYS_NICE)
        try:
            # os.sched_setscheduler raises OSError on failure; a raw libc call
            # through ctypes only returns -1, which a try/except never sees.
            os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(99))  # max priority
            print(f"Real-time priority set on CPU {cpu_id}")
        except OSError as e:
            print(f"Could not set RT priority (need CAP_SYS_NICE): {e}")

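        # Warm up caches before serving
        # (_warmup_model_cache and _fast_predict are application-specific
        # placeholders that are not defined in this file)
        _warmup_model_cache()

        # Inference loop - keep allocations and logging out of this path.
        # Note: queue.Queue takes a lock internally; swap in a lock-free
        # ring buffer if the queue itself shows up in profiles.
        while True:
            request_id, features = request_queue.get()
            result = _fast_predict(features)  # pre-allocated buffers
            result_store[request_id] = result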

    thread = threading.Thread(target=inference_loop, daemon=True)
    thread.start()
    return request_queue, result_store
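When experimenting with pinning, it helps to be able to undo it. The sketch below wraps the same `os.sched_setaffinity` call in a context manager that restores the previous mask on exit. The name `pinned_to_cpu` is hypothetical, and the code assumes Linux, where `os.sched_getaffinity` and `os.sched_setaffinity` are available.

```python
import os
from contextlib import contextmanager

@contextmanager
def pinned_to_cpu(cpu_id: int):
    """Temporarily pin the calling thread to one core, then restore the old mask."""
    previous = os.sched_getaffinity(0)  # CPUs the caller is currently allowed on
    if cpu_id not in previous:
        raise ValueError(f"CPU {cpu_id} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu_id})
    try:
        yield
    finally:
        os.sched_setaffinity(0, previous)

if __name__ == "__main__":
    target = min(os.sched_getaffinity(0))  # any core we are allowed to use
    with pinned_to_cpu(target):
        print("inside:", os.sched_getaffinity(0))
    print("after: ", os.sched_getaffinity(0))
```

Checking the target against the current mask first avoids a confusing `OSError` inside containers or cgroups that only expose a subset of cores.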

Memory Pre-Allocation and Arena Allocators

Per-request memory allocation (malloc()/free()) is expensive and creates non-deterministic latency spikes from heap fragmentation and GC.

# arena_allocator.py - pre-allocated fixed-size buffers for zero-alloc inference
import numpy as np
from typing import Optional
import ctypes

class InferenceBufferPool:
    """
    Pre-allocates a fixed pool of numpy arrays for inference.
    Zero allocation in the hot path - borrow/return pattern.
    """

    def __init__(
        self,
        batch_size: int,
        feature_dim: int,
        output_dim: int,
        pool_size: int = 32,  # number of pre-allocated buffer sets
        dtype = np.float32
    ):
        self.batch_size = batch_size
        self.feature_dim = feature_dim
        self.output_dim = output_dim
        self.dtype = dtype

        # Pre-allocate all buffers upfront
        # Note: np.zeros returns ordinary pageable memory; for faster GPU
        # transfer, allocate pinned tensors instead (torch.zeros(..., pin_memory=True))
        self._input_buffers = [
            np.zeros((batch_size, feature_dim), dtype=dtype)
            for _ in range(pool_size)
        ]
        self._output_buffers = [
            np.zeros((batch_size, output_dim), dtype=dtype)
            for _ in range(pool_size)
        ]

        # Track available buffer indices
        import queue
        self._available = queue.Queue()
        for i in range(pool_size):
            self._available.put(i)

    def acquire(self, timeout_ms: float = 5.0):
        """
        Borrow a buffer set from the pool (blocking up to timeout_ms).
        Returns (buffer_index, input_buffer, output_buffer).
        """
        import queue
        try:
            idx = self._available.get(timeout=timeout_ms / 1000)
            return idx, self._input_buffers[idx], self._output_buffers[idx]
        except queue.Empty:
            raise RuntimeError("Buffer pool exhausted - increase pool_size or reduce QPS")

    def release(self, idx: int):
        """Return buffer set to pool after inference completes."""
        self._available.put(idx)


# In C++ (for Python extension or native inference):
# struct ArenaAllocator {
#     void* memory;
#     size_t offset;
#     size_t capacity;
#
#     void* alloc(size_t size) {
#         if (offset + size > capacity) return nullptr;  // no fragmentation
#         void* ptr = (char*)memory + offset;
#         offset += (size + 63) & ~63;  // align to cache line
#         return ptr;
#     }
#
#     void reset() { offset = 0; }  // free entire arena at once - O(1)
# };
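Pre-allocated buffers only help if the computation actually writes into them. A minimal sketch of that idea with NumPy's `out=` parameter follows; the tiny linear-plus-ReLU "model", the shapes, and the name `predict_into` are stand-ins chosen for illustration, not part of the lesson's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 8)).astype(np.float32)   # stand-in model
features = rng.standard_normal((4, 64)).astype(np.float32)  # one request batch

# Allocated once at startup, reused for every request.
scores = np.empty((4, 8), dtype=np.float32)

def predict_into(x: np.ndarray, out: np.ndarray) -> np.ndarray:
    """Linear layer + ReLU, writing every intermediate into `out`."""
    np.matmul(x, weights, out=out)  # result lands in the caller's buffer
    np.maximum(out, 0.0, out=out)   # in-place ReLU, no temporary array
    return out

result = predict_into(features, scores)
print(result is scores)  # the caller's buffer was filled in place
```

Writing `scores = np.maximum(x @ weights, 0)` instead would allocate two fresh arrays per request, which is exactly the per-call heap traffic the pool is meant to remove.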

Lock-Free Data Structures

Mutex locks introduce both blocking overhead and cache line bouncing (the thread waiting for the lock repeatedly invalidates the cache line holding the lock state). Lock-free algorithms use CPU atomic operations (compare-and-swap) instead.

# lock_free.py - lock-free single-producer single-consumer ring buffer
# Python implementation for illustration; use C++ in production for real HFT
import ctypes
import threading
from typing import Optional, TypeVar, Generic

T = TypeVar('T')

class SPSCRingBuffer:
    """
    Single-Producer Single-Consumer lock-free ring buffer.
    Used to pass inference requests from network thread to inference thread
    without mutex overhead.

    In Python, ctypes operations are not truly lock-free due to the GIL.
    This illustrates the pattern; use C++ atomic<> in production.
    """

    def __init__(self, capacity: int):
        assert (capacity & (capacity - 1)) == 0, "Capacity must be power of 2"
        self.capacity = capacity
        self.mask = capacity - 1
        self.buffer = [None] * capacity

        # Head/tail are cache-line aligned to prevent false sharing
        # In C++: alignas(64) atomic<uint64_t> head, tail;
        self._head = 0  # producer writes here
        self._tail = 0  # consumer reads from here

    def try_push(self, item) -> bool:
        """
        Try to add an item. Returns False if buffer is full.
        Designed for single producer - no producer-side locking.
        """
        head = self._head
        next_head = (head + 1) & self.mask

        if next_head == self._tail:
            return False  # buffer full

        self.buffer[head & self.mask] = item
        # Memory barrier: ensure item is written before head is updated
        # In C++: head.store(next_head, std::memory_order_release)
        self._head = next_head
        return True

    def try_pop(self) -> Optional[object]:
        """
        Try to remove an item. Returns None if buffer is empty.
        Designed for single consumer - no consumer-side locking.
        """
        tail = self._tail

        if tail == self._head:
            return None  # buffer empty

        item = self.buffer[tail & self.mask]
        # In C++: tail.store((tail+1) & mask, std::memory_order_release)
        self._tail = (tail + 1) & self.mask
        return item
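A threaded usage sketch follows. To stay runnable on its own it re-declares a compact ring with the same `try_push`/`try_pop` interface; `TinySPSC` and `run_demo` are illustrative names. It uses monotonically increasing head/tail counters, a variant of the masked-index scheme above that can fill every slot, and it relies on CPython's GIL for memory ordering, so it shows the pattern rather than a production implementation.

```python
import threading
import time

class TinySPSC:
    """Compact SPSC ring: head/tail count up forever; slot index = counter & mask."""
    def __init__(self, capacity: int):
        assert capacity > 0 and (capacity & (capacity - 1)) == 0, "power of 2"
        self.mask = capacity - 1
        self.buffer = [None] * capacity
        self.head = 0  # written only by the producer
        self.tail = 0  # written only by the consumer

    def try_push(self, item) -> bool:
        if self.head - self.tail > self.mask:  # every slot is in use
            return False
        self.buffer[self.head & self.mask] = item
        self.head += 1                         # publish after the slot is written
        return True

    def try_pop(self):
        if self.tail == self.head:
            return None                        # empty
        item = self.buffer[self.tail & self.mask]
        self.tail += 1
        return item

def run_demo(n_items: int = 2000) -> list:
    ring = TinySPSC(capacity=64)
    received = []

    def producer():
        for i in range(n_items):
            while not ring.try_push(i):
                time.sleep(0)  # ring full: yield the GIL to the consumer

    def consumer():
        while len(received) < n_items:
            item = ring.try_pop()
            if item is None:
                time.sleep(0)  # ring empty: yield the GIL to the producer
            else:
                received.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

if __name__ == "__main__":
    print(run_demo() == list(range(2000)))  # items arrive in order, none lost
```

Because each counter has exactly one writer, neither side ever needs a compare-and-swap; that is what makes the single-producer single-consumer case so much simpler than a general lock-free queue.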

Cache Line Optimization

CPU cache lines are 64 bytes on x86-64 (some other architectures use 128 bytes). When two threads access different variables in the same cache line, they create "false sharing" - every write by one thread invalidates the other thread's cache entry, even though they are not accessing the same variable.

# cache_line.py - cache line alignment to prevent false sharing
import ctypes
import numpy as np

# Bad: metrics struct with fields that are written by different threads
# Both `requests` and `errors` are in the same cache line
# Thread 1 writes requests, Thread 2 writes errors → false sharing
class BadMetrics(ctypes.Structure):
    _fields_ = [
        ("requests", ctypes.c_uint64),   # written by thread 1
        ("errors", ctypes.c_uint64),     # written by thread 2
        # These are adjacent in memory → same cache line
    ]

# Good: pad each field to a full cache line (64 bytes)
CACHE_LINE_SIZE = 64

class CacheLineAligned(ctypes.Structure):
    """Each metric occupies its own 64-byte cache line."""
    _fields_ = [
        ("requests", ctypes.c_uint64),
        ("_pad1", ctypes.c_uint8 * (CACHE_LINE_SIZE - 8)),  # pad to 64 bytes
        ("errors", ctypes.c_uint64),
        ("_pad2", ctypes.c_uint8 * (CACHE_LINE_SIZE - 8)),
        ("latency_sum_ns", ctypes.c_uint64),
        ("_pad3", ctypes.c_uint8 * (CACHE_LINE_SIZE - 8)),
    ]

# For numpy arrays used in inference:
# Align array start to cache line boundary
def allocate_aligned(shape, dtype=np.float32, alignment=64):
    """Allocate numpy array aligned to cache line boundary."""
    size = np.prod(shape) * np.dtype(dtype).itemsize
    # Allocate extra space for alignment
    raw = np.zeros(size + alignment, dtype=np.uint8)
    # Find aligned start
    offset = (alignment - (raw.ctypes.data % alignment)) % alignment
    aligned = raw[offset:offset + size].view(dtype).reshape(shape)
    return aligned

# Prefetch model weights into cache before the hot path:
def prefetch_model_weights(model_weights: np.ndarray):
    """
    Touch every cache line of model weights to pull them into the cache
    hierarchy (whatever does not fit in L1/L2 ends up in L3 at best).
    Run this in a background thread before peak traffic.
    """
    stride = max(1, 64 // model_weights.itemsize)  # elements per cache line
    # reshape(-1) is a view for contiguous arrays, so this reads without copying
    _ = model_weights.reshape(-1)[::stride].sum()  # force load without writing
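The effect of the padding can be checked directly from the field offsets that ctypes reports. This is a small self-contained check; `Packed`, `Padded`, and `gap` are illustrative names, and the 64-byte line size is an assumption that holds for typical x86-64 parts.

```python
import ctypes

CACHE_LINE = 64  # assumed line size; typical for x86-64

class Packed(ctypes.Structure):
    _fields_ = [("requests", ctypes.c_uint64),
                ("errors", ctypes.c_uint64)]

class Padded(ctypes.Structure):
    _fields_ = [("requests", ctypes.c_uint64),
                ("_pad1", ctypes.c_uint8 * (CACHE_LINE - 8)),
                ("errors", ctypes.c_uint64),
                ("_pad2", ctypes.c_uint8 * (CACHE_LINE - 8))]

def gap(struct) -> int:
    """Byte distance between the two counters."""
    return struct.errors.offset - struct.requests.offset

print(gap(Packed), ctypes.sizeof(Packed))  # 8 16
print(gap(Padded), ctypes.sizeof(Padded))  # 64 128
```

Two counters that sit a full line apart cannot land in the same cache line, whatever address the structure itself starts at. The price is memory: the padded layout is eight times larger for the same two fields.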

Zero-Copy Inference

Standard inference pipelines copy data multiple times: receive bytes → decode → allocate tensor → copy to tensor → forward pass → copy output → serialize → send. Each copy costs memory bandwidth and time.

# zero_copy.py - minimizing data copies in the inference hot path
import numpy as np
import torch

class ZeroCopyInference:
    """
    Minimize memory copies on the inference hot path.
    Strategy: receive data directly into pre-allocated pinned buffers.
    """

    def __init__(self, feature_dim: int, batch_size: int):
        # Pre-allocate pinned (page-locked) CPU memory
        # Pinned memory can be DMA-transferred to GPU without CPU involvement
        self.cpu_input = torch.zeros(
            (batch_size, feature_dim),
            dtype=torch.float32,
            pin_memory=True   # page-locked memory for DMA transfers
        )
        self.cpu_output = torch.zeros(
            (batch_size, 1),
            dtype=torch.float32,
            pin_memory=True
        )

        # Pre-allocate GPU tensors
        self.gpu_input = torch.zeros(
            (batch_size, feature_dim),
            dtype=torch.float32,
            device="cuda"
        )
        self.gpu_output = torch.zeros(
            (batch_size, 1),
            dtype=torch.float32,
            device="cuda"
        )

        # CUDA streams for async operations
        self.transfer_stream = torch.cuda.Stream()
        self.compute_stream = torch.cuda.Stream()

    def predict_batch_zero_copy(
        self,
        raw_bytes: bytes,  # incoming features as raw float bytes
        n_samples: int,
        model: torch.nn.Module
    ) -> np.ndarray:
        """
        Predict using pre-allocated buffers.
        Zero heap allocation in the hot path.
        """
        # Decode directly from the incoming bytes: np.frombuffer creates a
        # read-only view (zero-copy), so no intermediate array is allocated
        input_view = np.frombuffer(raw_bytes, dtype=np.float32).reshape(
            (n_samples, self.cpu_input.shape[1])
        )

        # Copy view into pinned buffer (one copy, from incoming bytes to pinned)
        self.cpu_input[:n_samples].copy_(
            torch.from_numpy(input_view)
        )

        # Async transfer: pinned CPU → GPU (overlaps with CPU work)
        with torch.cuda.stream(self.transfer_stream):
            self.gpu_input[:n_samples].copy_(
                self.cpu_input[:n_samples], non_blocking=True
            )

        # Compute on GPU
        with torch.cuda.stream(self.compute_stream):
            self.compute_stream.wait_stream(self.transfer_stream)
            with torch.no_grad():
                self.gpu_output[:n_samples] = model(self.gpu_input[:n_samples])

        # Async transfer back: GPU → pinned CPU
        with torch.cuda.stream(self.transfer_stream):
            self.transfer_stream.wait_stream(self.compute_stream)
            self.cpu_output[:n_samples].copy_(
                self.gpu_output[:n_samples], non_blocking=True
            )

        # Synchronize only at the very end
        torch.cuda.synchronize()

        # Return numpy view of pinned memory (zero-copy). The view is
        # overwritten by the next call, so consume or copy it before then.
        return self.cpu_output[:n_samples].numpy()
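The difference between a view and a copy is easy to verify on the CPU alone. In this sketch a `bytearray` stands in for a received network buffer; the variable names are illustrative.

```python
import struct
import numpy as np

# Stand-in for a network receive buffer holding six float32 values.
payload = bytearray(np.arange(6, dtype=np.float32).tobytes())

view = np.frombuffer(payload, dtype=np.float32).reshape(2, 3)  # no copy
copied = np.array(view)                                        # explicit copy

# Overwrite the first float in the underlying buffer.
struct.pack_into("f", payload, 0, 42.0)

print(view[0, 0], copied[0, 0])        # the view sees the write, the copy does not
print(np.shares_memory(view, copied))  # False
```

The flip side of zero-copy is lifetime: the view is only valid while the underlying buffer is alive and unchanged, which is why the pipeline above still copies once into its own pinned buffer.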

CUDA Streams for Overlapping Transfer and Compute

Without CUDA streams, data transfer (CPU to GPU) and computation are serialized. With streams, you can overlap transfer of batch N+1 with computation of batch N:

# cuda_streams.py - multi-stream pipelined inference
import torch
import time
from typing import List

class PipelinedInference:
    """
    Uses two CUDA streams to overlap data transfer with compute.
    While GPU is computing batch N, batch N+1 is being transferred.
    """

    def __init__(self, model: torch.nn.Module, n_streams: int = 2):
        self.model = model
        self.streams = [torch.cuda.Stream() for _ in range(n_streams)]
        self.events = [torch.cuda.Event() for _ in range(n_streams)]

    def process_batches(self, batches: List[torch.Tensor]) -> List[torch.Tensor]:
        """Process a list of batches using pipelined transfer/compute."""
        n = len(batches)
        results = [None] * n

        for i, batch in enumerate(batches):
            stream = self.streams[i % len(self.streams)]

            with torch.cuda.stream(stream):
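                # Transfer to GPU (overlaps with compute on other stream;
                # only truly asynchronous when `batch` is in pinned memory)
                gpu_batch = batch.to("cuda", non_blocking=True)

                # Compute
                with torch.no_grad():
                    output = self.model(gpu_batch)

                # Transfer result back into pinned host memory without blocking
                # the CPU; a plain output.cpu() would wait here and serialize
                # the loop. Results are valid after the synchronize() below.
                host_out = torch.empty(output.shape, dtype=output.dtype, pin_memory=True)
                host_out.copy_(output, non_blocking=True)
                results[i] = host_out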

        # Wait for all streams to complete
        torch.cuda.synchronize()
        return results

Profiling the Hot Path

Before optimizing, measure. Python's cProfile is too coarse for microsecond analysis. Use perf on Linux:

# Profile inference binary at function level
perf record -g -F 1000 python inference_server.py &
sleep 30
kill %1
perf report --stdio | head -50

# Check for cache misses specifically
perf stat -e cache-references,cache-misses,instructions,cycles python inference_server.py

# Instructions per cycle, branch misses, and L1 data-cache misses
perf stat -e cpu-cycles,instructions,branch-misses,L1-dcache-loads,L1-dcache-load-misses python inference_server.py

# latency_histogram.py - high-resolution latency measurement
import time
import statistics
from typing import List

def measure_inference_latency(
    predict_fn: callable,
    sample_inputs: list,
    n_warmup: int = 100,
    n_measure: int = 1000
) -> dict:
    """
    Measure inference latency distribution accurately.
    Uses time.perf_counter_ns(), a monotonic nanosecond-resolution clock.
    """
    # Warmup: let JIT compile, caches warm, frequency governor settle
    for i in range(n_warmup):
        predict_fn(sample_inputs[i % len(sample_inputs)])

    # Measurement
    latencies_ns = []
    for i in range(n_measure):
        inp = sample_inputs[i % len(sample_inputs)]

        start_ns = time.perf_counter_ns()
        result = predict_fn(inp)
        end_ns = time.perf_counter_ns()

        latencies_ns.append(end_ns - start_ns)

    latencies_us = [ns / 1000 for ns in latencies_ns]

    return {
        "p50_us": statistics.median(latencies_us),
        "p95_us": statistics.quantiles(latencies_us, n=100)[94],
        "p99_us": statistics.quantiles(latencies_us, n=100)[98],
        "p999_us": statistics.quantiles(latencies_us, n=1000)[998],
        "min_us": min(latencies_us),
        "max_us": max(latencies_us),
        "mean_us": statistics.mean(latencies_us),
    }
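Percentiles summarize the tail, but a histogram shows its shape. Below is a minimal sketch that groups samples into power-of-two buckets; the function names and the sample values are made up for illustration.

```python
import math
from collections import Counter

def log2_histogram(latencies_us: list) -> dict:
    """Count samples per power-of-two bucket: key k covers [2**k, 2**(k+1)) us."""
    counts = Counter()
    for us in latencies_us:
        # Sub-microsecond samples are folded into bucket 0.
        k = math.floor(math.log2(us)) if us >= 1 else 0
        counts[k] += 1
    return dict(sorted(counts.items()))

def print_histogram(hist: dict, width: int = 40) -> None:
    """Render one text bar per bucket, scaled to the fullest bucket."""
    peak = max(hist.values())
    for k, n in hist.items():
        bar = "#" * max(1, round(width * n / peak))
        print(f"{2**k:>8} us+ | {bar} {n}")

# Made-up latencies in microseconds: a tight cluster plus two outliers.
samples = [120, 130, 125, 140, 135, 128, 900, 131, 127, 4200]
print_histogram(log2_histogram(samples))
```

Logarithmic buckets keep the common case and the rare multi-millisecond stall visible in the same picture, which linear buckets cannot do without thousands of rows.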

Common Mistakes

:::danger Optimizing Before Profiling
The single most common mistake in low-latency engineering: optimizing the wrong thing. Engineers assume the model computation is the bottleneck. In reality, the biggest wins come from eliminating memory allocations, system calls, and cache misses - overhead that has nothing to do with the ML algorithm. Always profile first. Tools: perf on Linux, Instruments on macOS, VTune on Windows/Linux. Find where time actually goes before changing anything.
:::

:::danger Python for Sub-Millisecond Inference
CPython has a GIL (Global Interpreter Lock) that prevents true parallelism and adds non-deterministic overhead. For sub-millisecond inference targets, Python is often the wrong language for the inference hot path. Options: (a) use a Python C extension for the hot path; (b) use PyTorch's Libtorch C++ API directly; (c) compile the model to a native library (TensorRT engine, ONNX Runtime C++ API, XGBoost native C++ prediction). Python is excellent for orchestration, preprocessing, and postprocessing - just not for the innermost inference loop when you need deterministic sub-millisecond latency.
:::

:::warning CPU Frequency Scaling Wrecking Your Benchmarks
Modern CPUs run at low frequency when idle and scale up to base/boost clock under load. Cold benchmarks (first few iterations) may run at 1.2GHz while production runs at 3.5GHz - making benchmarks 3× slower than production. Conversely, sustained benchmarks may trigger thermal throttling that does not occur in production with bursty traffic. Fix: pin CPU frequency during benchmarking (cpupower frequency-set -g performance) and measure after 10-30 seconds of sustained load.
:::

Interview Q&A

Q1: How would you systematically reduce inference latency from 8ms to under 1ms?

A: Profile first - do not guess. Use perf stat to identify whether the bottleneck is compute-bound (high instructions/cycle ratio) or memory-bound (high cache miss rate) or system-call-bound (high kernel time). Then attack root causes: (1) Memory allocation: replace per-request malloc() with pre-allocated arena buffers. (2) NUMA: pin inference thread to the same NUMA node as model weights; verify with numastat. (3) CPU affinity: pin the inference thread to an isolated core to eliminate context switch overhead. (4) Mutex elimination: replace thread-safe queues with lock-free ring buffers for single-producer/single-consumer paths. (5) Cache efficiency: pre-fetch model weights before peak traffic; align data structures to cache lines. (6) Zero-copy: use frombuffer() to create views instead of allocating new tensors; use pinned memory for GPU transfers. (7) Logging: all logging in the hot path adds system call overhead; buffer and write asynchronously. After each change, re-profile to confirm improvement and find the next bottleneck.

Q2: What is NUMA and why does it matter for low-latency ML inference?

A: NUMA (Non-Uniform Memory Access) is the memory architecture of multi-socket servers. Each CPU socket has local RAM; accessing remote RAM (on another socket) costs an additional 60-100ns. For a model whose weights are 100MB, accessing those weights from a remote NUMA node adds 60-100ns per cache miss vs local access. Modern servers may have 2-4 NUMA nodes. If your inference process is pinned to socket 0 but model weights are allocated on socket 1 (e.g., the model loading thread ran on socket 1), every cold cache miss pays the remote access penalty. Solution: use numactl --membind=0 --cpunodebind=0 python serve.py to force all memory allocations and CPU scheduling to NUMA node 0. Verify with numastat -p <pid>: the process's memory should sit almost entirely on node 0.
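To see which cores belong to which node without shelling out to numactl, the topology can be read from sysfs. The sketch below assumes a Linux machine that exposes `/sys/devices/system/node/node*/cpulist`; the helper names `parse_cpulist` and `numa_topology` are hypothetical.

```python
from pathlib import Path

def parse_cpulist(text: str) -> set:
    """Expand a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def numa_topology(root: str = "/sys/devices/system/node") -> dict:
    """Map NUMA node id -> CPU ids, read from sysfs (empty dict if unavailable)."""
    base = Path(root)
    if not base.is_dir():
        return {}
    topology = {}
    for node_dir in sorted(base.glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            topology[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return topology

if __name__ == "__main__":
    for node, cpus in numa_topology().items():
        print(f"node {node}: CPUs {sorted(cpus)}")
```

The resulting CPU sets can be passed straight to `os.sched_setaffinity`, which keeps the pinning logic free of external tools.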

Q3: What is false sharing in cache lines and how do you eliminate it?

A: False sharing occurs when two CPU threads access different variables that happen to reside in the same 64-byte cache line. When thread A writes its variable, the CPU must invalidate the cache line in all other caches (to maintain cache coherence) - which also invalidates thread B's variable, forcing thread B to reload the entire cache line even though its variable was not modified. This "false" sharing causes cache line bouncing between CPU cores, adding 100-300ns per access. Elimination: pad structures so each frequently written variable occupies its own cache line. In C++: alignas(64) std::atomic<uint64_t> request_count;. In Python: use ctypes with explicit padding to 64 bytes. Profile with perf stat -e cache-misses before and after to verify improvement.

Q4: How do CUDA streams improve GPU inference throughput?

A: By default, PyTorch operations on a single GPU are serialized on the default CUDA stream: transfer batch 1 → compute batch 1 → transfer batch 2 → compute batch 2. The GPU transfer bus and compute engines are never both busy simultaneously. CUDA streams allow assigning different operations to different streams that execute concurrently on the GPU. With two streams: while stream 1 computes batch 1, stream 2 transfers batch 2. The compute and transfer overlap, reducing total time from N * (transfer + compute) to roughly transfer + N * compute (the transfer is amortized). In practice, this gives 20-50% throughput improvement when transfer time is comparable to compute time. Use non_blocking=True on .to(device) calls to initiate transfer without blocking the CPU thread.
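The timing argument can be written down as a small model. This is an idealized sketch that assumes transfer and compute overlap perfectly and ignores the copy back to the host; the function names and the example timings are illustrative, not measurements.

```python
def serialized_ms(n_batches: int, transfer_ms: float, compute_ms: float) -> float:
    """Single stream: every batch pays transfer, then compute, back to back."""
    return n_batches * (transfer_ms + compute_ms)

def pipelined_ms(n_batches: int, transfer_ms: float, compute_ms: float) -> float:
    """Two streams: once the pipeline is full, the slower stage sets the pace."""
    if n_batches == 0:
        return 0.0
    return transfer_ms + compute_ms + (n_batches - 1) * max(transfer_ms, compute_ms)

n, transfer, compute = 8, 0.5, 1.5  # made-up per-batch timings in milliseconds
before = serialized_ms(n, transfer, compute)
after = pipelined_ms(n, transfer, compute)
print(f"serialized: {before} ms, pipelined: {after} ms, speedup: {before / after:.2f}x")
```

When compute is the slower stage, the pipelined expression reduces to transfer + N * compute, the approximation given in the answer. When transfer dominates instead, the same model shows that streams alone cannot help and the input size has to shrink.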

Q5: What is zero-copy inference and when does it matter?

A: Zero-copy inference minimizes the number of memory copies data undergoes from arriving at the server to getting a prediction result. Typical pipeline without optimization: network bytes received → bytes object → decode to numpy → copy to torch tensor → copy to GPU tensor → compute → copy to CPU tensor → copy to numpy → serialize to bytes. That is 6 memory copies. Each copy costs: allocation time, memcpy time proportional to data size, and cache pollution. Zero-copy techniques: use numpy.frombuffer() to create a view of received bytes without copying; use pinned memory for CPU-side tensors (enables DMA transfer to GPU without CPU involvement); use non_blocking=True for GPU transfers; use output tensors that are direct views of pinned memory for writing results. Zero-copy matters most when data volume is large (images, audio) and inference latency is short - when you are spending 30% of your total latency time in memory copies.
