HTTP/3 and QUIC

Your ML inference service is running well. Response times average 85ms. Then one Monday morning, an engineer on the product team mentions that users in Southeast Asia are seeing p99 latencies over 800ms. Not the median - the 99th percentile. Most users are fine. One in a hundred is miserable. You investigate. The model is fast. The GPU is not saturated. The compute takes 62ms. Something in the network stack is eating 738ms for 1% of requests.

The culprit turns out to be TCP head-of-line blocking amplified by packet loss. Your inference service uses HTTP/2 - modern, multiplexed, efficient. But HTTP/2 runs over TCP, and TCP is a strictly ordered byte stream. When a single packet from a request gets lost on the lossy mobile network that some Southeast Asian users are on, TCP's retransmission mechanism holds up every other multiplexed stream on that connection, even streams whose packets arrived perfectly. A 1% packet loss rate on a connection with 50 concurrent streams means roughly a 40% chance that some stream will be blocked waiting for a retransmit at any given moment. Your "85ms service" becomes an 800ms service for anyone with a slightly flaky connection.

This is the problem QUIC was designed to solve. QUIC moves transport-layer multiplexing off TCP and onto UDP, where it can implement its own stream management. When a packet is lost in a QUIC connection, only the streams that were actually using that packet are blocked - not every other stream on the connection. HTTP/3 is simply HTTP/2 semantics running over QUIC instead of TCP. The result is a protocol that degrades much more gracefully under the packet loss conditions that are the reality of mobile and intercontinental networks.

For ML serving teams, HTTP/3 matters most at the edges: inference APIs that serve mobile clients, real-time streaming inference (where WebTransport over QUIC provides a new architecture), and multi-region deployments where transcontinental packet loss rates are 1-3%. For internal microservice communication on reliable datacenter networks, HTTP/2 is still appropriate. Knowing when HTTP/3 helps - and when it does not - requires understanding what QUIC actually does differently.

This lesson teaches the protocol mechanics, the performance implications, and how to deploy HTTP/3 for ML inference serving. We will build working QUIC servers with aioquic, benchmark the latency difference across HTTP versions under simulated packet loss, and examine how Triton Inference Server and vLLM interact with HTTP/3 proxies. By the end, you will be able to make an informed decision about where HTTP/3 belongs in your ML serving stack.

Why This Exists

HTTP/1.1, published in 1997, solved a specific problem: make the web fast over TCP connections. It introduced persistent connections (keep-alive), which avoided the cost of establishing a new TCP connection for every request. But it was fundamentally serial: one request at a time per connection, with responses delivered in order. If you had 30 resources to load on a page, you needed 30 sequential round trips, or multiple connections (browsers opened 6 parallel connections per domain as a workaround).

HTTP/2, standardized in 2015, added multiplexing: multiple requests and responses could be in-flight simultaneously on a single TCP connection. This eliminated the "6 connections" workaround and reduced head-of-line blocking at the HTTP layer. A slow large resource would no longer block a fast small resource at the application level.

But HTTP/2 left one form of head-of-line blocking in place: TCP head-of-line blocking. TCP guarantees ordered delivery. When a packet is lost, TCP buffers all subsequent packets and waits for the retransmission of the lost packet before delivering anything to the application. All HTTP/2 streams on that connection are blocked, even streams whose data is sitting in the receive buffer, ready to go.

QUIC, developed at Google starting around 2012 and standardized as RFC 9000 in 2021, rebuilds the transport layer on UDP to fix this. It implements its own reliability, congestion control, flow control, and stream multiplexing - but it does stream multiplexing in a way where packet loss only blocks the stream that lost the packet, not all streams.

Historical Context

Google first deployed QUIC in Chrome and their servers (Google.com, YouTube, Google Drive) around 2013-2014 as an experiment. By 2017, QUIC was carrying about 7% of all internet traffic simply by being the protocol Chrome used with Google properties. The early results were compelling: Google reported 8% faster video load times on YouTube and 3% faster page loads on Google Search for users with QUIC support.

The IETF standardization process began in 2016 and took five years. The result, IETF QUIC (RFC 9000), is substantially different from Google QUIC (gQUIC) - it has better privacy properties (encrypting packet numbers to prevent surveillance), a more extensible framing layer, and cleaner integration with TLS 1.3. HTTP/3 (RFC 9114) standardized HTTP semantics over IETF QUIC in 2022.

Cloudflare and Fastly were among the first CDNs to deploy HTTP/3 at scale. By 2023, roughly 30% of all websites supported HTTP/3, and Chrome reports approximately 25% of connections use QUIC. The ML serving community began paying serious attention to HTTP/3 around 2023-2024 as inference serving moved from internal APIs to user-facing products where network conditions are unpredictable.

Core Concepts: The TCP Head-of-Line Blocking Problem

To understand QUIC, you need to understand exactly why TCP head-of-line blocking is unavoidable in HTTP/2.

TCP is a byte stream protocol. It guarantees that bytes arrive at the application in exactly the order they were sent. This guarantee is implemented with sequence numbers: every TCP segment has a sequence number, and the receiver buffers out-of-order segments until the missing sequence arrives.

HTTP/2 over TCP - 4 multiplexed streams:

Sender                           Receiver
  |---[Pkt 1: Stream A data]--->|  OK
  |---[Pkt 2: Stream B data]--->|  OK
  |---[Pkt 3: Stream C data]--->|  LOST
  |---[Pkt 4: Stream D data]--->|  Received, but BUFFERED (waiting for Pkt 3)
  |---[Pkt 5: Stream A data]--->|  Received, but BUFFERED (waiting for Pkt 3)
  |
  |  (200ms retransmit timeout)
  |
  |---[Pkt 3: retransmit]------->  OK
                                   NOW deliver Pkt3, Pkt4, Pkt5
                                   Streams A, C, D all unblocked at once

All four streams waited for packet 3, even though only Stream C was using packet 3's data.

QUIC implements per-stream ordering at the application layer, not the transport layer. Packet loss only blocks the specific stream that had data in the lost packet:

HTTP/3 over QUIC - same scenario:

Sender                           Receiver
  |---[Pkt 1: Stream A data]--->|  Stream A: delivered immediately
  |---[Pkt 2: Stream B data]--->|  Stream B: delivered immediately
  |---[Pkt 3: Stream C data]--->|  LOST
  |---[Pkt 4: Stream D data]--->|  Stream D: delivered immediately
  |---[Pkt 5: Stream A data]--->|  Stream A: delivered immediately
  |                              |  (Only Stream C is blocked)
  |
  |---[Pkt 3: retransmit]------->  Stream C: unblocked

Streams A, B, and D continue making progress while Stream C waits for retransmission. QUIC does not eliminate packet loss - it eliminates the cross-stream blocking that packet loss causes.

QUIC's Key Features

0-RTT Connection Establishment

TCP requires a 3-way handshake before data flows (1 round trip). TCP + TLS 1.2 requires 2 round trips. TCP + TLS 1.3 requires 1 round trip. QUIC with TLS 1.3 can establish a new connection in 1 round trip, and reconnect to a known server in 0 round trips (0-RTT).

For ML inference with short-lived connections - a mobile app making sporadic requests - this means the first request's latency is dramatically lower:

$\text{HTTP/1.1 first request} = \text{TCP handshake} + \text{TLS 1.2 handshake} + \text{request} = 3 \times \text{RTT}$

$\text{HTTP/3 first request (known server)} = \text{0-RTT QUIC} + \text{request} \approx 1 \times \text{RTT}$

On a 100ms RTT connection (mobile to remote datacenter), HTTP/3 saves 200ms on the first request.

Connection Migration

TCP connections are identified by a 4-tuple: (source IP, source port, destination IP, destination port). When a mobile device switches from WiFi to LTE, its IP address changes, the 4-tuple changes, and the TCP connection must be reestablished from scratch.

QUIC identifies connections by a 64-bit Connection ID that is independent of the IP address. When a client switches networks, it sends a PATH_CHALLENGE frame on the new path, receives a PATH_RESPONSE, and continues the existing QUIC connection without interruption. An in-flight inference request survives the network transition.

TLS 1.3 is Mandatory

QUIC does not have an unencrypted mode. TLS 1.3 is baked into the protocol - the handshake, key derivation, and header encryption are all part of QUIC itself. This has a side effect: QUIC packets are largely opaque to network middleboxes (firewalls, load balancers, intrusion detection systems) that rely on inspecting TCP headers. QUIC uses UDP port 443 by convention.

Code: aioquic Server and Client

"""
quic_inference_server.py - HTTP/3 inference server using aioquic.
Demonstrates QUIC connection handling, stream management, and 0-RTT.

Install: pip install aioquic
Generate TLS cert: openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes
"""

import asyncio
import json
import time
from typing import Dict, Optional

from aioquic.asyncio import serve
from aioquic.asyncio.protocol import QuicConnectionProtocol
from aioquic.h3.connection import H3Connection
from aioquic.h3.events import HeadersReceived, DataReceived, H3Event
from aioquic.quic.configuration import QuicConfiguration
from aioquic.quic.events import QuicEvent


class InferenceHandler:
    """Handles one ML inference request stream over HTTP/3."""

    def __init__(self, connection: H3Connection, stream_id: int):
        self._conn = connection
        self._stream_id = stream_id
        self._headers: Dict[str, str] = {}
        self._body = b""

    def handle_headers(self, headers: list):
        self._headers = {k.decode(): v.decode() for k, v in headers}

    def handle_data(self, data: bytes, stream_ended: bool):
        self._body += data
        if stream_ended:
            asyncio.ensure_future(self._process_request())

    async def _process_request(self):
        path = self._headers.get(":path", "/")
        method = self._headers.get(":method", "GET")

        if path == "/predict" and method == "POST":
            try:
                payload = json.loads(self._body)
                result = await self._run_inference(payload)
                response_body = json.dumps(result).encode()
                status = b"200"
            except Exception as e:
                response_body = json.dumps({"error": str(e)}).encode()
                status = b"500"
        elif path == "/health":
            response_body = b'{"status": "ok"}'
            status = b"200"
        else:
            response_body = b'{"error": "not found"}'
            status = b"404"

        self._conn.send_headers(
            stream_id=self._stream_id,
            headers=[
                (b":status", status),
                (b"content-type", b"application/json"),
                (b"content-length", str(len(response_body)).encode()),
                (b"server", b"quic-inference/1.0"),
                # Tell next client to use HTTP/3 for 24 hours
                (b"alt-svc", b'h3=":4433"; ma=86400'),
            ],
        )
        self._conn.send_data(
            stream_id=self._stream_id,
            data=response_body,
            end_stream=True,
        )

    async def _run_inference(self, payload: dict) -> dict:
        """Replace with actual model inference."""
        await asyncio.sleep(0.062)  # Simulate 62ms GPU inference
        return {
            "predictions": [0.92, 0.05, 0.03],
            "latency_ms": 62,
            "model": "classifier-v2",
            "protocol": "HTTP/3 over QUIC",
        }


class InferenceServerProtocol(QuicConnectionProtocol):
    """QUIC connection handler - creates H3 connection, dispatches streams."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._http: Optional[H3Connection] = None
        self._handlers: Dict[int, InferenceHandler] = {}

    def quic_event_received(self, event: QuicEvent):
        if self._http is None:
            self._http = H3Connection(self._quic, enable_webtransport=False)

        for h3_event in self._http.handle_event(event):
            self._h3_event_received(h3_event)

    def _h3_event_received(self, event: H3Event):
        if isinstance(event, HeadersReceived):
            handler = InferenceHandler(self._http, event.stream_id)
            handler.handle_headers(event.headers)
            self._handlers[event.stream_id] = handler
        elif isinstance(event, DataReceived):
            handler = self._handlers.get(event.stream_id)
            if handler:
                handler.handle_data(event.data, event.stream_ended)


async def run_quic_server(
    host: str = "0.0.0.0",
    port: int = 4433,
    cert_file: str = "cert.pem",
    key_file: str = "key.pem",
):
    """Start the QUIC/HTTP/3 inference server."""
    configuration = QuicConfiguration(is_client=False)
    configuration.load_cert_chain(cert_file, key_file)

    print(f"HTTP/3 inference server listening on {host}:{port}/UDP")
    print("  QUIC version: IETF RFC 9000")
    print("  TLS: 1.3 (mandatory, built into QUIC)")
    print("  0-RTT: enabled for session-resumed clients")

    await serve(host, port, configuration=configuration,
                create_protocol=InferenceServerProtocol)
    await asyncio.Future()  # Run forever

"""
http_version_benchmark.py - Compare HTTP/1.1, HTTP/2, and HTTP/3 inference latency.
Simulates packet loss via tc netem to show the HOL blocking difference.
"""

import asyncio
import time
import statistics
import subprocess
import httpx
from dataclasses import dataclass, field
from typing import List


@dataclass
class BenchmarkResult:
    protocol: str
    n_requests: int
    mean_ms: float
    median_ms: float
    p95_ms: float
    p99_ms: float
    max_ms: float
    packet_loss_pct: float = 0.0


def compute_percentiles(latencies: List[float], protocol: str,
                        loss: float = 0.0) -> BenchmarkResult:
    s = sorted(latencies)
    n = len(s)
    return BenchmarkResult(
        protocol=protocol,
        n_requests=n,
        mean_ms=round(statistics.mean(latencies), 2),
        median_ms=round(statistics.median(latencies), 2),
        p95_ms=round(s[int(n * 0.95)], 2),
        p99_ms=round(s[int(n * 0.99)], 2),
        max_ms=round(max(latencies), 2),
        packet_loss_pct=loss,
    )


async def bench_http3(base_url: str, payload: dict, n: int = 200) -> List[float]:
    """Benchmark HTTP/3 using httpx with h3 support."""
    latencies = []
    async with httpx.AsyncClient(http3=True, verify=False, timeout=30.0) as client:
        # Warmup to establish session ticket (enables 0-RTT)
        for _ in range(5):
            await client.post(f"{base_url}/predict", json=payload)

        for _ in range(n):
            t0 = time.perf_counter()
            await client.post(f"{base_url}/predict", json=payload)
            latencies.append((time.perf_counter() - t0) * 1000)

    return latencies


async def bench_http2(base_url: str, payload: dict, n: int = 200) -> List[float]:
    """Benchmark HTTP/2 with connection reuse (same as typical serving setup)."""
    latencies = []
    async with httpx.AsyncClient(http2=True, verify=False, timeout=30.0) as client:
        for _ in range(5):
            await client.post(f"{base_url}/predict", json=payload)
        for _ in range(n):
            t0 = time.perf_counter()
            await client.post(f"{base_url}/predict", json=payload)
            latencies.append((time.perf_counter() - t0) * 1000)
    return latencies


async def bench_http11(base_url: str, payload: dict, n: int = 200) -> List[float]:
    """Benchmark HTTP/1.1 with keep-alive."""
    latencies = []
    async with httpx.AsyncClient(http1=True, http2=False,
                                  verify=False, timeout=30.0) as client:
        for _ in range(5):
            await client.post(f"{base_url}/predict", json=payload)
        for _ in range(n):
            t0 = time.perf_counter()
            await client.post(f"{base_url}/predict", json=payload)
            latencies.append((time.perf_counter() - t0) * 1000)
    return latencies


def set_packet_loss(interface: str, loss_pct: float):
    """Set packet loss on interface using tc netem (requires root)."""
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", interface, "root",
         "netem", "loss", f"{loss_pct}%"],
        check=True
    )


def clear_packet_loss(interface: str):
    """Remove tc packet loss simulation."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"],
                   capture_output=True)


async def run_full_comparison(
    h3_url: str = "https://localhost:4433",
    h2_url: str = "https://localhost:8443",
    interface: str = "lo",
):
    """
    Run HTTP version comparison with and without packet loss.

    Expected results (100ms RTT, 62ms inference, 1% loss):
      No loss:
        HTTP/1.1  p50=165ms, p99=180ms
        HTTP/2    p50=163ms, p99=176ms
        HTTP/3    p50=162ms, p99=174ms

      1% packet loss:
        HTTP/1.1  p50=165ms, p99=420ms   (occasional retransmit)
        HTTP/2    p50=163ms, p99=680ms   (HOL blocking amplifies loss)
        HTTP/3    p50=162ms, p99=178ms   (per-stream isolation - nearly no change)
    """
    payload = {"input": [1.0, 2.0, 3.0], "model": "classifier-v2"}
    results = []

    for loss in [0.0, 1.0, 2.0]:
        if loss > 0:
            set_packet_loss(interface, loss)
            print(f"\n=== {loss}% packet loss ===")
        else:
            print("\n=== No packet loss ===")

        h1_lat = await bench_http11(h2_url, payload)
        h2_lat = await bench_http2(h2_url, payload)
        h3_lat = await bench_http3(h3_url, payload)

        for lats, proto in [(h1_lat, "HTTP/1.1"), (h2_lat, "HTTP/2"),
                             (h3_lat, "HTTP/3")]:
            r = compute_percentiles(lats, proto, loss)
            results.append(r)
            print(f"  {r.protocol}: p50={r.median_ms}ms "
                  f"p95={r.p95_ms}ms p99={r.p99_ms}ms")

    if interface != "lo":
        clear_packet_loss(interface)

    return results

QUIC for ML Inference Serving

When HTTP/3 Helps

Mobile and edge clients. Mobile networks have 1-5% packet loss on LTE and intermittent connectivity on WiFi. HTTP/3's per-stream loss isolation and connection migration make it the right choice for inference APIs that serve mobile apps.

Multi-region deployments. Transcontinental internet links have 0.5-2% packet loss on average. An inference service in US-West serving users in Europe will see p99 improvements of 2-5x under HTTP/3 versus HTTP/2 for the same p50 latency.

Streaming LLM inference. WebTransport (built on QUIC) enables bidirectional streaming with proper stream multiplexing. For streaming token generation, multiple concurrent streams do not block each other even under packet loss.

When HTTP/3 Does Not Help

For internal service-to-service communication in a datacenter, HTTP/2 is usually better. Datacenter networks have packet loss rates under 0.01%, and the overhead of QUIC's user-space congestion control is slightly higher than the kernel's TCP stack. Use HTTP/2 for gRPC inference (Triton's gRPC API, Ray Serve internal traffic).

Caddy Configuration (HTTP/3 by Default)

# Caddyfile - HTTP/3 is enabled automatically in Caddy 2.x
inference.example.com {
    # Caddy enables QUIC on UDP 443 automatically
    # No extra directives needed

    reverse_proxy localhost:8000 {
        transport http {
            versions h2  # HTTP/2 to backend (reliable internal network)
        }
    }

    # Alt-Svc tells clients to upgrade to HTTP/3 on next connection
    header Alt-Svc 'h3=":443"; ma=86400'
}

Nginx HTTP/3 Configuration

# nginx.conf - requires nginx 1.25+ compiled with --with-http_v3_module
http {
    server {
        listen 443 ssl;
        listen 443 quic reuseport;   # UDP port for QUIC

        ssl_certificate  /etc/nginx/cert.pem;
        ssl_certificate_key /etc/nginx/key.pem;
        ssl_protocols TLSv1.3;       # Required for QUIC

        # Advertise HTTP/3 to clients
        add_header Alt-Svc 'h3=":443"; ma=86400';

        quic_retry on;               # Stateless retry (prevents amplification attacks)
        quic_gso on;                 # Generic Segmentation Offload
        ssl_early_data on;           # Enable 0-RTT

        location /v1/inference {
            proxy_pass http://triton_backend;
            proxy_http_version 1.1;
        }
    }
}

Envoy HTTP/3 for Kubernetes

# envoy-quic-filter-chain.yaml (relevant excerpt)
# Envoy as HTTP/3 ingress terminating QUIC, forwarding HTTP/2 to backends
filter_chains:
- transport_socket:
    name: envoy.transport_sockets.quic
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.quic.v3.QuicDownstreamTransport
      downstream_tls_context:
        common_tls_context:
          tls_certificates:
          - certificate_chain: {filename: /etc/envoy/cert.pem}
            private_key: {filename: /etc/envoy/key.pem}
          alpn_protocols: ["h3"]
  filters:
  - name: envoy.filters.network.http_connection_manager
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
      codec_type: HTTP3
      stat_prefix: ingress_http3
      route_config:
        virtual_hosts:
        - name: ml_inference
          domains: ["inference.example.com"]
          routes:
          - match: {prefix: "/v2/"}
            route:
              cluster: triton_cluster
              timeout: 300s

WebTransport for Streaming LLM Inference

WebTransport is an API built on QUIC that provides bidirectional streams. It is particularly relevant for streaming LLM token generation:

"""
streaming_inference_concept.py - WebTransport-style streaming for LLM tokens.
Production deployments use Caddy/Envoy as the WebTransport proxy.
"""

import asyncio
import json
from typing import AsyncIterator


async def generate_tokens(prompt: str) -> AsyncIterator[str]:
    """Simulate LLM token-by-token generation."""
    tokens = prompt.split() + ["is", "a", "great", "example", "."]
    for token in tokens:
        await asyncio.sleep(0.05)  # 20 tokens/second
        yield token


async def handle_webtransport_stream(reader, writer):
    """
    Each bidirectional WT stream = one inference request.
    Multiple streams on one QUIC connection - no HOL blocking between them.
    """
    request_data = await reader.read(65536)
    request = json.loads(request_data)

    token_count = 0
    max_tokens = request.get("max_tokens", 512)

    async for token in generate_tokens(request.get("prompt", "")):
        if token_count >= max_tokens:
            break
        frame = json.dumps({"token": token, "done": False}).encode()
        writer.write(len(frame).to_bytes(4, "big") + frame)
        await writer.drain()
        token_count += 1

    done_frame = json.dumps({"token": "", "done": True,
                              "total_tokens": token_count}).encode()
    writer.write(len(done_frame).to_bytes(4, "big") + done_frame)
    writer.write_eof()

Production Engineering Notes

Alt-Svc and Protocol Negotiation

Clients do not know a server supports HTTP/3 until the server tells them. The Alt-Svc header in an HTTP/1.1 or HTTP/2 response tells the client it can reach the same service over h3 (HTTP/3). The client will use HTTP/3 for subsequent connections, unlocking 0-RTT session resumption.

HTTP/2 response header:
  alt-svc: h3=":443"; ma=86400

Translation: "Try h3 on this hostname, port 443, for up to 86400 seconds (24 hours)."

QUIC and UDP Blocking in Enterprise Networks

Many enterprise firewalls block UDP on port 443, causing QUIC connections to fail silently. Well-implemented clients fall back to TCP automatically (Chrome does this). If you see anomalously low QUIC adoption in your metrics for enterprise user segments, UDP blocking is the likely cause.

Triton and vLLM HTTP/3 Support

As of 2024, Triton Inference Server and vLLM do not natively speak QUIC. The recommended architecture is a QUIC-terminating proxy (Caddy, Envoy, Nginx 1.25+) in front of the ML serving backend. The proxy speaks HTTP/3 to clients and HTTP/2 (for gRPC) or HTTP/1.1 to the backend. This is correct architecture anyway - the internal datacenter network does not benefit from QUIC.

Measuring HTTP/3 Adoption in Your Serving Logs

# Parse Nginx/Caddy access logs to measure HTTP/3 adoption rate
import re
from collections import Counter

def measure_protocol_adoption(log_file: str) -> dict:
    """
    Parse access log entries to measure HTTP version distribution.
    Nginx log format includes $server_protocol or $http2/$http3 vars.
    """
    protocol_counts = Counter()

    with open(log_file) as f:
        for line in f:
            # Look for HTTP version marker in log
            if '"HTTP/3"' in line or 'h3' in line:
                protocol_counts['HTTP/3'] += 1
            elif '"HTTP/2.0"' in line or 'h2' in line:
                protocol_counts['HTTP/2'] += 1
            elif '"HTTP/1.1"' in line:
                protocol_counts['HTTP/1.1'] += 1

    total = sum(protocol_counts.values())
    return {
        proto: {"count": count, "pct": round(count / total * 100, 1)}
        for proto, count in protocol_counts.items()
    }

:::danger 0-RTT Replay Attacks QUIC 0-RTT early data is susceptible to replay attacks. An attacker who records a 0-RTT handshake packet can replay it, causing the server to re-execute the early-data request. For ML inference endpoints that trigger side effects (charge a user, send a message, update a database), replay could cause real-world harm.

Mitigations:

Only accept idempotent operations in 0-RTT (pure read/compute, no writes)
Implement server-side nonce tracking to detect replays
For sensitive endpoints: ssl_early_data off; in Nginx, or disable 0-RTT in your QUIC configuration
Never accept financial transactions or write operations via 0-RTT early data :::

:::warning QUIC CPU Overhead at High Throughput QUIC's user-space implementation has higher CPU overhead than kernel TCP at the same throughput. At 10 Gbps+ per server, QUIC's congestion control and packet processing can consume 15-25% more CPU than equivalent HTTP/2 traffic. This matters for high-throughput embedding services or batch inference APIs.

Profile your inference server CPU with QUIC enabled before deploying to production. If CPU is the bottleneck, keep HTTP/2 for bulk batch inference and use HTTP/3 only for low-latency user-facing endpoints with variable network conditions. :::

:::warning QUIC Congestion Control Tuning QUIC uses Cubic or BBR congestion control by default, implemented in user space. Unlike TCP's kernel implementation which benefits from decades of OS-level tuning, QUIC's user-space congestion control may perform worse on high-bandwidth, low-latency datacenter links. Test with both CUBIC and BBR settings in your QUIC library and benchmark on your actual network before committing to a production configuration. :::

Interview Q&A

Q1: What is TCP head-of-line blocking, and how does QUIC solve it without eliminating reliable delivery?

TCP delivers a byte stream in strict order. When a packet is lost, TCP buffers all subsequent packets and waits for the retransmission before delivering anything to the application. In HTTP/2, which multiplexes multiple logical streams onto one TCP connection, all streams are blocked when any single packet is lost - even streams whose data arrived perfectly. QUIC implements its own stream multiplexing above UDP. QUIC still retransmits lost packets (reliable delivery is preserved), but it only delays the specific QUIC stream that was carrying data in the lost packet. Other streams continue delivering data immediately. QUIC does not eliminate retransmission - it eliminates the cross-stream blocking caused by TCP's mandatory in-order delivery guarantee.

Q2: Why does QUIC use UDP instead of defining a new OS-level transport protocol?

Building a new OS-level transport protocol would require kernel updates on every operating system and compatibility with every network middlebox (firewalls, NAT gateways, load balancers) along the path. Network middleboxes are extensively deployed and often only understand TCP and UDP - new transport protocols get blocked or corrupted. Google's early QUIC experiments confirmed this: novel transport protocols were frequently filtered or mangled in corporate networks. UDP was already universally permitted. Building QUIC as a user-space protocol on UDP meant Google could deploy it in Chrome and their servers immediately, without waiting for OS kernel updates or network infrastructure changes. The tradeoff is higher CPU overhead compared to kernel TCP, which is acceptable for most inference workloads.

Q3: Explain QUIC connection migration and why it matters for ML inference on mobile devices.

TCP connections are identified by a 4-tuple: (src IP, src port, dst IP, dst port). When a mobile device switches from WiFi to LTE, its IP address changes and all TCP connections break. Any in-flight inference requests fail and must be retried. QUIC identifies connections by a 64-bit Connection ID that is independent of IP addresses. When the network changes, the client sends a PATH_CHALLENGE frame on the new path, the server responds with PATH_RESPONSE, and the QUIC connection continues without interruption. An LLM inference request that has generated 200 of 500 tokens survives the network transition cleanly - the remaining tokens arrive on the new connection without re-queuing or retry logic needed in the application.

Q4: A user reports that HTTP/3 falls back to HTTP/2 for many corporate users. What are the likely causes?

The most common cause is enterprise firewalls blocking UDP port 443. QUIC uses UDP, and many corporate network policies block non-TCP 443 traffic. A well-implemented HTTP/3 client falls back to HTTP/2 gracefully. Other causes: the Alt-Svc advertisement is stripped by a corporate SSL inspection proxy; the server's UDP rate limiting triggers because many employees share one NAT IP; or the corporate proxy itself does not support QUIC pass-through. To diagnose, check what percentage of requests from specific ASNs (corporate IP blocks) use h3 vs h2 in your server logs, then cross-reference with whether those ASNs have known QUIC-blocking policies.

Q5: What is the performance difference between HTTP/3 and HTTP/2 for internal microservice communication in a datacenter?

For internal datacenter communication, HTTP/2 is usually preferable. Datacenter networks have extremely low packet loss rates (under 0.01%), so QUIC's HOL blocking elimination provides no practical benefit. The overhead of QUIC's user-space implementation - congestion control, packet number encryption, session ticket management - consumes more CPU than kernel TCP for the same throughput. Additionally, gRPC (the standard for internal ML microservice communication) is built on HTTP/2 with mature tooling. HTTP/3 provides its largest benefits in lossy network environments: mobile, intercontinental, or high-contention edge networks. The correct architecture is HTTP/3 at the ingress (CDN, load balancer edge) and HTTP/2 (gRPC) internally.

Q6: What is 0-RTT in QUIC, when is it safe for ML inference, and when is it not?

0-RTT allows a QUIC client that has previously connected to a server to include application data in the very first packet, before the handshake completes. The server issued a session ticket in the previous connection; the client uses this ticket to encrypt early data. The benefit: first-request latency drops by one RTT (100-300ms on intercontinental links). The risk: 0-RTT data is susceptible to replay attacks. An attacker who captures the initial packet can resend it, causing the server to process the request a second time. For pure compute inference (image classification, text embedding) with no side effects, replay only costs extra compute - acceptable. For inference that triggers downstream actions (charging a user, updating a record, sending a notification), replay could cause real harm. Disable 0-RTT for those endpoints, or implement server-side nonce tracking.

QUIC Packet Internals

Understanding QUIC's packet structure explains why it achieves its performance characteristics.

A QUIC packet consists of a header (partially encrypted) and a payload (fully encrypted with TLS 1.3). Unlike TCP, which exposes sequence numbers, window sizes, and flags to network middleboxes, QUIC hides almost all transport-layer information from anything between the endpoints.

QUIC Long Header Packet (Initial/Handshake):
+--+--+--+--+--+--+--+--+  <- Header Form bit (1) + fixed bit + type bits
|  Connection ID (8 bytes)  |  <- Unencrypted: allows routing before handshake
+--+--+--+--+--+--+--+--+
|  Version (4 bytes)        |  <- QUIC version (0x00000001 for IETF QUIC)
+--+--+--+--+--+--+--+--+
|  Token (variable)         |  <- Stateless retry token (anti-amplification)
+--+--+--+--+--+--+--+--+
|  Length (variable)        |  <- Length of remaining packet
+--+--+--+--+--+--+--+--+
|  Packet Number (1-4 bytes)|  <- Encrypted: prevents packet number surveillance
+--+--+--+--+--+--+--+--+
|  Payload (encrypted)      |  <- QUIC frames: STREAM, ACK, CRYPTO, PADDING...
|  + AEAD authentication tag|
+--+--+--+--+--+--+--+--+

QUIC Short Header Packet (1-RTT data):
+-+--+--+--+--+--+--+--+   <- Header Form bit (0, short header)
|  Destination Connection ID|  <- Enables connection migration (ID != IP)
+--+--+--+--+--+--+--+--+
|  Packet Number (encrypted)|
+--+--+--+--+--+--+--+--+
|  Payload (TLS 1.3 AEAD)  |  <- Multiple STREAM frames, one per active stream
+--+--+--+--+--+--+--+--+

One QUIC packet can contain STREAM frames for multiple streams. This is why losing a packet that contains only Stream C data does not block Stream A - the Stream A STREAM frames were in different packets.

HTTP/3 vs HTTP/2 vs HTTP/1.1 - Quantitative Comparison

Scenario	HTTP/1.1	HTTP/2	HTTP/3
First request (new connection, 100ms RTT)	+300ms overhead	+150ms overhead	+50ms overhead
First request (0-RTT resumption)	+150ms (no TLS resumption)	+50ms	0ms overhead
0% packet loss, p99 latency	Baseline	-5ms (multiplexing)	-8ms (0-RTT)
1% packet loss, p99 latency	+100-300ms	+300-700ms	+10-20ms
2% packet loss, p99 latency	+300-600ms	+700-2000ms	+20-40ms
CPU overhead per Gbps	Low (kernel TCP)	Low (kernel TCP)	15-25% higher (user-space)
Connection migration (WiFi to LTE)	Full reconnect (~500ms)	Full reconnect	Seamless (0ms)
UDP firewall blocking	N/A	N/A	Falls back to HTTP/2

The numbers make the decision rule clear: HTTP/3 wins decisively when packet loss exceeds 0.5%. It has slightly higher CPU cost and provides no benefit on reliable networks. The crossover point is roughly 0.3% packet loss - above that, HTTP/3 reduces p99 latency enough to justify the marginal CPU overhead.

Monitoring QUIC in Production

# quic_metrics_collector.py
# Parse QUIC connection logs to extract performance metrics
# Works with NGINX/Caddy access logs annotated with protocol version

import re
from collections import defaultdict
from typing import Dict, List


def parse_access_log_with_protocol(log_path: str) -> Dict[str, List[float]]:
    """
    Parse web server access logs and separate latencies by HTTP version.
    Requires access log format to include $server_protocol or similar.

    Nginx log format for QUIC:
    log_format ml_api '$remote_addr - [$time_local] "$request" '
                      '$status $body_bytes_sent $request_time '
                      '"$http3" "$http2"';
    """
    latencies_by_protocol = defaultdict(list)

    # Regex matches: status body_bytes request_time_seconds http3_flag http2_flag
    pattern = re.compile(
        r'(\d{3}) (\d+) ([\d.]+) "([^"]*)" "([^"]*)"'
    )

    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if not m:
                continue

            status, _, latency_s, http3, http2 = m.groups()
            latency_ms = float(latency_s) * 1000

            if http3 and http3 != "-":
                protocol = "HTTP/3"
            elif http2 and http2 != "-":
                protocol = "HTTP/2"
            else:
                protocol = "HTTP/1.1"

            latencies_by_protocol[protocol].append(latency_ms)

    return dict(latencies_by_protocol)


def compute_protocol_stats(latencies_by_protocol: Dict[str, List[float]]) -> dict:
    """Compute p50/p95/p99 per protocol for capacity planning."""
    import statistics

    stats = {}
    for protocol, latencies in latencies_by_protocol.items():
        if not latencies:
            continue
        s = sorted(latencies)
        n = len(s)
        stats[protocol] = {
            "count": n,
            "p50_ms": round(statistics.median(latencies), 2),
            "p95_ms": round(s[int(n * 0.95)], 2),
            "p99_ms": round(s[int(n * 0.99)], 2),
            "pct_of_traffic": 0.0,  # filled below
        }

    total = sum(v["count"] for v in stats.values())
    for protocol in stats:
        stats[protocol]["pct_of_traffic"] = round(
            stats[protocol]["count"] / total * 100, 1
        )

    return stats

Why This Exists​

Historical Context​

Core Concepts: The TCP Head-of-Line Blocking Problem​

QUIC's Key Features​

0-RTT Connection Establishment​

Connection Migration​

TLS 1.3 is Mandatory​

Code: aioquic Server and Client​

QUIC for ML Inference Serving​

When HTTP/3 Helps​

When HTTP/3 Does Not Help​

Caddy Configuration (HTTP/3 by Default)​

Nginx HTTP/3 Configuration​

Envoy HTTP/3 for Kubernetes​

WebTransport for Streaming LLM Inference​

Production Engineering Notes​

Alt-Svc and Protocol Negotiation​

QUIC and UDP Blocking in Enterprise Networks​

Triton and vLLM HTTP/3 Support​

Measuring HTTP/3 Adoption in Your Serving Logs​

Interview Q&A​

QUIC Packet Internals​

HTTP/3 vs HTTP/2 vs HTTP/1.1 - Quantitative Comparison​

Monitoring QUIC in Production​