
gRPC and Protocol Buffers

The Inference Bottleneck

Your team just deployed a transformer model that achieves state-of-the-art accuracy on your benchmark dataset. Inference time per sample is 8ms on GPU. The business expects 50,000 requests per second with P99 latency under 100ms. The math works: 50,000 requests per second times 8ms of GPU time each is 400 GPU-seconds of work per second, so about 400 GPU replicas. You deploy them.

The actual P99 latency is 340ms. Not 100ms. GPU utilization is 40%. You have twice as many GPUs as you need, but the system is slower than the target. The bottleneck is not computation. It is serialization and network protocol overhead.

Your team initially built the serving API in Flask with JSON. Each inference request: serialize 512 float32 features to JSON (about 8KB), send over HTTP/1.1, deserialize JSON on the server, run inference, serialize the 1000-class probability vector to JSON (about 16KB), send response, deserialize on client. JSON parsing alone is taking 15ms per request. The HTTP/1.1 connection overhead adds another 5ms. You are spending 2.5x more time on protocol overhead than on the actual model inference.

Switching to gRPC with Protocol Buffers changes this picture dramatically. The same 512 float32 features serialize to approximately 2KB in protobuf binary format (vs 8KB in JSON). Parsing time drops from 15ms to under 1ms. HTTP/2 multiplexing eliminates per-request connection overhead. The P99 latency drops to 45ms. GPU utilization climbs to 90%. You decommission half the GPU replicas.

This lesson teaches you exactly why that transformation happens - from the binary wire format of protobuf fields, through HTTP/2 stream multiplexing, to the gRPC patterns that production ML serving systems actually use.

The lesson also covers the production details that documentation skips: how to implement interceptors for auth and distributed tracing, how to handle deadlines and cancellation correctly, how to set up health checks that load balancers actually understand, and where gRPC falls short for specific ML serving patterns and what to use instead.


Why This Exists - The Problem with REST and JSON

REST over HTTP/1.1 with JSON bodies was the default API pattern for a decade. It works. It is easy to debug with curl. Every language has JSON libraries. Browsers speak it natively. For CRUD APIs serving web applications, REST/JSON is entirely appropriate.

For ML inference pipelines, REST/JSON has three structural problems:

Problem 1 - Serialization overhead: JSON is a text format. A float32 value 0.123456789 encodes as 11 bytes in JSON. In binary protobuf, the same value encodes as exactly 4 bytes. A 512-element float32 feature vector: 8KB in JSON, 2KB in protobuf. The 4x size difference multiplies directly into network transfer time and memory allocation costs.
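
A quick way to see the size gap using only the standard library (the exact JSON size depends on how many digits each float prints with; struct.pack stands in for protobuf's packed repeated float encoding, which also uses 4 bytes per value plus a small tag/length prefix):

import json
import struct

features = [0.123456789] * 512

json_size = len(json.dumps(features).encode("utf-8"))            # text: several KB
binary_size = len(struct.pack(f"{len(features)}f", *features))   # 4 bytes per float32 = 2048

print(json_size, binary_size)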

Problem 2 - HTTP/1.1 head-of-line blocking: HTTP/1.1 processes requests sequentially on one connection. To make multiple concurrent requests, you need multiple connections. A typical serving client pool opens 50-100 TCP connections to handle concurrent requests, each paying the handshake cost and consuming file descriptors.

Problem 3 - No streaming semantics: REST is request-response. You send a complete request, you get a complete response. For ML use cases like streaming token generation (where you want to return tokens as they are generated), or for sending a continuous stream of images to a video analysis model, REST has no native mechanism. You resort to polling, WebSockets (which have their own complexity), or server-sent events (one-directional only).

gRPC solves all three. Protocol Buffers handle serialization. HTTP/2 handles multiplexing and streaming. The IDL (Interface Definition Language) provides a strongly typed contract between services.


Historical Context - From Internal Google Tool to Industry Standard

gRPC was developed at Google as the evolution of an internal RPC framework called Stubby, which had been in production at Google since around 2001. Stubby handled Google's internal service-to-service communication at massive scale but was tightly coupled to Google's internal infrastructure.

In 2015, Google open-sourced gRPC as a re-implementation of Stubby built on standard technologies: HTTP/2 (finalized as RFC 7540 in May 2015) and Protocol Buffers (version 3, released in 2016). The timing was deliberate - gRPC was designed to leverage the newly standardized HTTP/2 transport.

Protocol Buffers (protobuf) have a longer history at Google, in use since 2001. The name comes from the idea that you are defining the protocol for buffering messages between services. Proto3, the version used with gRPC, simplified proto2 by removing required fields (a common source of versioning issues) and making all fields optional by default.

The ecosystem grew rapidly. By 2020, gRPC was the dominant RPC framework for microservices at companies like Netflix, Square, Lyft, and Cloudflare. The ML serving community adopted it heavily because TensorFlow Serving was one of the first major ML serving systems to use gRPC natively. Today, NVIDIA Triton, TorchServe (via extensions), and most modern ML serving frameworks support gRPC as a first-class option.


Core Concepts

Protocol Buffers - Wire Format

Before understanding gRPC, you need to understand what protobuf actually does to your data. The efficiency gains are not magic - they come from a specific binary encoding that is worth understanding.

A protobuf message is defined in a .proto file:

message InferenceRequest {
  int32 model_version = 1;
  repeated float features = 2;
  string request_id = 3;
}

The numbers (= 1, = 2, = 3) are field numbers - they appear in the binary encoding, not the field names. This is crucial: if you rename a field, existing binary messages still work because the wire format uses numbers, not names.

The binary encoding of each field is a tag-value pair:

tag = (field_number << 3) | wire_type

Wire types:

  • 0: Varint (int32, bool, enum)
  • 1: 64-bit (fixed64, double)
  • 2: Length-delimited (string, bytes, embedded messages, repeated fields)
  • 5: 32-bit (fixed32, float)

Varint encoding is how protobuf achieves compact integer encoding. Small integers encode in fewer bytes than large ones:

  • Value 1: encoded as 0x01 (1 byte)
  • Value 300: encoded as 0xAC 0x02 (2 bytes)
  • Value 2^21 = 2,097,152: encoded as 4 bytes

For the model_version = 1 field (field number 1, varint wire type):

  • Tag: (1 << 3) | 0 = 0x08
  • Value: 0x01
  • Total: 2 bytes

Compare this to JSON: "model_version": 1 - 17 bytes including the key.
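
A minimal sketch of varint encoding in Python, reproducing the byte values above (illustrative only, not the official protobuf library):

def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint:
    7 bits of payload per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert encode_varint(1) == b"\x01"             # 1 byte
assert encode_varint(300) == b"\xac\x02"       # 2 bytes
assert encode_varint((1 << 3) | 0) == b"\x08"  # tag for field 1, wire type 0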

gRPC Architecture - HTTP/2 and Multiplexing

gRPC runs over HTTP/2 (RFC 7540). Understanding why HTTP/2 matters for ML serving requires understanding what HTTP/1.1 gets wrong.

HTTP/1.1 is fundamentally sequential on a single connection. Each request must complete before the next can start (without pipelining, which is unreliable in practice). To make 10 concurrent requests, clients open 10 TCP connections, each with its own handshake overhead and kernel resources.

HTTP/2 introduces streams - multiple logical request-response pairs multiplexed over a single TCP connection. A single HTTP/2 connection between an ML client and model server can handle hundreds of concurrent inference requests simultaneously, with:

  • Stream multiplexing: request and response frames from different RPCs interleaved on one TCP connection
  • Header compression (HPACK): repeated headers (like authorization tokens) are not retransmitted
  • Binary framing: the HTTP/2 protocol itself is binary, reducing parsing overhead vs the text-based HTTP/1.1

For ML serving, this means a single persistent HTTP/2 connection replaces the 50-100 connection pool you needed with HTTP/1.1. Less overhead, fewer kernel resources, no connection management logic.
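
A sketch of what that buys you on the client side: many in-flight RPCs share one channel. This assumes the generated InferenceServiceStub from the proto shown later in this lesson and a prepared list of requests.

from concurrent.futures import ThreadPoolExecutor

import grpc
import inference_pb2_grpc  # generated from the proto shown later in this lesson

# One channel, one TCP connection - HTTP/2 multiplexes all concurrent RPCs over it
channel = grpc.insecure_channel("model-server:50051")
stub = inference_pb2_grpc.InferenceServiceStub(channel)

def predict_one(request):
    return stub.Predict(request, timeout=0.5)

# 'requests' is assumed to be a list of PredictRequest messages built elsewhere
with ThreadPoolExecutor(max_workers=64) as pool:
    responses = list(pool.map(predict_one, requests))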

The Four gRPC Service Types

gRPC supports four communication patterns. Each maps to a different ML serving use case:

1. Unary RPC: Standard request-response. Client sends one request, server sends one response. Use for: standard inference (classify this image, embed this text, predict this tabular row).

2. Server Streaming: Client sends one request, server sends a stream of responses. Use for: LLM text generation (stream tokens back as they are generated), returning ranked search results progressively.

3. Client Streaming: Client sends a stream of requests, server sends one response. Use for: sending a sequence of audio frames for speech recognition, sending video frames for action recognition.

4. Bidirectional Streaming: Both client and server send streams independently. Use for: continuous video analysis with per-frame predictions, real-time trading signal generation.
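
For the client-streaming case (pattern 3), the client passes an iterator of requests and receives a single response once the stream ends. A sketch, assuming a hypothetical speech service with rpc Transcribe(stream AudioChunk) returns (Transcript) - this method is not part of the proto used later in this lesson:

def audio_chunks(path: str, chunk_size: int = 32_000):
    """Yield one AudioChunk request per block of raw audio bytes."""
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            yield speech_pb2.AudioChunk(data=block)  # hypothetical message type

# The stub sends the whole request stream, then returns the single Transcript response
transcript = speech_stub.Transcribe(audio_chunks("meeting.wav"), timeout=30.0)
print(transcript.text)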

gRPC Deadlines and Cancellation

gRPC has first-class support for deadlines - a timestamp by which the entire RPC chain must complete. Deadlines propagate automatically across service-to-service calls.

This is critical for ML serving. Without deadlines, a slow model or database call can cause requests to pile up, consuming resources indefinitely and causing cascading failures. With deadlines, slow requests are cancelled and resources are freed.

import grpc

# Client sets a deadline
channel = grpc.insecure_channel("model-server:50051")
stub = InferenceServiceStub(channel)

# This RPC must complete within 500ms or it is cancelled
try:
    response = stub.Predict(
        request,
        timeout=0.5,  # 500ms deadline
    )
except grpc.RpcError as e:
    if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # Request timed out - return cached result or error
        pass

The server can also check if the client has cancelled:

def Predict(self, request, context):
    # Periodically check if client is still waiting
    if not context.is_active():
        return PredictResponse()  # Client cancelled - no point continuing

    features = preprocess(request.features)

    # Check again after preprocessing
    if not context.is_active():
        return PredictResponse()

    prediction = self.model(features)
    return PredictResponse(predictions=prediction.tolist())

gRPC vs REST for ML Serving

The choice is not universal. Understanding the tradeoffs:

| Dimension | gRPC | REST/JSON |
| --- | --- | --- |
| Serialization size | 2-5x smaller | Baseline |
| Serialization speed | 5-10x faster | Baseline |
| Browser support | Needs gRPC-Web proxy | Native |
| Debuggability | Requires protobuf tools | curl, Postman |
| Streaming | First-class (4 types) | Limited (SSE, WebSocket) |
| Load balancer support | Needs HTTP/2-aware LB | Universal |
| Code generation | Required | Optional |
| Versioning | Field numbers, backward compat | URL versioning, schema optional |

Rule of thumb for ML: use gRPC for service-to-service communication in your ML platform (training coordinator to workers, serving client to model server). Use REST for public-facing APIs where client diversity and debuggability matter.


Code Examples

Complete Proto Definition for ML Inference

// inference.proto
// Run: python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. inference.proto

syntax = "proto3";

package inference;

// Request for a single prediction
message PredictRequest {
  // Unique identifier for request tracing
  string request_id = 1;

  // Model to use for inference
  string model_name = 2;
  int32 model_version = 3;

  // Input features as flat float array
  // For 2D inputs (e.g., batch), use shape to reshape
  repeated float features = 4;
  repeated int32 feature_shape = 5;

  // Optional: raw bytes input (images, audio)
  bytes raw_input = 6;
  string input_format = 7; // "jpeg", "png", "wav", etc.

  // Deadline for this specific prediction
  int64 deadline_ms = 8;
}

// Response for a single prediction
message PredictResponse {
  string request_id = 1;

  // Output probabilities or regression values
  repeated float predictions = 2;
  repeated int32 prediction_shape = 3;

  // Top-k class labels with scores
  repeated ClassScore top_k = 4;

  // Metadata
  string model_name = 5;
  int32 model_version = 6;
  float inference_latency_ms = 7;
}

message ClassScore {
  string label = 1;
  float score = 2;
  int32 class_id = 3;
}

// Embedding request - for vector search / RAG systems
message EmbedRequest {
  string request_id = 1;
  string text = 2;
  string model_name = 3;
  bool normalize = 4; // L2-normalize the output vector
}

message EmbedResponse {
  string request_id = 1;
  repeated float embedding = 2;
  int32 dimension = 3;
}

// Streaming generation request - for LLMs
message GenerateRequest {
  string request_id = 1;
  string prompt = 2;
  string model_name = 3;
  int32 max_tokens = 4;
  float temperature = 5;
  float top_p = 6;
}

// One token at a time in streaming response
message GenerateToken {
  string request_id = 1;
  string token = 2;
  int32 token_id = 3;
  float log_prob = 4;
  bool is_final = 5;       // True on the last token
  string finish_reason = 6; // "stop", "length", "error"
}

// Batch request - more efficient than N unary calls
message BatchPredictRequest {
  repeated PredictRequest requests = 1;
  // If true, server pads/groups requests for optimal GPU batching
  bool allow_dynamic_batching = 2;
}

message BatchPredictResponse {
  repeated PredictResponse responses = 1;
}

// The inference service definition
service InferenceService {
  // Standard single prediction
  rpc Predict(PredictRequest) returns (PredictResponse);

  // Get text embedding
  rpc Embed(EmbedRequest) returns (EmbedResponse);

  // Stream tokens for LLM generation
  rpc Generate(GenerateRequest) returns (stream GenerateToken);

  // Efficient batch prediction
  rpc BatchPredict(BatchPredictRequest) returns (BatchPredictResponse);

  // Bidirectional streaming - continuous prediction pipeline
  rpc StreamingPredict(stream PredictRequest) returns (stream PredictResponse);
}

Python gRPC Server

"""
gRPC inference server implementing the InferenceService definition.
Demonstrates: proper error handling, deadlines, metadata, and
server-side streaming for LLM token generation.
"""
import time
import logging
import threading
from concurrent import futures
from typing import Iterator

import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc
from grpc_health.v1.health import HealthServicer

# Generated from inference.proto
import inference_pb2
import inference_pb2_grpc

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class InferenceServicer(inference_pb2_grpc.InferenceServiceServicer):
    """
    Production gRPC inference server.

    Key design decisions:
    - Check context.is_active() to avoid wasted work on cancelled RPCs
    - Set appropriate gRPC status codes on errors (not just raise exceptions)
    - Log request_id for distributed tracing correlation
    - Validate inputs and return INVALID_ARGUMENT for bad requests
    """

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.model = self._load_model(model_path)
        self._request_count = 0
        self._lock = threading.Lock()

    def _load_model(self, path: str):
        """Load model - replace with your actual model loading."""
        logger.info(f"Loading model from {path}")
        # import torch; return torch.load(path)
        return None  # placeholder

    def Predict(
        self,
        request: inference_pb2.PredictRequest,
        context: grpc.ServicerContext,
    ) -> inference_pb2.PredictResponse:
        """
        Unary RPC: single prediction.
        Handles input validation, deadline checking, and error propagation.
        """
        start_time = time.perf_counter()

        # Track request count for metrics
        with self._lock:
            self._request_count += 1

        # Input validation - return INVALID_ARGUMENT, not INTERNAL
        if not request.features:
            context.abort(
                grpc.StatusCode.INVALID_ARGUMENT,
                "features field is required and must not be empty",
            )
            return inference_pb2.PredictResponse()

        if len(request.features) != 512:  # example expected size
            context.abort(
                grpc.StatusCode.INVALID_ARGUMENT,
                f"Expected 512 features, got {len(request.features)}",
            )
            return inference_pb2.PredictResponse()

        # Check if client is still waiting before expensive compute
        if not context.is_active():
            logger.info(f"[{request.request_id}] Client cancelled before inference")
            return inference_pb2.PredictResponse()

        try:
            # Actual inference
            features = list(request.features)
            predictions = self._run_inference(features)

            # Check again after inference (might have timed out during compute)
            if not context.is_active():
                logger.info(f"[{request.request_id}] Client cancelled after inference")
                return inference_pb2.PredictResponse()

            latency_ms = (time.perf_counter() - start_time) * 1000

            # Add response metadata for tracing
            context.set_trailing_metadata([
                ("x-model-version", str(request.model_version)),
                ("x-inference-latency-ms", f"{latency_ms:.2f}"),
            ])

            return inference_pb2.PredictResponse(
                request_id=request.request_id,
                predictions=predictions,
                model_name=request.model_name,
                model_version=request.model_version,
                inference_latency_ms=latency_ms,
            )

        except MemoryError:
            logger.error(f"[{request.request_id}] OOM during inference")
            context.abort(
                grpc.StatusCode.RESOURCE_EXHAUSTED,
                "Insufficient memory for inference",
            )
            return inference_pb2.PredictResponse()

        except Exception as e:
            logger.exception(f"[{request.request_id}] Inference failed")
            context.abort(
                grpc.StatusCode.INTERNAL,
                f"Inference failed: {str(e)}",
            )
            return inference_pb2.PredictResponse()

    def Generate(
        self,
        request: inference_pb2.GenerateRequest,
        context: grpc.ServicerContext,
    ) -> Iterator[inference_pb2.GenerateToken]:
        """
        Server streaming RPC: LLM token generation.
        Yields one token at a time as the model generates them.
        Client receives tokens as they arrive - no waiting for full response.
        """
        logger.info(
            f"[{request.request_id}] Starting generation: "
            f"prompt='{request.prompt[:50]}...' max_tokens={request.max_tokens}"
        )

        tokens_generated = 0

        # Simulate token-by-token generation
        # Replace with actual LLM inference loop
        simulated_tokens = ["The", " quick", " brown", " fox", " jumps", "."]

        for token_text in simulated_tokens:
            # Check if client is still listening before generating more tokens
            if not context.is_active():
                logger.info(f"[{request.request_id}] Client disconnected at token {tokens_generated}")
                return

            # Check max_tokens limit
            if tokens_generated >= request.max_tokens:
                yield inference_pb2.GenerateToken(
                    request_id=request.request_id,
                    token="",
                    is_final=True,
                    finish_reason="length",
                )
                return

            tokens_generated += 1
            is_final = (token_text == simulated_tokens[-1])

            yield inference_pb2.GenerateToken(
                request_id=request.request_id,
                token=token_text,
                token_id=tokens_generated,
                log_prob=-0.5,
                is_final=is_final,
                finish_reason="stop" if is_final else "",
            )

            time.sleep(0.02)  # Simulate 20ms per token generation time

        logger.info(
            f"[{request.request_id}] Generation complete: {tokens_generated} tokens"
        )

    def StreamingPredict(
        self,
        request_iterator,
        context: grpc.ServicerContext,
    ) -> Iterator[inference_pb2.PredictResponse]:
        """
        Bidirectional streaming: continuous prediction pipeline.
        Use for video analysis, audio processing, or any continuous stream.
        """
        for request in request_iterator:
            if not context.is_active():
                return

            # Process each request and yield response immediately
            predictions = self._run_inference(list(request.features))
            yield inference_pb2.PredictResponse(
                request_id=request.request_id,
                predictions=predictions,
            )

    def _run_inference(self, features: list) -> list:
        """Placeholder - replace with real model inference."""
        time.sleep(0.008)  # 8ms simulated inference
        return [0.1, 0.7, 0.2]  # simulated class probabilities


def create_server(
    model_path: str,
    port: int = 50051,
    max_workers: int = 10,
) -> tuple:
    """
    Create and configure a production gRPC server.
    Returns the server and its health servicer.

    Key settings:
    - ThreadPoolExecutor with appropriate worker count
    - Maximum message size for large tensors
    - Health check service for load balancer integration
    """
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=max_workers),
        options=[
            # Allow large messages for model checkpoints and batch inference
            ("grpc.max_send_message_length", 100 * 1024 * 1024),  # 100MB
            ("grpc.max_receive_message_length", 100 * 1024 * 1024),  # 100MB
            # Keepalive: detect dead clients after 30 seconds
            ("grpc.keepalive_time_ms", 30000),
            ("grpc.keepalive_timeout_ms", 5000),
            ("grpc.keepalive_permit_without_calls", True),
            # Allow unlimited keepalive pings without in-flight data
            ("grpc.http2.max_pings_without_data", 0),
        ],
    )

    # Register inference service
    servicer = InferenceServicer(model_path)
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(servicer, server)

    # Register health check service (required for K8s readiness probes and LB)
    health_servicer = HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)

    # Mark service as SERVING
    health_servicer.set(
        "inference.InferenceService",
        health_pb2.HealthCheckResponse.SERVING,
    )

    server.add_insecure_port(f"[::]:{port}")

    logger.info(f"gRPC server configured on port {port}")
    return server, health_servicer


if __name__ == "__main__":
    server, health_servicer = create_server(
        model_path="/models/classifier/v1",
        port=50051,
        max_workers=10,
    )

    server.start()
    logger.info("Server started, waiting for requests")

    try:
        server.wait_for_termination()
    except KeyboardInterrupt:
        logger.info("Shutting down gracefully...")
        # Mark as NOT_SERVING so load balancer stops sending traffic
        health_servicer.set(
            "inference.InferenceService",
            health_pb2.HealthCheckResponse.NOT_SERVING,
        )
        # Give in-flight requests 30 seconds to complete
        server.stop(grace=30)

Python gRPC Client

"""
Production gRPC client for ML inference.
Demonstrates: connection management, deadline handling,
retry logic, and streaming consumption.
"""
import time
import logging
from typing import Iterator

import grpc
import inference_pb2
import inference_pb2_grpc

logger = logging.getLogger(__name__)


class InferenceClient:
    """
    gRPC client for ML inference service.

    Uses a single persistent channel per server address.
    HTTP/2 multiplexing allows many concurrent RPCs over this one channel.
    No need for a connection pool like TCP requires.
    """

    def __init__(
        self,
        server_address: str,
        timeout_ms: float = 500,
        use_tls: bool = False,
        tls_cert_path: str = None,
    ):
        self.server_address = server_address
        self.timeout_sec = timeout_ms / 1000.0

        # Create persistent channel - reused for all RPCs
        if use_tls and tls_cert_path:
            with open(tls_cert_path, "rb") as f:
                credentials = grpc.ssl_channel_credentials(f.read())
            self.channel = grpc.secure_channel(
                server_address,
                credentials,
                options=self._channel_options(),
            )
        else:
            self.channel = grpc.insecure_channel(
                server_address,
                options=self._channel_options(),
            )

        self.stub = inference_pb2_grpc.InferenceServiceStub(self.channel)

    def _channel_options(self) -> list:
        """gRPC channel options optimized for ML serving."""
        return [
            # Large messages for batch inference and checkpoint transfer
            ("grpc.max_send_message_length", 100 * 1024 * 1024),
            ("grpc.max_receive_message_length", 100 * 1024 * 1024),
            # Keepalive: detect dead server within 35 seconds
            ("grpc.keepalive_time_ms", 30000),
            ("grpc.keepalive_timeout_ms", 5000),
            # Enable keepalive even when no active RPCs
            ("grpc.keepalive_permit_without_calls", True),
            # Allow keepalive pings even when no data is in flight
            ("grpc.http2.max_pings_without_data", 0),
        ]

    def predict(
        self,
        features: list,
        model_name: str = "default",
        request_id: str = None,
        metadata: list = None,
    ) -> inference_pb2.PredictResponse:
        """
        Single prediction with timeout.
        Raises grpc.RpcError on failure - callers should handle:
        - DEADLINE_EXCEEDED: request took too long
        - UNAVAILABLE: server down or overloaded
        - INVALID_ARGUMENT: bad input data
        """
        import uuid
        request = inference_pb2.PredictRequest(
            request_id=request_id or str(uuid.uuid4()),
            model_name=model_name,
            features=features,
        )

        # Add auth token or trace headers as metadata
        call_metadata = metadata or []
        call_metadata.append(("x-client-id", "ml-client-v1"))

        try:
            response = self.stub.Predict(
                request,
                timeout=self.timeout_sec,
                metadata=call_metadata,
            )
            return response

        except grpc.RpcError as e:
            status = e.code()
            if status == grpc.StatusCode.DEADLINE_EXCEEDED:
                logger.warning(
                    f"Prediction timed out after {self.timeout_sec * 1000:.0f}ms"
                    f" [request_id={request.request_id}]"
                )
            elif status == grpc.StatusCode.UNAVAILABLE:
                logger.error(f"Model server unavailable: {e.details()}")
            raise

    def generate_stream(
        self,
        prompt: str,
        model_name: str = "llm-v1",
        max_tokens: int = 512,
        temperature: float = 0.7,
    ) -> Iterator[str]:
        """
        Stream LLM token generation.
        Yields tokens as they arrive - use for real-time output display.

        Usage:
            for token in client.generate_stream("Explain neural networks"):
                print(token, end="", flush=True)
        """
        import uuid
        request = inference_pb2.GenerateRequest(
            request_id=str(uuid.uuid4()),
            prompt=prompt,
            model_name=model_name,
            max_tokens=max_tokens,
            temperature=temperature,
        )

        try:
            # No timeout on streaming - use max_tokens as implicit limit
            for token_response in self.stub.Generate(request):
                yield token_response.token

                if token_response.is_final:
                    logger.info(
                        f"Generation complete: finish_reason={token_response.finish_reason}"
                    )
                    break

        except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.CANCELLED:
                logger.info("Generation cancelled by client")
            else:
                logger.error(f"Generation failed: {e.code()}: {e.details()}")
                raise

    def check_health(self) -> bool:
        """Check if server is ready to serve requests."""
        from grpc_health.v1 import health_pb2, health_pb2_grpc

        health_stub = health_pb2_grpc.HealthStub(self.channel)
        try:
            response = health_stub.Check(
                health_pb2.HealthCheckRequest(
                    service="inference.InferenceService"
                ),
                timeout=5.0,
            )
            return response.status == health_pb2.HealthCheckResponse.SERVING
        except grpc.RpcError:
            return False

    def close(self):
        """Close the gRPC channel."""
        self.channel.close()

gRPC Interceptors for Auth, Logging, and Retries

"""
gRPC interceptors for production ML serving.
Interceptors implement cross-cutting concerns without modifying service logic.

Three interceptors demonstrated:
1. AuthInterceptor: validate Bearer tokens
2. LoggingInterceptor: structured request/response logging
3. RetryInterceptor: automatic retry on transient failures
"""
import time
import logging
import functools
from typing import Callable

import grpc

logger = logging.getLogger(__name__)


class ServerAuthInterceptor(grpc.ServerInterceptor):
    """
    Validate Bearer tokens on all incoming RPCs.
    Returns UNAUTHENTICATED if token is missing or invalid.

    In production: replace _validate_token with JWT validation,
    API key lookup, or call to your auth service.
    """

    SKIP_AUTH_METHODS = {"/grpc.health.v1.Health/Check"}

    def __init__(self, valid_tokens: set[str]):
        self.valid_tokens = valid_tokens

    def intercept_service(self, continuation, handler_call_details):
        # Skip auth for health checks (called by load balancers)
        if handler_call_details.method in self.SKIP_AUTH_METHODS:
            return continuation(handler_call_details)

        # Extract Bearer token from metadata
        metadata = dict(handler_call_details.invocation_metadata)
        auth_header = metadata.get("authorization", "")

        if not auth_header.startswith("Bearer "):
            return self._unauthenticated("Missing Bearer token")

        token = auth_header[7:]  # Strip "Bearer " prefix
        if not self._validate_token(token):
            return self._unauthenticated("Invalid or expired token")

        return continuation(handler_call_details)

    def _validate_token(self, token: str) -> bool:
        """Replace with JWT validation or auth service call."""
        return token in self.valid_tokens

    def _unauthenticated(self, message: str):
        """Return an RPC handler that immediately aborts with UNAUTHENTICATED."""
        def abort(request, context):
            context.abort(grpc.StatusCode.UNAUTHENTICATED, message)
        return grpc.unary_unary_rpc_method_handler(abort)


class ServerLoggingInterceptor(grpc.ServerInterceptor):
    """
    Structured logging for all RPCs.
    Logs: method, request_id (from request or metadata), latency, status.
    Essential for debugging production ML serving issues.
    """

    def intercept_service(self, continuation, handler_call_details):
        handler = continuation(handler_call_details)
        if handler is None:
            return handler

        method = handler_call_details.method

        if handler.unary_unary:
            handler = handler._replace(
                unary_unary=self._wrap_unary(handler.unary_unary, method)
            )

        return handler

    def _wrap_unary(self, handler: Callable, method: str) -> Callable:
        @functools.wraps(handler)
        def wrapper(request, context):
            start_time = time.perf_counter()
            request_id = getattr(request, "request_id", "unknown")

            # Extract trace ID from metadata if present
            metadata = dict(context.invocation_metadata())
            trace_id = metadata.get("x-trace-id", "none")

            try:
                response = handler(request, context)
                latency_ms = (time.perf_counter() - start_time) * 1000

                logger.info(
                    "grpc_request",
                    extra={
                        "method": method,
                        "request_id": request_id,
                        "trace_id": trace_id,
                        "latency_ms": f"{latency_ms:.2f}",
                        "status": "OK",
                    },
                )
                return response

            except Exception as e:
                latency_ms = (time.perf_counter() - start_time) * 1000
                logger.error(
                    "grpc_request_error",
                    extra={
                        "method": method,
                        "request_id": request_id,
                        "trace_id": trace_id,
                        "latency_ms": f"{latency_ms:.2f}",
                        "status": "ERROR",
                        "error": str(e),
                    },
                )
                raise

        return wrapper


class ClientRetryInterceptor(grpc.UnaryUnaryClientInterceptor):
    """
    Client-side retry for transient gRPC failures.
    Retries on UNAVAILABLE and RESOURCE_EXHAUSTED with exponential backoff.

    Do NOT retry on INVALID_ARGUMENT or NOT_FOUND - these are permanent errors.
    """

    RETRYABLE_STATUS_CODES = {
        grpc.StatusCode.UNAVAILABLE,
        grpc.StatusCode.RESOURCE_EXHAUSTED,  # Server overloaded - backoff and retry
    }

    def __init__(self, max_retries: int = 3, initial_backoff_ms: float = 100):
        self.max_retries = max_retries
        self.initial_backoff_ms = initial_backoff_ms

    def intercept_unary_unary(self, continuation, client_call_details, request):
        last_error = None

        for attempt in range(self.max_retries + 1):
            try:
                outcome = continuation(client_call_details, request)
            except grpc.RpcError as e:
                outcome = e

            # In the sync client interceptor API, a failed RPC is typically
            # *returned* as an RpcError call object rather than raised here,
            # so check the returned value as well as catching exceptions.
            if not isinstance(outcome, grpc.RpcError):
                return outcome

            last_error = outcome

            if outcome.code() not in self.RETRYABLE_STATUS_CODES:
                raise outcome  # Non-retryable error - fail immediately

            if attempt >= self.max_retries:
                break  # Exhausted retries

            # Exponential backoff: 100ms, 200ms, 400ms
            backoff_ms = self.initial_backoff_ms * (2 ** attempt)
            logger.warning(
                f"RPC attempt {attempt + 1} failed with {outcome.code()}, "
                f"retrying in {backoff_ms:.0f}ms"
            )
            time.sleep(backoff_ms / 1000.0)

        raise last_error


def create_server_with_interceptors(
    servicer,
    valid_tokens: set[str],
    port: int = 50051,
) -> grpc.Server:
    """Create gRPC server with auth and logging interceptors."""
    from concurrent import futures

    auth_interceptor = ServerAuthInterceptor(valid_tokens)
    logging_interceptor = ServerLoggingInterceptor()

    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=10),
        interceptors=[auth_interceptor, logging_interceptor],
    )

    # Register the servicer with the generated helper, e.g.:
    # inference_pb2_grpc.add_InferenceServiceServicer_to_server(servicer, server)
    server.add_insecure_port(f"[::]:{port}")

    return server


def create_client_with_retry(server_address: str) -> grpc.Channel:
    """Create gRPC channel with client-side retry interceptor."""
    retry_interceptor = ClientRetryInterceptor(max_retries=3, initial_backoff_ms=100)

    channel = grpc.intercept_channel(
        grpc.insecure_channel(server_address),
        retry_interceptor,
    )
    return channel

Protobuf Serialization Benchmark

"""
Benchmark: protobuf vs JSON for ML inference payloads.
Demonstrates the concrete serialization advantage of protobuf.
"""
import json
import time
import struct

import inference_pb2 # Generated from inference.proto


def benchmark_serialization(num_features: int = 512, iterations: int = 10000):
"""
Compare JSON vs protobuf for a typical ML inference payload.
Tests both serialization (encode) and deserialization (decode).
"""
# Generate realistic feature vector
import random
features = [random.random() for _ in range(num_features)]
request_id = "benchmark-request-001"

# --- Protobuf ---
proto_request = inference_pb2.PredictRequest(
request_id=request_id,
model_name="classifier-v1",
model_version=1,
features=features,
)

proto_serialized = proto_request.SerializeToString()

# Serialize benchmark
start = time.perf_counter()
for _ in range(iterations):
proto_request.SerializeToString()
proto_encode_ms = (time.perf_counter() - start) / iterations * 1000

# Deserialize benchmark
start = time.perf_counter()
for _ in range(iterations):
inference_pb2.PredictRequest.FromString(proto_serialized)
proto_decode_ms = (time.perf_counter() - start) / iterations * 1000

# --- JSON ---
json_payload = {
"request_id": request_id,
"model_name": "classifier-v1",
"model_version": 1,
"features": features,
}

json_serialized = json.dumps(json_payload).encode("utf-8")

start = time.perf_counter()
for _ in range(iterations):
json.dumps(json_payload).encode("utf-8")
json_encode_ms = (time.perf_counter() - start) / iterations * 1000

start = time.perf_counter()
for _ in range(iterations):
json.loads(json_serialized)
json_decode_ms = (time.perf_counter() - start) / iterations * 1000

# Results
proto_size = len(proto_serialized)
json_size = len(json_serialized)

print(f"\nSerialization Benchmark ({num_features} features, {iterations} iterations)")
print("=" * 60)
print(f"{'Metric':<25} {'Protobuf':>12} {'JSON':>12} {'Speedup':>10}")
print("-" * 60)
print(f"{'Message size (bytes)':<25} {proto_size:>12} {json_size:>12} {json_size/proto_size:>9.1f}x")
print(f"{'Encode time (ms/req)':<25} {proto_encode_ms:>12.3f} {json_encode_ms:>12.3f} {json_encode_ms/proto_encode_ms:>9.1f}x")
print(f"{'Decode time (ms/req)':<25} {proto_decode_ms:>12.3f} {json_decode_ms:>12.3f} {json_decode_ms/proto_decode_ms:>9.1f}x")
print()
print(f"At 10k RPS, serialization savings per second:")
print(f" Bandwidth: {(json_size - proto_size) * 10000 / 1e6:.1f} MB/s less network traffic")
print(f" CPU: {(json_encode_ms + json_decode_ms - proto_encode_ms - proto_decode_ms) * 10:.1f}ms/s saved on serialization")


if __name__ == "__main__":
benchmark_serialization(num_features=512, iterations=10000)

Production Engineering Notes

gRPC Load Balancing Requires HTTP/2-Aware Proxy

Standard L4 load balancers (AWS NLB, hardware LBs) work at the TCP level. They route new TCP connections to backend servers. But gRPC uses HTTP/2, which multiplexes many RPCs over one TCP connection. If you put 10 gRPC servers behind an L4 load balancer, every RPC from a given client lands on whichever server received that client's single TCP connection - the other replicas sit idle from that client's perspective.

The solution is an L7 (application-layer) load balancer that understands HTTP/2 streams:

# Kubernetes Ingress with NGINX - L7 gRPC load balancing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grpc-inference-ingress
  annotations:
    # Proxy to the backend over gRPC (HTTP/2), balancing individual RPCs
    # across replicas rather than pinning a whole connection to one pod
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  rules:
    - host: inference.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: inference-service
                port:
                  number: 50051

Alternatively, use Envoy proxy (which has first-class gRPC support) or Istio (which uses Envoy as its data plane).

Health Checking for Kubernetes Integration

"""
Proper gRPC health checking for Kubernetes liveness and readiness probes.
grpc-health-probe is the standard tool for K8s gRPC health checks.
"""
from grpc_health.v1 import health_pb2, health_pb2_grpc
from grpc_health.v1.health import HealthServicer


class MLModelHealthServicer(HealthServicer):
"""
Extended health servicer that checks actual model readiness.
Not just "is the server running" but "is the model loaded and ready".
"""

def __init__(self, model_ref):
super().__init__()
self.model_ref = model_ref

def Check(self, request, context):
# Check model is actually loaded and can run inference
service = request.service

if service == "inference.InferenceService":
if self.model_ref.is_ready():
status = health_pb2.HealthCheckResponse.SERVING
else:
# Model still loading - Kubernetes will not send traffic
status = health_pb2.HealthCheckResponse.NOT_SERVING
else:
status = health_pb2.HealthCheckResponse.UNKNOWN

return health_pb2.HealthCheckResponse(status=status)

Kubernetes probe configuration:

# K8s deployment spec for gRPC ML serving
readinessProbe:
  exec:
    command:
      - /bin/grpc_health_probe
      - -addr=:50051
      - -service=inference.InferenceService
  initialDelaySeconds: 30  # Time for model to load
  periodSeconds: 10
  failureThreshold: 3

livenessProbe:
  exec:
    command:
      - /bin/grpc_health_probe
      - -addr=:50051
  initialDelaySeconds: 60
  periodSeconds: 30

Envoy as gRPC Proxy

Envoy is the preferred sidecar proxy for gRPC because it was designed with gRPC in mind. Key features for ML serving:

  • gRPC transcoding: translate REST/JSON requests to gRPC/protobuf transparently
  • Load balancing: per-RPC (not per-connection) load balancing
  • Circuit breaking: automatically stop sending to unhealthy backends
  • Retries: configurable retry policy per route
  • Observability: built-in metrics for gRPC status codes, latency histograms

Common Mistakes

:::danger Not Setting gRPC Deadlines in ML Serving

gRPC RPCs without deadlines can hang indefinitely if the server is slow or stuck. In ML serving, a model loaded on a memory-constrained GPU can take 10+ seconds to process a request during peak load. Without deadlines, all calling services accumulate blocked threads waiting for responses that may never arrive.

The failure mode is cascading: the inference service is slow, the feature service blocks waiting for it, the API gateway blocks waiting for the feature service, and eventually the entire request processing pipeline is backed up with connections waiting for a single slow model server.

Always set a deadline on every gRPC call:

# Wrong - no deadline
response = stub.Predict(request)

# Correct - 500ms deadline
response = stub.Predict(request, timeout=0.5)

Set deadlines based on your SLA, not on "how long you think it should take." :::

:::danger Ignoring gRPC Status Codes - Treating All Errors as Retryable

gRPC defines specific status codes for a reason. INVALID_ARGUMENT means the request was malformed - retrying will not help. UNAVAILABLE means the server is temporarily down - retry is appropriate. Treating all errors as retryable causes retry storms that make overloaded servers worse.

# Wrong - retry everything
for _ in range(3):
    try:
        response = stub.Predict(request)
        break
    except grpc.RpcError:
        time.sleep(0.1)

# Correct - only retry transient errors
RETRYABLE = {grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.RESOURCE_EXHAUSTED}

for attempt in range(3):
    try:
        response = stub.Predict(request, timeout=0.5)
        break
    except grpc.RpcError as e:
        if e.code() not in RETRYABLE or attempt == 2:
            raise
        time.sleep(0.1 * (2 ** attempt))

:::

:::warning Proto Field Number Changes Break Binary Compatibility

Changing a field number in a .proto file is a binary-incompatible change. Any serialized messages with the old field numbers will be misinterpreted when deserialized with the new schema.

The rule: never change a field number once it has been deployed to production. If you want to rename a field, change the name only - the field number stays the same and binary compatibility is preserved. If you want to change a field type, add a new field with a new number and deprecate the old one.

// WRONG - never do this in production
message OldRequest {
  repeated float features = 2; // Changed from field 1 to field 2!
}

// CORRECT - rename is fine (numbers unchanged), add new fields
message Request {
  string request_id = 1;       // OK to rename
  repeated float features = 2; // Number unchanged - safe
  string model_name = 3;       // New field - safe
  // int32 old_field = 4;      // Deprecated - reserve the number
  reserved 4;                  // Prevents future reuse
  reserved "old_field";
}

:::

:::warning gRPC Channel Not Reused - One Channel Per Request

Creating a new gRPC channel for every request defeats the purpose of HTTP/2 multiplexing. Channel creation involves a TCP handshake, TLS negotiation (if used), and HTTP/2 connection setup - strictly more work than the per-request connection overhead you were trying to escape from HTTP/1.1.

# Wrong - new channel per request
def predict(features):
    channel = grpc.insecure_channel("model-server:50051")
    stub = InferenceServiceStub(channel)
    request = PredictRequest(features=features)
    result = stub.Predict(request)
    channel.close()  # Wasteful!
    return result

# Correct - create channel once, reuse
class InferenceClient:
    def __init__(self, address):
        self.channel = grpc.insecure_channel(address)
        self.stub = InferenceServiceStub(self.channel)

    def predict(self, features):
        request = PredictRequest(features=features)
        return self.stub.Predict(request)  # Reuses existing channel

Create one channel per server address per process. For multiple server addresses, create one channel per address and use a client-side load balancer. :::


Interview Q&A

Q1: Explain the Protocol Buffers wire format and why it is more efficient than JSON for ML payloads.

Protocol Buffers use a binary tag-value encoding. Each field is encoded as a tag followed by the value. The tag combines the field number (defined in the .proto file) with a wire type (varint, 64-bit, length-delimited, or 32-bit). Field names are never transmitted - only numbers. This is the core efficiency advantage: a field named "embedding_vector_dimension" in JSON requires 26 bytes just for the key, plus repetition in every message. In protobuf, the field number "4" encodes as 1-2 bytes regardless of how long the field name is, and that same number is used in every message.

Varint encoding makes small integers compact. The value 1 encodes as a single byte. The value 300 encodes as 2 bytes. This matters for ML because many fields are small integers (class counts, batch sizes, sequence lengths).

For a 512-element float32 feature vector: JSON encodes as approximately 8KB (keys, commas, decimal representations). Protobuf encodes as approximately 2KB (4 bytes per float32 in packed repeated field encoding). The 4x size difference means 4x less bandwidth, 4x faster network transfer, and 5-10x faster parsing because binary parsing is dramatically faster than JSON text parsing.

Q2: What are the four gRPC service types and which is most appropriate for LLM token streaming?

The four gRPC service types are: unary (one request, one response), server streaming (one request, stream of responses), client streaming (stream of requests, one response), and bidirectional streaming (streams in both directions simultaneously).

For LLM token streaming, server streaming is the correct choice. The client sends one generation request (with the prompt, temperature, max_tokens), and the server yields tokens one at a time as they are generated. The client receives each token as it arrives rather than waiting for the complete response.

This is the pattern used by APIs like OpenAI's streaming completions. The alternative - unary RPC - would require waiting for the entire generation to complete before sending any response, which for a 500-token response at 50ms per token would mean a 25-second wait before the user sees anything. Server streaming reduces perceived latency to milliseconds (the first token arrives as soon as it is generated) even when total generation time is long.

Q3: What is the difference between gRPC deadlines and timeouts, and why do deadlines propagate differently?

In gRPC terminology, a timeout is a relative duration ("this RPC should complete within 500ms"), while a deadline is an absolute timestamp ("this entire request chain must complete by 14:30:00.500Z"). When a client makes an RPC with a timeout of 500ms at 14:30:00.000Z, gRPC converts this internally to a deadline of 14:30:00.500Z.

The key advantage of deadlines: they propagate correctly across multi-service call chains. Suppose service A calls service B with a 1000ms timeout, and service B calls service C. Service B passes the remaining deadline - if service B takes 300ms for its own processing, it passes a 700ms deadline (not the original 1000ms) to service C. This prevents the common cascading timeout failure where each service adds a new full timeout, resulting in a request that can wait 5x the original SLA.

In Python gRPC, when you receive a request, context.time_remaining() tells you how much deadline remains. Use this to set timeouts on downstream calls. Never set a fixed downstream timeout that ignores the remaining deadline.
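
A sketch of propagating the remaining budget inside a servicer (the feature_store_stub, GetFeatures call, and 50ms headroom are illustrative assumptions, not part of the service defined earlier):

def Predict(self, request, context):
    # Seconds left before the caller's deadline; None if the caller set no deadline
    remaining = context.time_remaining()

    # Reserve ~50ms of headroom for our own work, fall back to 500ms if no deadline
    downstream_timeout = max(0.01, remaining - 0.05) if remaining else 0.5

    # The downstream call gets the *remaining* budget, not a fresh fixed timeout
    features = self.feature_store_stub.GetFeatures(request, timeout=downstream_timeout)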

Q4: How do gRPC interceptors work and what are the three most important interceptors to implement for production ML serving?

gRPC interceptors are middleware functions that wrap RPC handlers on both client and server sides. On the server, a ServerInterceptor's intercept_service() method receives the handler and returns a (potentially wrapped) handler. The wrapped handler can execute code before and after the actual RPC, inspect and modify requests and responses, and abort the RPC by calling context.abort().

The three most important interceptors for production ML serving:

  1. Auth interceptor: Validate API keys or JWT tokens before any RPC reaches the service handler. Centralize auth logic so individual handlers do not need to implement it. Remember to skip auth for health check endpoints called by load balancers.

  2. Logging/tracing interceptor: Extract trace IDs from request metadata (or generate new ones), log every RPC with method, latency, status code, and trace ID. This is the foundation for debugging production issues - without per-RPC structured logs, you cannot diagnose why specific requests are slow or failing.

  3. Rate limiting interceptor: Track request rates per client ID extracted from metadata. Return RESOURCE_EXHAUSTED when a client exceeds their quota. This protects model servers from being overwhelmed by misbehaving clients and prevents one client from starving others in a shared ML serving platform.
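
A minimal sketch of the third pattern: a fixed-window, in-process counter keyed on the x-client-id metadata used elsewhere in this lesson. A production version would use a shared store, a smarter algorithm, and locking around the counter.

import time
from collections import defaultdict

import grpc


class RateLimitInterceptor(grpc.ServerInterceptor):
    """Reject clients that exceed a per-second request quota."""

    def __init__(self, max_rps_per_client: int = 100):
        self.max_rps = max_rps_per_client
        self._counts = defaultdict(int)  # (client_id, second) -> request count; old windows are never evicted in this sketch

    def intercept_service(self, continuation, handler_call_details):
        metadata = dict(handler_call_details.invocation_metadata)
        client_id = metadata.get("x-client-id", "anonymous")
        window = int(time.time())  # current one-second window

        self._counts[(client_id, window)] += 1
        if self._counts[(client_id, window)] > self.max_rps:
            def reject(request, context):
                context.abort(
                    grpc.StatusCode.RESOURCE_EXHAUSTED,
                    f"Rate limit exceeded for client {client_id}",
                )
            return grpc.unary_unary_rpc_method_handler(reject)

        return continuation(handler_call_details)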

Q5: Explain gRPC's L7 load balancing requirement and how to satisfy it in Kubernetes.

Standard L4 load balancers operate at the TCP level - they route new TCP connections. gRPC uses HTTP/2, which multiplexes many RPCs over a single TCP connection. If you have 10 gRPC server replicas behind an L4 load balancer, a client opens one TCP connection, that connection is routed to replica #3, and all subsequent RPCs from that client go to replica #3 regardless of load. Replicas 1, 2, 4-10 receive no traffic from that client.

The correct solution is an L7 load balancer that understands HTTP/2 frames and routes at the RPC level (not the connection level). In Kubernetes:

Option 1: Use an NGINX Ingress with nginx.ingress.kubernetes.io/backend-protocol: "GRPC". NGINX terminates the HTTP/2 connection and creates separate upstream connections, distributing individual RPCs across replicas.

Option 2: Use Istio with an Envoy sidecar. Envoy proxies all traffic at the application layer and distributes HTTP/2 streams round-robin or by least-requests across healthy replicas. Istio also adds mTLS, circuit breaking, and distributed tracing automatically.

Option 3: Client-side load balancing with DNS discovery. The gRPC channel resolves the DNS name to multiple A records (one per replica), and the client distributes RPCs directly. No proxy needed, but requires DNS-aware gRPC channel configuration.
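
A sketch of option 3 in Python: the service config channel option asks gRPC to round-robin individual RPCs over all addresses the dns:/// target resolves to. The hostname assumes a headless Kubernetes Service so each pod gets its own A record.

import json

import grpc

service_config = json.dumps({
    "loadBalancingConfig": [{"round_robin": {}}],
})

channel = grpc.insecure_channel(
    "dns:///inference-service.default.svc.cluster.local:50051",
    options=[("grpc.service_config", service_config)],
)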

Q6: What protobuf schema changes are backward compatible and which will break production systems?

Backward compatible changes (safe to deploy without coordinated rollout):

  • Adding a new field with a new field number. Old deserializers ignore unknown fields. New fields get their default values when reading old messages.
  • Renaming a field. The wire format uses field numbers, not names. Renaming has no effect on binary compatibility.
  • Changing a singular string, bytes, or message field to a repeated field (in proto3). Old deserializers treat the last value of the repeated field as the singular value. This is not safe for numeric types, whose repeated form may use packed encoding that a singular reader cannot parse.
  • Adding a new enum value. Old code ignores unknown enum values.

Breaking changes (require coordinated rollout or versioned API):

  • Changing a field number. Binary messages encoded with the old number will be misinterpreted by code expecting the new number.
  • Changing a field type incompatibly (e.g., int32 to string). The wire type changes, causing parse errors.
  • Removing a field and then reusing its field number for a different field. Old serialized messages with the old field will be misinterpreted.
  • Changing a proto2 required field (reason enough to avoid required fields entirely).

The safe protocol for removing a field: mark it reserved in the proto file. This prevents the field number from being accidentally reused and prevents the field name from being used in new code, while keeping full backward read compatibility.
