

REST vs gRPC for ML Model Serving

The Production Scenario

It's 2:00 AM and your pager fires. The on-call runbook says: "Recommendation service p99 latency spiked from 35ms to 210ms." You join the incident call. The inference service is fine. The model is fine. GPU utilization is fine. The problem is the call between your product backend and your model serving layer - they're going through a REST API with JSON serialization, and someone recently doubled the feature vector size to improve recommendation quality. Each request is now serializing and deserializing a 4,000-dimensional float array as JSON text. That's 180ms of serialization overhead for a model that takes 30ms to actually run.

This is not a hypothetical. It happens in every organization that grows a model serving layer without thinking carefully about protocol choice. JSON is human-readable, which is wonderful for debugging at 2:00 PM and actively harmful at 2:00 AM when you're explaining to your VP why the homepage is slow.

The engineers who avoid this incident are the ones who understand what REST and gRPC actually do under the hood - not as abstract concepts, but as concrete mechanisms with measurable costs. They know that gRPC with Protocol Buffers is 5 to 10 times faster than REST with JSON for the specific pattern of "send a dense numerical tensor, receive a probability distribution," and they made that choice before the feature vector doubled.

This lesson gives you the mental model to make that choice correctly the first time. We will start with how each protocol actually works, build an understanding of where the performance difference comes from, and end with a production decision framework you can apply to any ML serving architecture.

Why This Exists - The Problem with Raw HTTP for ML

Before gRPC existed, every internal service spoke HTTP with JSON. This made sense when services exchanged small structured payloads - a user ID here, a configuration object there. JSON is text. Text is readable. If something goes wrong you open Wireshark or check your logs and you can see exactly what was sent.

The problem with ML workloads is that the data is not small and not primarily structured. A typical recommendation system input might be:

  • A user embedding vector: 256 floats
  • Item candidate embeddings: 100 items x 64 floats = 6,400 floats
  • Contextual features: 50 floats
  • Categorical features encoded as integers: 30 values

Total: roughly 6,700 numerical values per request. In JSON, a 32-bit float serializes to something like 0.7834291 - that is 9 characters, 9 bytes. So 6,700 floats becomes roughly 60 KB of JSON text per request, plus the overhead of parsing that text into Python floats on the server side.

The same data in Protocol Buffers - gRPC's wire format - encodes each float in exactly 4 bytes. 6,700 floats = 26.8 KB, with near-zero parsing overhead on the receiving end because the bytes map directly to native float arrays. At 10,000 requests per second, the difference between these two approaches is the difference between needing 8 serving machines and needing 3.
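The size estimate above can be checked with a few lines of standard-library Python. This is a rough sketch, not a protobuf measurement: the values are random, they are assumed to be printed with 7 decimal places (like 0.7834291), and `struct` stands in for the 4-byte float encoding.

```python
import json
import random
import struct

# Value counts taken from the example request above.
counts = {
    "user_embedding": 256,
    "item_embeddings": 100 * 64,
    "context_features": 50,
    "categorical_features": 30,
}
n_values = sum(counts.values())

random.seed(0)
# Assumption: each value is printed with 7 decimal places.
values = [round(random.random(), 7) for _ in range(n_values)]

# Compact JSON: digits plus one comma per value.
json_bytes = len(json.dumps({"features": values}, separators=(",", ":")).encode("utf-8"))
# Raw little-endian float32: exactly 4 bytes per value.
binary_bytes = len(struct.pack(f"<{n_values}f", *values))

print(f"values: {n_values}")
print(f"JSON:   {json_bytes / 1000:.1f} KB")
print(f"binary: {binary_bytes / 1000:.1f} KB")
```

The binary size is exact; the JSON size depends on how many digits the serializer emits, so treat it as an estimate.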

This is the problem gRPC was built to solve: fast, typed, efficient communication between internal services where the payload is dense numerical data and human readability is not a requirement.

Historical Context

REST (Representational State Transfer) was defined by Roy Fielding in his 2000 PhD dissertation at UC Irvine. It was an architectural style for distributed hypermedia systems - designed around the idea that any client that spoke HTTP could interact with any server, without prior coordination. The stateless constraint and, in later practice, human-readable JSON payloads made REST ideal for public APIs: you could hand a developer a URL and they could call it from curl immediately.

gRPC was developed at Google and open-sourced in 2015. It grew out of an internal system called Stubby, which Google had been using since the early 2000s to handle billions of inter-service calls per second. Google's internal services exchange dense structured data - protobufs were already their standard wire format. gRPC took Stubby's design, rebuilt it on HTTP/2, added the proto3 schema language, and released it as an open standard.

The key insight that drove gRPC's design: for internal service-to-service communication, human readability is irrelevant. What matters is throughput, latency, and strong typing to catch API contract violations at compile time rather than at 2:00 AM in production.

Core Concepts

How REST Works

REST over HTTP/1.1 is a request-response protocol where each request gets its own TCP connection (or a connection from a pool). The message format is text:

POST /predict HTTP/1.1
Host: model-service:8080
Content-Type: application/json
Content-Length: 284

{"features": [0.12, 0.87, 0.34, ..., 0.91], "model_version": "v3"}

The server reads this text, parses JSON into a Python dictionary, extracts the features list, converts it to a numpy array, runs inference, converts the result back to a Python list, serializes to JSON, and sends the text response back. Every step involving text parsing and type conversion is pure overhead.

How gRPC Works

gRPC uses HTTP/2 as its transport and Protocol Buffers as its serialization format. You define your service interface in a .proto file:

syntax = "proto3";

service ModelService {
  rpc Predict (PredictRequest) returns (PredictResponse);
  rpc PredictStream (PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  repeated float features = 1;
  string model_version = 2;
}

message PredictResponse {
  repeated float probabilities = 1;
  float latency_ms = 2;
}

The .proto file is compiled into client and server stubs in any supported language. The wire encoding for a list of floats is a field tag, a length prefix, and raw IEEE 754 bytes. There is no text parsing and no string conversion: on a little-endian machine, a native decoder can copy the bytes straight into a float array.

HTTP/2 adds multiplexing - multiple gRPC calls can share a single TCP connection. This eliminates the connection establishment overhead (TCP handshake + TLS handshake) that plagues HTTP/1.1 at high request rates.

The Performance Gap Quantified

For a request carrying a 512-dimensional float vector:

| Metric | REST + JSON | gRPC + Protobuf |
| --- | --- | --- |
| Payload size | ~4.5 KB | ~2.1 KB |
| Serialization time | ~0.8 ms | ~0.05 ms |
| Deserialization time | ~1.2 ms | ~0.04 ms |
| Connection overhead | ~2 ms per new connection | ~0 (multiplexed) |
| CPU per 10K req/s | ~18 cores | ~3 cores |

At 10,000 requests per second with a 512-dim input, JSON serialization alone consumes roughly 18 CPU cores just for encoding and decoding. The same workload in gRPC uses 3 cores. The remaining 15 cores can handle more traffic or be eliminated from your compute bill.
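Numbers like these depend heavily on hardware, language, and library, so it is worth measuring on your own stack. The micro-benchmark below is a minimal sketch using only the standard library: `struct` stands in for protobuf's raw float32 encoding (it is not the protobuf library), and the feature vector is a made-up placeholder.

```python
import json
import struct
import timeit

DIM = 512
# Hypothetical feature vector; i / 512 is exactly representable in float32.
features = [i / DIM for i in range(DIM)]


def json_roundtrip(vec):
    """Encode to JSON text and parse it back."""
    return json.loads(json.dumps(vec))


def binary_roundtrip(vec):
    """Pack as little-endian float32 and unpack again."""
    packed = struct.pack(f"<{len(vec)}f", *vec)
    return struct.unpack(f"<{len(vec)}f", packed)


json_size = len(json.dumps(features))
binary_size = len(struct.pack(f"<{DIM}f", *features))

runs = 2000
json_ms = timeit.timeit(lambda: json_roundtrip(features), number=runs) / runs * 1000
binary_ms = timeit.timeit(lambda: binary_roundtrip(features), number=runs) / runs * 1000

print(f"JSON:   {json_size} bytes, {json_ms:.4f} ms per round trip")
print(f"binary: {binary_size} bytes, {binary_ms:.4f} ms per round trip")
```

Absolute timings from this script will not match the table, which describes a full serving path; the point is the relative gap between text and binary encoding on the same machine.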

Protocol Buffers Deep Dive

Understanding protobuf encoding explains why it is fast. Consider encoding the integer 150 in field number 1:

Binary: 0x08 0x96 0x01

Three bytes. No quotes. No commas. No field name string. The field number (1) and wire type (0, varint) are packed into the first byte as (1 << 3) | 0 = 0x08. The value 150 uses variable-length encoding: each byte carries 7 bits of the value plus a continuation bit, so values under 128 fit in 1 byte, values under 16,384 fit in 2 bytes, and so on. 150 therefore takes 2 bytes: 0x96 0x01.

For floats specifically, protobuf uses wire type 5 (32-bit), which encodes each float as exactly 4 bytes of IEEE 754 binary. A repeated float field encodes all values in a packed format with one field tag prefix for the entire array - not one tag per element. This is why protobuf is so efficient for feature vectors.
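Both encodings can be reproduced by hand. The sketch below is a hand-rolled illustration of the wire format, not the protobuf library; the helper names `encode_varint`, `encode_tag`, and `encode_packed_floats` are made up for this example.

```python
import struct

WIRE_VARINT = 0  # wire type for int32, int64, bool, enum
WIRE_LEN = 2     # wire type for strings, bytes, and packed repeated fields


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 value bits per byte, high bit set if more bytes follow."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set
        else:
            out.append(low7)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A tag is the varint of (field_number << 3) | wire_type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_packed_floats(field_number: int, values: list[float]) -> bytes:
    """One tag, one length prefix, then raw little-endian float32 values."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return encode_tag(field_number, WIRE_LEN) + encode_varint(len(payload)) + payload


int_message = encode_tag(1, WIRE_VARINT) + encode_varint(150)
print(int_message.hex(" "))  # 08 96 01

vector = encode_packed_floats(1, [0.5] * 512)
print(len(vector))           # 2051 = 1 tag byte + 2 length bytes + 2048 data bytes
print(vector[:3].hex(" "))   # 0a 80 10
```

The fixed cost for the whole 512-element vector is 3 bytes, which is why the per-element cost stays at 4 bytes regardless of vector length.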

import struct

# JSON encoding of a float - variable length string
value = 0.12345678
json_encoded = str(value) # "0.12345678" - 10 bytes, variable length
print(len(json_encoded)) # 10

# Protobuf encoding of a float (wire type 5) - always 4 bytes
proto_encoded = struct.pack('<f', value)
print(len(proto_encoded)) # 4

# For a 512-dim vector:
dim = 512
json_size = dim * 10 # rough estimate: ~5.1 KB
proto_size = dim * 4 # exact: 2.0 KB
print(f"JSON: {json_size} bytes, Proto: {proto_size} bytes")
# JSON: 5120 bytes, Proto: 2048 bytes

Implementation: REST Model Serving

# rest_server.py
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
import time
import uvicorn

app = FastAPI()


class FakeModel:
    def predict(self, features: np.ndarray) -> np.ndarray:
        time.sleep(0.005)  # 5ms inference simulation
        return np.random.dirichlet(np.ones(10)).astype(np.float32)


model = FakeModel()


class PredictRequest(BaseModel):
    features: list[float]
    model_version: str = "v1"


class PredictResponse(BaseModel):
    probabilities: list[float]
    latency_ms: float


# Plain `def`, not `async def`: model.predict blocks, so FastAPI runs this
# handler in its threadpool instead of stalling the event loop.
@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest):
    start = time.perf_counter()

    # JSON already deserialized by FastAPI/Pydantic into a Python list
    # This allocation from list to ndarray happens on every request
    features = np.array(request.features, dtype=np.float32)

    probabilities = model.predict(features)

    latency_ms = (time.perf_counter() - start) * 1000

    return PredictResponse(
        probabilities=probabilities.tolist(),
        latency_ms=latency_ms,
    )


if __name__ == "__main__":
    # uvicorn needs an import string, not the app object, when workers > 1
    uvicorn.run("rest_server:app", host="0.0.0.0", port=8080, workers=4)

# rest_client.py - benchmark REST endpoint
import requests
import numpy as np
import time
import statistics


def benchmark_rest(
    n_requests: int = 1000,
    feature_dim: int = 512,
    host: str = "localhost:8080",
):
    url = f"http://{host}/predict"
    features = np.random.randn(feature_dim).tolist()
    payload = {"features": features, "model_version": "v1"}

    # Warmup
    for _ in range(10):
        requests.post(url, json=payload)

    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        response = requests.post(url, json=payload)
        response.raise_for_status()
        latency = (time.perf_counter() - start) * 1000
        latencies.append(latency)

    latencies.sort()
    print(f"REST Results ({n_requests} requests, dim={feature_dim}):")
    print(f" p50: {latencies[int(n_requests * 0.50)]:.1f}ms")
    print(f" p95: {latencies[int(n_requests * 0.95)]:.1f}ms")
    print(f" p99: {latencies[int(n_requests * 0.99)]:.1f}ms")
    print(f" mean: {statistics.mean(latencies):.1f}ms")


benchmark_rest()

Implementation: gRPC Model Serving

First, define the proto schema:

// model_service.proto
syntax = "proto3";

package modelservice;

service ModelService {
  // Unary RPC: one request, one response
  rpc Predict (PredictRequest) returns (PredictResponse);

  // Server streaming: one request, stream of responses (LLM tokens)
  rpc PredictStream (PredictRequest) returns (stream PredictResponse);

  // Bidirectional streaming: useful for batch processing pipelines
  rpc PredictBatch (stream PredictRequest) returns (stream PredictResponse);
}

message PredictRequest {
  repeated float features = 1 [packed = true];
  string model_version = 2;
  string request_id = 3;
}

message PredictResponse {
  repeated float probabilities = 1 [packed = true];
  float latency_ms = 2;
  string request_id = 3;
  string model_version = 4;
}

Generate stubs: python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. model_service.proto

Server implementation:

# grpc_server.py
import grpc
import numpy as np
import time
from concurrent import futures

import model_service_pb2
import model_service_pb2_grpc


class FakeModel:
    def predict(self, features: np.ndarray) -> np.ndarray:
        time.sleep(0.005)
        return np.random.dirichlet(np.ones(10)).astype(np.float32)


class ModelServiceServicer(model_service_pb2_grpc.ModelServiceServicer):
    def __init__(self):
        self.model = FakeModel()

    def Predict(self, request, context):
        start = time.perf_counter()

        # Protobuf repeated float maps directly to a Python list of floats
        # No JSON parsing - just read the bytes
        features = np.array(request.features, dtype=np.float32)

        probabilities = self.model.predict(features)
        latency_ms = (time.perf_counter() - start) * 1000

        return model_service_pb2.PredictResponse(
            probabilities=probabilities.tolist(),
            latency_ms=latency_ms,
            request_id=request.request_id,
            model_version=request.model_version,
        )

    def PredictStream(self, request, context):
        """Server streaming - yields partial results, one at a time.
        Perfect pattern for LLM token-by-token generation."""
        features = np.array(request.features, dtype=np.float32)

        for i in range(10):
            time.sleep(0.002)  # 2ms per token
            token_probs = np.random.dirichlet(np.ones(5)).astype(np.float32)
            yield model_service_pb2.PredictResponse(
                probabilities=token_probs.tolist(),
                latency_ms=2.0,
                request_id=request.request_id,
            )

    def PredictBatch(self, request_iterator, context):
        """Bidirectional streaming - process each request and stream back.
        Allows the client to pipeline requests without waiting for each response."""
        for request in request_iterator:
            features = np.array(request.features, dtype=np.float32)
            probs = self.model.predict(features)
            yield model_service_pb2.PredictResponse(
                probabilities=probs.tolist(),
                latency_ms=5.0,
                request_id=request.request_id,
            )


def serve(port: int = 50051):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=10),
        options=[
            ("grpc.max_send_message_length", 50 * 1024 * 1024),
            ("grpc.max_receive_message_length", 50 * 1024 * 1024),
        ],
    )
    model_service_pb2_grpc.add_ModelServiceServicer_to_server(
        ModelServiceServicer(), server
    )
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    print(f"gRPC server listening on port {port}")
    server.wait_for_termination()


if __name__ == "__main__":
    serve()

Client with benchmark and streaming demo:

# grpc_client.py
import grpc
import numpy as np
import time
import statistics

import model_service_pb2
import model_service_pb2_grpc


def get_channel(host: str = "localhost:50051") -> grpc.Channel:
    """Create a reusable gRPC channel with production settings."""
    return grpc.insecure_channel(
        host,
        options=[
            # Client-side load balancing (requires headless DNS)
            ("grpc.lb_policy_name", "round_robin"),
            # Keepalive: prevent firewalls from closing idle connections
            ("grpc.keepalive_time_ms", 10_000),
            ("grpc.keepalive_timeout_ms", 5_000),
            ("grpc.keepalive_permit_without_calls", True),
            # Message size limits
            ("grpc.max_send_message_length", 50 * 1024 * 1024),
            ("grpc.max_receive_message_length", 50 * 1024 * 1024),
        ],
    )


def benchmark_grpc(n_requests: int = 1000, feature_dim: int = 512):
    channel = get_channel()
    stub = model_service_pb2_grpc.ModelServiceStub(channel)

    features = np.random.randn(feature_dim).astype(np.float32).tolist()

    # Warmup
    for _ in range(10):
        stub.Predict(model_service_pb2.PredictRequest(
            features=features, model_version="v1", request_id="warmup"
        ))

    latencies = []
    for i in range(n_requests):
        request = model_service_pb2.PredictRequest(
            features=features,
            model_version="v1",
            request_id=f"req-{i}",
        )
        start = time.perf_counter()
        response = stub.Predict(request, timeout=0.5)  # 500ms deadline
        latency = (time.perf_counter() - start) * 1000
        latencies.append(latency)

    channel.close()

    latencies.sort()
    print(f"gRPC Results ({n_requests} requests, dim={feature_dim}):")
    print(f" p50: {latencies[int(n_requests * 0.50)]:.1f}ms")
    print(f" p95: {latencies[int(n_requests * 0.95)]:.1f}ms")
    print(f" p99: {latencies[int(n_requests * 0.99)]:.1f}ms")
    print(f" mean: {statistics.mean(latencies):.1f}ms")


def demo_streaming(feature_dim: int = 512):
    """Show token-by-token streaming - the LLM generation pattern."""
    channel = get_channel()
    stub = model_service_pb2_grpc.ModelServiceStub(channel)

    features = np.random.randn(feature_dim).astype(np.float32).tolist()
    request = model_service_pb2.PredictRequest(
        features=features,
        model_version="v1",
        request_id="stream-demo",
    )

    print("\nStreaming tokens:")
    for i, response in enumerate(stub.PredictStream(request)):
        top_token = np.argmax(response.probabilities)
        top_prob = max(response.probabilities)
        print(f" Token {i:02d}: class={top_token}, prob={top_prob:.4f}")

    channel.close()


if __name__ == "__main__":
    benchmark_grpc()
    demo_streaming()

Architecture: Request Flow Comparison

gRPC Load Balancing for ML Services

HTTP/2 multiplexing creates a subtle load balancing problem. A layer-4 load balancer distributes TCP connections. Because gRPC multiplexes many requests onto a single TCP connection, a layer-4 LB routes all requests from one client to one backend - defeating load balancing.
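A toy simulation makes the skew concrete. This is a sketch with no networking involved: the backend names are hypothetical, and each function only counts where calls would land under the two balancing strategies.

```python
from collections import Counter
from itertools import cycle

BACKENDS = ["pod-a", "pod-b", "pod-c"]  # hypothetical backend names


def l4_balance(n_clients: int, calls_per_client: int) -> Counter:
    """Layer 4: each client's single connection is pinned to one backend."""
    load = Counter()
    picker = cycle(BACKENDS)
    for _ in range(n_clients):
        backend = next(picker)  # chosen once, at connection time
        load[backend] += calls_per_client
    return load


def per_call_balance(n_clients: int, calls_per_client: int) -> Counter:
    """Layer 7 or client-side: every RPC picks the next backend."""
    load = Counter()
    picker = cycle(BACKENDS)
    for _ in range(n_clients * calls_per_client):
        load[next(picker)] += 1
    return load


print(l4_balance(1, 9000))        # Counter({'pod-a': 9000})
print(per_call_balance(1, 9000))  # Counter({'pod-a': 3000, 'pod-b': 3000, 'pod-c': 3000})
```

With many clients the layer-4 case evens out somewhat, but a handful of heavy callers (the usual shape of internal ML traffic) still pins most of the load to a few backends.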

The solutions:

1. Client-side round-robin (recommended for Kubernetes):

# Use DNS that resolves to individual pod IPs (headless service)
channel = grpc.insecure_channel(
    "dns:///model-service.default.svc.cluster.local:50051",
    options=[("grpc.lb_policy_name", "round_robin")],
)

# kubernetes/headless-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  clusterIP: None  # Headless: DNS returns all pod IPs
  selector:
    app: model-service
  ports:
    - name: grpc
      port: 50051

2. Envoy proxy (for mixed traffic or browser clients):

Envoy understands HTTP/2 frames and can distribute individual gRPC calls across backends. It also translates between gRPC-Web (browser-compatible) and native gRPC, making it the standard choice when you have both browser and internal-service clients.

How TensorFlow Serving Uses gRPC

TensorFlow Serving exposes both REST and gRPC. Its gRPC service is the production path for high-throughput workloads:

# Calling TensorFlow Serving via gRPC
import grpc
import numpy as np
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import tensorflow as tf


def predict_with_tfserving(
    model_name: str,
    features: np.ndarray,
    host: str = "localhost:8500",
) -> np.ndarray:
    channel = grpc.insecure_channel(host)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Build request using TF Serving's proto schema
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"

    # Convert numpy array to TensorProto
    input_tensor = tf.make_tensor_proto(
        features[np.newaxis, :],  # add batch dimension
        dtype=tf.float32,
    )
    # "input_1" and "output_0" depend on the exported model's signature;
    # replace them with the tensor names your SavedModel actually uses.
    request.inputs["input_1"].CopyFrom(input_tensor)

    # Make the call with a 500ms deadline
    response = stub.Predict(request, timeout=0.5)

    # Extract output tensor
    output = tf.make_ndarray(response.outputs["output_0"])
    channel.close()
    return output[0]  # remove batch dimension

Decision Framework

REST wins when: the client is a browser, the API is public-facing, the payload is small (under 1KB), request rate is modest (under 1,000/sec), or maximum debuggability with curl and standard HTTP tools is required.

gRPC wins when: the call is internal service-to-service, feature vectors or embeddings are large, request rate is high (thousands per second), bidirectional streaming is needed (LLM tokens, audio), or strong compile-time API contracts matter.
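The two lists above can be written down as a small rule function. This is only a sketch of the framework: the `Workload` type and `choose_protocol` name are made up here, the thresholds are the ones quoted above, and the order in which the rules are checked is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    browser_client: bool
    public_api: bool
    payload_bytes: int
    requests_per_second: int
    needs_streaming: bool = False


def choose_protocol(w: Workload) -> str:
    """Apply the REST-vs-gRPC rules of thumb in a fixed (assumed) order."""
    # Browser or public-facing clients: reach and debuggability come first.
    if w.browser_client or w.public_api:
        return "REST"
    # Internal callers that need streaming (LLM tokens, audio).
    if w.needs_streaming:
        return "gRPC"
    # Small payloads at modest rates: REST overhead is not worth optimizing.
    if w.payload_bytes < 1_000 and w.requests_per_second < 1_000:
        return "REST"
    # Internal, large or frequent: dense tensors at high request rates.
    return "gRPC"


print(choose_protocol(Workload(False, False, 27_000, 10_000)))  # gRPC
print(choose_protocol(Workload(True, True, 300, 50)))           # REST
```

Real decisions also weigh team familiarity and existing tooling, which a function like this cannot capture.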

Production Engineering Notes

Health checks: gRPC has a standard health check protocol that Kubernetes probes should use:

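from grpc_health.v1 import health, health_pb2, health_pb2_grpc

health_servicer = health.HealthServicer()
health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
# "" reports overall server health, which is what a probe checks when it
# does not name a service
health_servicer.set("", health_pb2.HealthCheckResponse.SERVING)
# Per-service status uses the fully qualified name: <package>.<Service>
health_servicer.set("modelservice.ModelService", health_pb2.HealthCheckResponse.SERVING)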

Observability interceptors:

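import grpc
import time
import logging


class LatencyInterceptor(grpc.ServerInterceptor):
    """Log wall-clock latency of unary-unary RPCs."""

    def intercept_service(self, continuation, handler_call_details):
        # continuation() only looks up the handler; it does not run the RPC,
        # so the timing has to wrap the handler's behavior.
        handler = continuation(handler_call_details)
        if handler is None or handler.unary_unary is None:
            return handler  # unknown method or streaming RPC: pass through
        method = handler_call_details.method
        inner = handler.unary_unary

        def timed(request, context):
            start = time.perf_counter()
            try:
                return inner(request, context)
            finally:
                latency_ms = (time.perf_counter() - start) * 1000
                logging.info(f"gRPC {method} completed in {latency_ms:.1f}ms")

        return grpc.unary_unary_rpc_method_handler(
            timed,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )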

Always set deadlines on client calls:

try:
    response = stub.Predict(request, timeout=0.5)  # 500ms hard deadline
except grpc.RpcError as e:
    if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        response = serve_cached_fallback()
    else:
        raise

:::warning gRPC and Browser Clients Browsers cannot speak native gRPC - browser JavaScript APIs do not expose HTTP/2 trailers, which gRPC requires for status codes. If your client is a browser, you need either gRPC-Web (with an Envoy proxy translating between gRPC-Web and gRPC) or a REST-to-gRPC gateway. Do not assume gRPC everywhere just because it is faster. :::

:::danger The JSON Float Precision Trap When you round-trip floats through JSON - numpy float32 to Python float to JSON string to JSON parse to Python float to numpy float32 - you can lose precision. The JSON spec does not guarantee lossless float round-trips: the result depends on how many digits the serializer emits and how the parser on the other end reads them, and NaN and Infinity are not valid JSON at all. For model inputs and outputs that require exact float32 representation, this can be a silent correctness bug. gRPC binary float encoding is lossless. If you must use REST and precision matters, encode raw bytes: import base64; base64.b64encode(features.tobytes()).decode() and decode on the server side. :::
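The difference can be demonstrated with the standard library alone. In this sketch `struct` stands in for a numpy float32 array, the three values are arbitrary examples, and the 4-digit rounding simulates a serializer configured to emit short decimals.

```python
import base64
import json
import struct

# Hypothetical float32 features, held as raw little-endian bytes.
original = struct.pack("<3f", 0.1, 1.0 / 3.0, 2.5)
values = struct.unpack("<3f", original)  # float32 values widened to Python floats

# Python's json emits enough digits to round-trip, so this path is exact.
full = json.loads(json.dumps(values))
full_is_exact = struct.pack("<3f", *full) == original

# A serializer that rounds to 4 decimals changes the float32 bit pattern.
truncated = json.loads(json.dumps([round(v, 4) for v in values]))
truncated_is_exact = struct.pack("<3f", *truncated) == original

# Base64 of the raw bytes is exact regardless of any text formatting choices.
encoded = base64.b64encode(original).decode("ascii")
decoded = base64.b64decode(encoded)

print(full_is_exact)        # True
print(truncated_is_exact)   # False
print(decoded == original)  # True
```

Whether the JSON path is exact therefore depends on every serializer and parser in the chain, while the base64 path depends on none of them.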

:::danger Not Reusing gRPC Channels Creating a new gRPC channel per request is as harmful as a new HTTP connection per request - gRPC channel creation involves TLS negotiation and HTTP/2 settings exchange. Create the channel at application startup and reuse it for the lifetime of the process. Use a channel pool if you need isolation between request contexts. :::

:::warning Forgetting Message Size Limits gRPC's default maximum message size is 4MB. A batch of 100 requests with 1,024-dim embeddings is 100 * 1024 * 4 bytes = 400KB - fine. Increase batch size or embedding dimension later without updating the limit and you get RESOURCE_EXHAUSTED: Received message larger than max in production. Set explicit limits during initial setup on both client and server. :::
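A quick pre-flight check of the arithmetic is cheap to keep next to the batching code. The sketch below counts only the raw float32 data, so it is a lower bound on the real message size; the function names are made up for this example.

```python
DEFAULT_MAX_MESSAGE_BYTES = 4 * 1024 * 1024  # gRPC's default receive limit


def batch_payload_bytes(batch_size: int, embedding_dim: int, bytes_per_value: int = 4) -> int:
    """Lower bound: raw float32 data only, ignoring protobuf tags and other fields."""
    return batch_size * embedding_dim * bytes_per_value


def fits_default_limit(batch_size: int, embedding_dim: int) -> bool:
    """True if the raw data alone stays under the default limit."""
    return batch_payload_bytes(batch_size, embedding_dim) < DEFAULT_MAX_MESSAGE_BYTES


print(batch_payload_bytes(100, 1024))  # 409600
print(fits_default_limit(100, 1024))   # True
print(fits_default_limit(1024, 1024))  # False: the data alone is 4,194,304 bytes
```

Because tags, length prefixes, and string fields add to the total, leave headroom rather than sizing batches right up to the limit.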

Interview Q&A

Q: What is the primary performance advantage of gRPC over REST for ML workloads?

The advantage has two components. First, Protocol Buffers encode data in binary format - floats as raw 4-byte IEEE 754, not decimal strings. Serializing a 512-dim float vector takes roughly 0.05ms in protobuf versus 0.8ms in JSON, about a 16x speedup for just the encoding step. Second, gRPC uses HTTP/2, which multiplexes many requests over a single TCP connection, eliminating the per-request TCP handshake overhead and allowing thousands of concurrent in-flight requests on one channel. Combined, these make gRPC 5-10x more efficient for the dense-tensor payloads typical in ML inference.

Q: When would you choose REST over gRPC for a model serving API?

Choose REST when: the client is a browser (gRPC requires proxies for browser support), the API is public-facing and developer experience is prioritized over throughput, payloads are small and request rates are modest (under roughly 1,000 per second), or you need maximum debuggability with curl and standard HTTP tools. REST's simplicity has genuine value in early-stage systems or external-facing APIs where the consumer base is broad and varied.

Q: How does gRPC handle load balancing differently from REST?

HTTP/1.1 opens one connection per request or reuses pool members, so layer-4 load balancers distributing TCP connections work well. gRPC uses HTTP/2 which multiplexes all calls onto a single long-lived connection. A layer-4 LB sees one connection and routes all traffic to one backend - defeating load balancing entirely. The solution is either client-side load balancing (gRPC's built-in round-robin policy resolves DNS to individual pod IPs and distributes calls) or a layer-7 proxy like Envoy that understands HTTP/2 frames and distributes individual RPC calls across backends.

Q: How would you handle adding a new feature to your ML model's input without breaking existing clients?

Protocol Buffers are designed for backward and forward compatibility. Adding a new field with a new field number is safe - never reuse field numbers. Old clients will send requests without the new field, and the server receives the default value (0 for numbers, empty string for strings). Old servers receive requests with the new field and ignore it. Deploy the new server first, then roll out clients that send the new field, with no coordination required. This is exactly the schema evolution pattern needed for iterative ML feature development.

Q: How does gRPC streaming enable the LLM token-by-token response pattern?

In server-side gRPC streaming, the client sends one request and the server sends a stream of responses, each arriving as soon as it is ready. For LLMs, each generated token becomes one streaming response message. The client reads from the stream iterator and renders each token as it arrives - this incremental delivery is what makes a chat interface appear to type rather than delivering the full response after a long wait. REST alternatives are either polling (latency plus overhead) or Server-Sent Events (server-to-client only, with no bidirectional support). gRPC streaming is cleaner, more efficient, and supports both server streaming (one request, many tokens) and bidirectional streaming (interactive multi-turn sessions).

Q: A model serving endpoint has p99 latency of 200ms. Profiling shows 5ms for model inference and 195ms for overhead. What would you investigate first?

I would profile the serialization and deserialization path. If the endpoint uses REST with JSON and large feature vectors, JSON encode and decode is the likely culprit. I would measure: time to deserialize the request body, time to convert the Python list to a numpy array, and time to serialize the response. If serialization dominates, migrating to gRPC with Protobuf can reduce that overhead by 20x. If serialization is not the bottleneck, I would investigate network round-trip time (co-location issue between caller and model service), Python GIL contention (multiple threads contending for the interpreter lock), or cold model loading if the model is not kept warm between requests.
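The per-stage measurement described above can be sketched with a small timing helper. Everything here is illustrative: `stage` and `timings_ms` are made-up names, the request body is synthetic, and `array("f", ...)` stands in for the list-to-numpy conversion so the example needs only the standard library.

```python
import json
import time
from array import array
from contextlib import contextmanager

timings_ms = {}


@contextmanager
def stage(name: str):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[name] = (time.perf_counter() - start) * 1000


# Hypothetical request body: a 4,000-dimensional feature vector as JSON text.
body = json.dumps({"features": [i * 0.001 for i in range(4000)]})

with stage("deserialize_request"):
    payload = json.loads(body)
with stage("list_to_array"):
    features = array("f", payload["features"])  # stand-in for np.array(..., dtype=np.float32)
with stage("serialize_response"):
    response = json.dumps({"probabilities": [0.1] * 10})

# Print the slowest stage first.
for name, ms in sorted(timings_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:>20}: {ms:.3f} ms")
```

In a real service the same helper would wrap the actual handler stages, and the timings would go to a metrics system rather than stdout.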

© 2026 EngineersOfAI. All rights reserved.