Skip to main content

:::tip 🎮 Interactive Playground Visualize this concept: Try the ML Microservices demo on the EngineersOfAI Playground - no code required. :::

Microservices for ML Systems

The Monolith Breaking Point

Your ML platform started as a single Flask application. Twelve months ago that was fine - one data scientist, one model, one deployment per week. The app loaded the model at startup, read features from Postgres, and returned predictions. Simple. Fast. Easy to reason about.

Today you have nine data scientists, four production models, three feature engineering pipelines that all share state, and a monitoring script that someone bolted onto the side of the app because "it was easiest." The deploy process takes ninety minutes because every model gets redeployed together, even if only one changed. Last Tuesday, a feature engineering bug in the user pipeline crashed predictions for all four models simultaneously - even the two models that do not use user features at all. The fraud model was down during peak transaction hours because the recommendation feature pipeline hit an OOM error.

This is the moment that drives teams toward microservices. Not ambition, not architectural purity - pain. The monolith has become a liability. Changes in one part of the system cause failures in unrelated parts. Deployments are coupled when they should be independent. Scaling one component requires scaling everything.

Microservices for ML is not about making the system more complicated. It is about making the failure domains smaller, the deployment units independent, and the team ownership clear. But microservices come with a real tax: distributed systems are harder to debug, network calls fail in ways function calls do not, and operational complexity grows with every new service boundary. This lesson teaches you where those boundaries should go, how to build across them safely, and when the monolith was actually the right answer.


Why This Exists: Monolith vs Microservices for ML

The Monolith's Strengths

A monolith is fast to build. One repository, one deploy, one place to look when something breaks. For a team of fewer than five people building fewer than three models, a monolith is almost always the right answer. The speed of iteration beats the flexibility of decomposition.

The monolith also has zero network latency between components. Your feature engineering code and your model inference code run in the same process. There is no serialization, no network hop, no timeout. A function call in Python takes microseconds. An HTTP call to a sidecar service takes milliseconds.

When the Monolith Breaks Down

The monolith breaks down along four axes in ML systems:

Independent deployability: Your recommendation model needs to be updated three times per day. Your fraud model is updated once per week. In a monolith, every recommendation update requires testing and redeploying the fraud model too. This slows iteration and introduces coupling between teams.

Independent scaling: Feature retrieval is memory-bound and needs 32 high-memory nodes. Model inference is compute-bound and needs 8 GPU nodes. In a monolith, you either over-provision both or under-provision one.

Failure isolation: A memory leak in your real-time feature pipeline should not take down your batch scoring service. In a monolith, they share the same process. One OOM kills everything.

Team ownership: As your ML platform grows, separate teams own separate components. The feature team should be able to update the feature computation logic without touching the model team's inference code. Shared codebases make this hard.


ML-Specific Microservices

The key insight for ML system decomposition is that each major function in the ML lifecycle has a different operational profile - different scaling requirements, different update frequencies, different SLAs - and should therefore be a separate service.

Feature Service

The feature service is the most critical and most underrated component in an ML microservices architecture. Its job: given a set of entity IDs (user_id, item_id, session_id), return the feature vector in under 5 milliseconds.

The feature service sits in front of your feature store (typically Redis for low-latency lookups + DynamoDB or Cassandra for durability). It handles batching (fetching features for 200 users in one round-trip to Redis), caching (serving the same user's features multiple times without repeated Redis calls), and schema validation (ensuring features match the schema the model was trained on).

Key design decisions for the feature service:

  • Read path must be microsecond-fast: Redis pipeline for batched lookups
  • Write path is decoupled: Kafka consumer updates Redis asynchronously; the feature service never writes directly
  • Feature versioning: return features with a schema version; model service validates the version matches training

Model Service

The model service loads a trained model artifact and serves predictions over HTTP or gRPC. It is responsible for exactly one thing: turning a feature vector into a prediction. It should know nothing about where features come from.

The model service calls the feature service to get features (or accepts pre-fetched features from the caller), runs inference, and returns the result. It reports latency and prediction distributions to the monitoring service asynchronously (fire and forget - never block the prediction path on monitoring).

Experiment Service

The experiment service handles A/B traffic splitting, feature flag evaluation, and experiment assignment. Before calling the model service, the API gateway checks the experiment service: "which model version should user X see?" The experiment service returns the model version, and the gateway routes accordingly. This allows new model versions to receive a percentage of traffic without code changes.

Monitoring Service

The monitoring service ingests prediction logs, compares current prediction distributions to training-time baselines, detects feature drift, and fires alerts. It is entirely asynchronous - the prediction path never waits for it. Write predictions to a Kafka topic; the monitoring service consumes from that topic offline.


gRPC for Internal ML Service Communication

HTTP/JSON is convenient but expensive. JSON serialization adds CPU overhead. String parsing adds latency. For internal service-to-service calls that happen on the critical path (feature service → model service), gRPC with Protocol Buffers is the right choice.

gRPC gives you:

  • Strongly typed contracts via .proto files - both caller and callee agree on the schema at compile time
  • Binary serialization via Protocol Buffers - 3-10x smaller payloads than JSON
  • Bidirectional streaming - the model service can stream token-by-token responses for LLMs
  • Built-in deadlines and cancellation - set a 20ms deadline; gRPC automatically cancels if exceeded
// feature_service.proto
syntax = "proto3";

package featureservice;

service FeatureService {
rpc GetFeatures (FeatureRequest) returns (FeatureResponse);
rpc GetFeaturesBatch (BatchFeatureRequest) returns (BatchFeatureResponse);
}

message FeatureRequest {
string user_id = 1;
string item_id = 2;
string context_id = 3;
}

message FeatureResponse {
repeated float user_embedding = 1;
repeated float item_embedding = 2;
map<string, float> scalar_features = 3;
string schema_version = 4;
int64 computed_at_ms = 5;
}

message BatchFeatureRequest {
repeated FeatureRequest requests = 1;
}

message BatchFeatureResponse {
repeated FeatureResponse responses = 1;
}
# feature_service_client.py
import grpc
from concurrent.futures import ThreadPoolExecutor
from typing import Optional
import feature_service_pb2
import feature_service_pb2_grpc


class FeatureServiceClient:
"""
gRPC client for the Feature Service.
Handles connection pooling, retries, and deadline management.
"""

def __init__(
self,
host: str = "feature-service:50051",
deadline_ms: int = 20,
max_workers: int = 10,
):
# Shared channel - gRPC manages the connection pool internally
self.channel = grpc.insecure_channel(
host,
options=[
("grpc.keepalive_time_ms", 10000),
("grpc.keepalive_timeout_ms", 5000),
("grpc.max_receive_message_length", 10 * 1024 * 1024), # 10MB
],
)
self.stub = feature_service_pb2_grpc.FeatureServiceStub(self.channel)
self.deadline_seconds = deadline_ms / 1000.0

def get_features(
self,
user_id: str,
item_id: str,
context_id: Optional[str] = None,
) -> Optional[dict]:
"""
Fetch features with a strict deadline.
Returns None on timeout or error - caller must handle gracefully.
"""
request = feature_service_pb2.FeatureRequest(
user_id=user_id,
item_id=item_id,
context_id=context_id or "",
)
try:
response = self.stub.GetFeatures(
request,
timeout=self.deadline_seconds,
)
return {
"user_embedding": list(response.user_embedding),
"item_embedding": list(response.item_embedding),
"scalar_features": dict(response.scalar_features),
"schema_version": response.schema_version,
}
except grpc.RpcError as e:
if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
# Log but do not raise - caller uses fallback features
print(f"[FeatureClient] Deadline exceeded for user={user_id}")
else:
print(f"[FeatureClient] gRPC error: {e.code()}: {e.details()}")
return None

def get_features_batch(self, requests: list) -> list:
"""
Fetch features for multiple entities in one RPC call.
More efficient than N individual calls.
"""
batch_request = feature_service_pb2.BatchFeatureRequest(
requests=[
feature_service_pb2.FeatureRequest(**r) for r in requests
]
)
try:
response = self.stub.GetFeaturesBatch(
batch_request,
timeout=self.deadline_seconds,
)
return [self._parse_response(r) for r in response.responses]
except grpc.RpcError:
return [None] * len(requests)

def _parse_response(self, response) -> dict:
return {
"user_embedding": list(response.user_embedding),
"item_embedding": list(response.item_embedding),
"scalar_features": dict(response.scalar_features),
"schema_version": response.schema_version,
}

def close(self):
self.channel.close()

Circuit Breakers: Handling Downstream Failures

In a microservices architecture, services call other services. When the feature service is slow (maybe Redis is under GC pressure), the model service will pile up threads waiting for features. Without a circuit breaker, this cascades: the model service's thread pool fills up, new requests get rejected, and the entire serving stack degrades.

A circuit breaker is a state machine with three states: Closed (normal operation), Open (failing fast without calling the downstream), and Half-Open (testing if the downstream has recovered).

import time
import threading
from enum import Enum
from typing import Callable, Any, Optional


class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing fast
HALF_OPEN = "half_open" # Testing recovery


class CircuitBreaker:
"""
Circuit breaker for ML service calls.
Prevents cascading failures when downstream services degrade.

Usage:
cb = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

@cb.call
def get_features(user_id):
return feature_client.get_features(user_id)
"""

def __init__(
self,
failure_threshold: int = 5,
recovery_timeout: float = 30.0,
half_open_max_calls: int = 3,
):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.half_open_max_calls = half_open_max_calls

self._state = CircuitState.CLOSED
self._failure_count = 0
self._last_failure_time: Optional[float] = None
self._half_open_calls = 0
self._lock = threading.Lock()

@property
def state(self) -> CircuitState:
with self._lock:
if self._state == CircuitState.OPEN:
if time.time() - self._last_failure_time > self.recovery_timeout:
self._state = CircuitState.HALF_OPEN
self._half_open_calls = 0
return self._state

def call(self, func: Callable) -> Callable:
"""Decorator to wrap a function with circuit breaker protection."""
def wrapper(*args, **kwargs) -> Any:
current_state = self.state

if current_state == CircuitState.OPEN:
raise RuntimeError(
"Circuit breaker is OPEN - failing fast"
)

if current_state == CircuitState.HALF_OPEN:
with self._lock:
if self._half_open_calls >= self.half_open_max_calls:
raise RuntimeError(
"Circuit breaker HALF_OPEN - max test calls reached"
)
self._half_open_calls += 1

try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise

return wrapper

def _on_success(self):
with self._lock:
self._failure_count = 0
self._state = CircuitState.CLOSED

def _on_failure(self):
with self._lock:
self._failure_count += 1
self._last_failure_time = time.time()
if self._failure_count >= self.failure_threshold:
self._state = CircuitState.OPEN
print(
f"[CircuitBreaker] Opened after {self._failure_count} failures"
)

API Gateway for ML Services

The API gateway is the single entry point for all external traffic. It handles authentication (JWT validation), rate limiting, routing to the correct model version, and request/response logging.

from fastapi import FastAPI, Depends, HTTPException, Header
from fastapi.middleware.trustedhost import TrustedHostMiddleware
import httpx
import asyncio
from typing import Optional
import time


app = FastAPI(title="ML API Gateway")

# Rate limiter (in production, use Redis-backed sliding window)
_request_counts: dict = {}
RATE_LIMIT = 1000 # requests per minute per user


async def verify_token(authorization: str = Header(...)) -> str:
"""Validate JWT and extract user_id."""
if not authorization.startswith("Bearer "):
raise HTTPException(status_code=401, detail="Invalid authorization header")
token = authorization.split(" ")[1]
# In production: verify JWT signature against Keycloak JWKS
user_id = _decode_jwt(token)
return user_id


async def rate_limit(user_id: str = Depends(verify_token)) -> str:
"""Simple in-process rate limiter. Use Redis in production."""
now = int(time.time() // 60) # current minute
key = f"{user_id}:{now}"
_request_counts[key] = _request_counts.get(key, 0) + 1
if _request_counts[key] > RATE_LIMIT:
raise HTTPException(status_code=429, detail="Rate limit exceeded")
return user_id


@app.post("/v1/predict/recommendations")
async def predict_recommendations(
request: dict,
user_id: str = Depends(rate_limit),
):
"""
Route recommendation requests to the model service.
Handles experiment assignment before routing.
"""
# 1. Get experiment assignment
async with httpx.AsyncClient() as client:
exp_response = await client.post(
"http://experiment-service/assign",
json={"user_id": user_id, "experiment": "reco_model"},
timeout=5.0,
)
model_version = exp_response.json().get("variant", "v1")

# 2. Route to the correct model service version
model_url = f"http://model-service-{model_version}/predict"
async with httpx.AsyncClient() as client:
response = await client.post(
model_url,
json={**request, "user_id": user_id},
timeout=0.1, # 100ms hard timeout
)

return response.json()


def _decode_jwt(token: str) -> str:
"""Placeholder - in production, verify RS256 signature."""
import base64, json
payload = token.split(".")[1]
padded = payload + "=" * (4 - len(payload) % 4)
decoded = json.loads(base64.b64decode(padded))
return decoded.get("sub", "unknown")

Kubernetes for ML Microservices

Each ML microservice runs as a Kubernetes Deployment. GPU-backed services use resources.limits.nvidia.com/gpu.

# model-service-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: model-service-v2
labels:
app: model-service
version: v2
spec:
replicas: 4
selector:
matchLabels:
app: model-service
version: v2
template:
metadata:
labels:
app: model-service
version: v2
spec:
containers:
- name: model-server
image: registry.internal/model-service:v2.3.1
ports:
- containerPort: 8080
name: http
- containerPort: 50051
name: grpc
resources:
requests:
memory: "8Gi"
cpu: "4"
nvidia.com/gpu: "1"
limits:
memory: "16Gi"
cpu: "8"
nvidia.com/gpu: "1"
env:
- name: MODEL_VERSION
value: "v2"
- name: FEATURE_SERVICE_ADDR
value: "feature-service:50051"
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30 # allow time for model to load
periodSeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
---
# Horizontal Pod Autoscaler based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: model-service-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: model-service-v2
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70

The Distributed Systems Tax

Before you decompose your monolith, understand what you are buying.

You are buying: independent deployability, independent scaling, fault isolation, clear team ownership.

You are paying: network latency on every cross-service call (add 1-5ms per hop), complex distributed tracing (you need Jaeger or Zipkin to debug a request across 4 services), distributed transaction complexity (a database transaction across microservices requires sagas or two-phase commit), operational overhead (12 services means 12 CI/CD pipelines, 12 health checks, 12 alert configurations).

The break-even point for ML systems is roughly: 3+ teams, 5+ models in production, 3+ different scaling requirements. Below that threshold, a well-structured monolith with clear module boundaries is faster to build and easier to operate.


:::danger Tight Coupling Between Services

The most common microservices failure is extracting services that are actually tightly coupled. If your model service calls the feature service synchronously for every prediction, and the feature service calls the user service synchronously for every feature, you have a synchronous call chain three layers deep. A 99th percentile latency of 20ms at each layer compounds to 60ms+ at the top - and a failure at any layer fails the whole request.

Solution: Design for async where possible. Pre-compute and push features to the feature store (async write) rather than computing them on the serving path (sync call). Only make synchronous cross-service calls for components that are truly on the critical latency path. :::

:::warning Service Discovery and Configuration

In a microservices ML platform, services need to find each other and share configuration. Do not hardcode service addresses. Use Kubernetes Service objects (DNS-based discovery), and use ConfigMaps or a configuration service (Consul, etcd) for runtime parameters like model version routing rules and feature schema versions. Hardcoded addresses are a deployment-time landmine that shows up at 2 AM when someone renames a service. :::


Interview Q&A

Q1: When should an ML team split from a monolith into microservices?

The trigger is almost always one of three things: (1) deployments of one model keep breaking other models - failure isolation needed; (2) different components need different compute types (CPU for feature retrieval, GPU for inference, memory-optimized for feature store) - independent scaling needed; (3) separate teams own separate components and they are stepping on each other's deployments - team autonomy needed.

The anti-pattern is splitting into microservices pre-emptively, before you have the pain. A well-structured monolith with clean module boundaries serves teams of up to 5-7 engineers and 3-5 models without significant friction. The overhead of operating 8 microservices is significant and should be justified by concrete problems, not architectural aesthetics.


Q2: How do you design the feature service in an ML microservices architecture?

The feature service has two distinct paths: the write path and the read path. The write path is asynchronous - Kafka consumers write computed features to Redis and Cassandra periodically. This path is decoupled from serving and can tolerate latency. The read path is synchronous and must be extremely fast - under 5ms p99. The read path accepts an entity ID, looks up the feature vector in Redis, validates the schema version, and returns the feature vector.

Key design decisions: use Redis pipeline for batched lookups (fetch 50 user features in one network round-trip instead of 50); use protobuf for serialization (smaller payloads than JSON); implement a local in-process LRU cache for the 1% of users that generate 20% of requests (hot users); always include a schema version with the feature vector so the model service can validate that features match the version it was trained on.


Q3: Explain circuit breakers in the context of ML serving.

A circuit breaker is a state machine that sits between two services. In the Closed state (normal operation), calls pass through. If failures exceed a threshold (say, 5 failures in 10 seconds), the breaker opens. In the Open state, calls are rejected immediately without trying the downstream service - this is "failing fast." After a configurable timeout (say, 30 seconds), the breaker moves to Half-Open and allows a small number of test calls. If those succeed, it closes again; if they fail, it reopens.

For ML systems, circuit breakers matter most between the model service and the feature service. If the feature service is slow (Redis under memory pressure, network congestion), without a circuit breaker the model service will exhaust its thread pool waiting for feature responses. With a circuit breaker, after the first wave of timeouts the breaker opens and the model service falls back to default features (zeros, global averages) - degraded quality, but the service remains responsive.


Q4: How does service discovery work in a Kubernetes-based ML platform?

Kubernetes provides DNS-based service discovery through Service objects. When you create a Kubernetes Service named feature-service in the ml-platform namespace, any pod in the cluster can reach it at feature-service.ml-platform.svc.cluster.local. This DNS name resolves to the ClusterIP, which is a virtual IP that kube-proxy routes to healthy pods via iptables/eBPF rules.

For ML platforms, you typically create one Service per microservice. The model service connects to feature-service:50051 (gRPC) or feature-service:8080 (HTTP) - Kubernetes resolves the hostname, load-balances across pod replicas, and handles pod restarts transparently. For cross-namespace calls (different ML teams, different namespaces), use the fully qualified domain name. For external traffic, use an Ingress or LoadBalancer Service in front of your API gateway.


Q5: What is a service mesh and when does an ML platform need one?

A service mesh (Istio, Linkerd) is a layer of infrastructure that handles cross-service communication concerns - mutual TLS authentication, load balancing, traffic splitting, distributed tracing - transparently, without changing application code. A sidecar proxy (Envoy) runs alongside each service pod and intercepts all network traffic.

An ML platform needs a service mesh when: (1) you need mTLS between internal services for compliance (financial services, healthcare); (2) you want traffic splitting for canary model deployments without custom routing code; (3) you need distributed tracing across 8+ services to debug latency issues. The overhead is real - the Envoy sidecar adds 1-3ms per request and 100-200MB of memory per pod. For most ML platforms, a service mesh becomes worth the overhead at 10+ services or when compliance requirements mandate encrypted internal traffic.


Summary

Microservices for ML systems decompose the monolith along operational boundaries: feature service (fast reads), model service (inference), experiment service (A/B routing), and monitoring service (async). gRPC with Protocol Buffers is the right internal communication protocol - lower latency, stronger typing, better streaming than HTTP/JSON. Circuit breakers prevent cascading failures when downstream services degrade. Kubernetes provides container orchestration, service discovery, and autoscaling. The distributed systems tax is real: only decompose when you have the pain - failure coupling, scaling mismatch, or team ownership conflicts - that microservices genuinely solve.

© 2026 EngineersOfAI. All rights reserved.