DNS, Service Discovery, and Consul
You deploy a new version of your LLM inference service. Ten pods in Kubernetes. The deployment rolls out. Within 30 seconds, most traffic has migrated to the new pods. But some clients - specifically, a batch processing pipeline running on a separate VM - are still sending requests to the old pods, which are now terminating. Requests are failing. The batch job was configured with a hardcoded IP address from last week.
Hardcoded IP addresses are the original sin of distributed systems. The moment you write 10.0.4.23 in a config file, you are betting that the service at that address will never move, never be replaced, never fail and be rescheduled elsewhere. In a Kubernetes cluster where pods come and go continuously, this bet fails within hours. In a multi-region ML platform where inference replicas are scaled up and down based on GPU availability and cost, it fails within minutes.
The solution - service discovery - is the mechanism by which services find each other without knowing each other's addresses in advance. Instead of an IP, you have a name: llm-inference.ml-serving.svc.cluster.local. That name resolves to whatever IP addresses are currently healthy and running the service. When a pod is replaced, the name continues to work. When the service scales from 10 to 50 replicas, the name automatically distributes load across all 50. When a replica fails its health check, the name stops returning that replica's IP, steering new lookups away before most clients ever send it a request.
DNS is the oldest and most universal form of service discovery - it is the protocol that translates names to addresses, and it has been doing this for the entire internet since 1983. But DNS was designed for relatively static infrastructure. Modern ML platforms need service discovery that can react to changes in seconds, not minutes, and that can encode rich health information beyond just "this IP exists." Consul, etcd, and Kubernetes' built-in DNS (powered by CoreDNS) are the tools that extend DNS with the dynamism and health-awareness that ML infrastructure requires.
This lesson builds a precise understanding of how DNS works at the protocol level, how Kubernetes uses it for service discovery, how Consul extends service discovery with health checking and multi-datacenter routing, and how ML serving systems use these tools to register model endpoints, perform A/B routing, and fail over gracefully. By the end, you will understand how a request for https://inference.production.svc.cluster.local/v1/predict finds the right pod, and what happens when that pod dies mid-request.
Why This Exists
In the early internet, every computer that needed to communicate with every other computer had a file called /etc/hosts that mapped hostnames to IP addresses. The entire internet's address book was distributed as a single text file called HOSTS.TXT, maintained at Stanford Research Institute, and downloaded periodically. This worked fine when there were 200 hosts on the internet. By 1982, the network had grown to thousands of hosts and the centralized file was updated twice a week - meaning addresses were wrong for days at a time, and the network traffic to distribute the file was becoming problematic.
Paul Mockapetris invented DNS (Domain Name System) in 1983, published as RFC 882 and RFC 883. The key insight was to distribute the address book: instead of one central file, each organization manages its own portion of the namespace (a "zone"), and a hierarchical system of servers distributes the lookup process. The design has scaled from thousands to hundreds of millions of domains without fundamental architectural changes.
Service discovery in microservices and distributed systems faces the same problem DNS solved - but at a different time scale. DNS was designed for infrastructure that changes on the scale of days (new servers added, old ones retired). Kubernetes pods change on the scale of seconds (pod crashes, rolling deployments, autoscaling). The modern service discovery stack extends DNS with lower TTLs, active health checking, and real-time catalog updates to match that pace.
Historical Context
DNS was standardized in RFC 1034 and RFC 1035 in 1987, refining the 1983 RFCs. These documents remain the authoritative specification today - a remarkable 37-year-old standard still in use unchanged at its core.
ZooKeeper, developed at Yahoo in 2007, was the first widely deployed distributed coordination service that enabled dynamic service discovery in large clusters. Hadoop clusters used ZooKeeper for leader election and service registry. The pattern spread to HBase, Kafka (still uses ZooKeeper, though migrating to KRaft), and early microservice deployments at Twitter and Netflix.
etcd was developed by CoreOS in 2013 as a simpler alternative to ZooKeeper, using Raft consensus instead of ZooKeeper's Zab protocol. Kubernetes adopted etcd as its primary data store for all cluster state, including service endpoint information.
Consul was released by HashiCorp in 2014, adding multi-datacenter support, integrated health checking, and a DNS interface on top of a distributed key-value store. It became the dominant service mesh and service discovery tool for non-Kubernetes infrastructure.
CoreDNS, adopted as Kubernetes' default DNS server in 1.13 (December 2018), replaced kube-dns and provides the *.svc.cluster.local DNS resolution that Kubernetes services rely on.
Core Concepts: The DNS Resolution Chain
When a process in a Kubernetes pod resolves llm-inference.ml-serving.svc.cluster.local, here is the sequence of events:
1. The pod's stub resolver reads /etc/resolv.conf, which points at the cluster DNS service - CoreDNS, typically at 10.96.0.10.
2. Because of the default ndots:5 setting, a name with fewer than five dots is first tried with the pod's search suffixes before being queried as an absolute name (more on this in the ndots warning below).
3. CoreDNS receives the UDP query on port 53 and checks its cache; on a miss, it answers from its in-memory view of Services and Endpoints, kept current by a watch on the Kubernetes API server.
4. CoreDNS returns an A record - the ClusterIP for a normal service, or the individual pod IPs for a headless service - with a short TTL (5 seconds by default).
5. The client connects to the returned IP; for a ClusterIP, kube-proxy's iptables/eBPF rules DNAT the connection to a healthy pod IP.
The sections below unpack each of these pieces: record types, TTLs, and the two kinds of Kubernetes services.
DNS Record Types for Service Discovery
Different record types serve different roles in ML infrastructure:
| Record | Purpose | ML Use Case |
|---|---|---|
| A | Maps hostname to IPv4 address | Pod IP for inference replicas |
| AAAA | Maps hostname to IPv6 address | IPv6-native clusters |
| CNAME | Alias: one name to another | inference-v2.example.com -> inference.example.com |
| SRV | Service location: name, port, priority, weight | gRPC service port discovery |
| TXT | Arbitrary text metadata | Model version, capability tags |
SRV records are particularly useful for ML serving because they encode both the hostname and port:
# SRV record format: priority weight port target
_grpc._tcp.triton.ml-serving.svc.cluster.local. 300 IN SRV 0 10 8001 triton-0.triton.ml-serving.svc.cluster.local.
_grpc._tcp.triton.ml-serving.svc.cluster.local. 300 IN SRV 0 10 8001 triton-1.triton.ml-serving.svc.cluster.local.
# Client can discover both host AND port from a single DNS lookup
# Weight field (10) enables weighted load balancing via DNS
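To see these fields from code, a minimal dnspython sketch (assuming the Triton records above exist in your cluster DNS):

```python
import dns.resolver  # pip install dnspython

answers = dns.resolver.resolve(
    "_grpc._tcp.triton.ml-serving.svc.cluster.local", "SRV"
)
for srv in answers:
    # Each SRV rdata carries priority, weight, port, and target hostname
    print(srv.priority, srv.weight, srv.port, str(srv.target))
```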
TTL and Caching
Every DNS record has a TTL (Time To Live) in seconds. The stub resolver on a client caches records for the TTL duration. For service discovery:
- TTL too high: clients cache stale addresses, traffic goes to dead pods after failover
- TTL too low: high DNS query load, CoreDNS becomes a bottleneck
In Kubernetes, CoreDNS serves service records with a default TTL of 5 seconds (tunable in the Corefile). The same default applies to headless services (used with StatefulSets): each pod's individual DNS record is served with a 5-second TTL.
For fast failover in ML serving: TTL=5s + health check=5s + timeout=1s = ~11 seconds maximum failover time.
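Where that TTL is set: CoreDNS's Corefile controls it via the kubernetes plugin's ttl option. A representative Corefile sketch (values illustrative, not a drop-in for every cluster):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        ttl 5                  # TTL served for cluster records (5s is the default)
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30                   # CoreDNS's own response cache
}
```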
Kubernetes DNS Architecture
ClusterIP Services
A ClusterIP service gets a stable virtual IP address (the ClusterIP) that never changes, even as pods come and go. kube-proxy on each node programs iptables/eBPF rules to load-balance connections to that VIP across healthy pod IPs.
# ClusterIP service for ML inference
apiVersion: v1
kind: Service
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  selector:
    app: llm-inference
  type: ClusterIP
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
DNS name: llm-inference.ml-serving.svc.cluster.local
Resolves to: the stable ClusterIP (e.g., 10.96.14.23)
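A quick way to observe this from inside a pod is a standard-library lookup; the printed address is illustrative:

```python
import socket

# Resolves to the stable ClusterIP, not to individual pod IPs -
# kube-proxy fans the actual connection out to a healthy pod.
infos = socket.getaddrinfo(
    "llm-inference.ml-serving.svc.cluster.local", 8000,
    family=socket.AF_INET, type=socket.SOCK_STREAM,
)
for *_, sockaddr in infos:
    print(sockaddr)  # e.g. ('10.96.14.23', 8000)
```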
Headless Services for StatefulSets
A headless service (clusterIP: None) does not allocate a virtual IP. Instead, DNS returns the actual pod IPs directly. This is essential for stateful ML workloads (distributed training workers, parameter servers) where clients need to connect to specific pods, not arbitrary ones.
# Headless service for distributed training workers
apiVersion: v1
kind: Service
metadata:
  name: trainer
  namespace: ml-training
spec:
  selector:
    app: distributed-trainer
  clusterIP: None  # Headless - no VIP, return pod IPs directly
  ports:
    - port: 29500  # NCCL/gloo rendezvous port
DNS records for a StatefulSet with replicas: 4:
trainer-0.trainer.ml-training.svc.cluster.local -> 10.244.0.5
trainer-1.trainer.ml-training.svc.cluster.local -> 10.244.1.7
trainer-2.trainer.ml-training.svc.cluster.local -> 10.244.2.3
trainer-3.trainer.ml-training.svc.cluster.local -> 10.244.0.9
Each pod has a stable, predictable DNS name based on its ordinal. trainer-0 is always rank 0, so a launch with torchrun --rdzv_endpoint=trainer-0.trainer.ml-training.svc.cluster.local:29500 will always find the rendezvous server.
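A sketch of how a worker can exploit this naming, assuming the headless service above and that the pod hostname carries the ordinal (the env-var names follow torch.distributed conventions):

```python
import os
import socket

hostname = socket.gethostname()           # e.g. "trainer-2"
rank = int(hostname.rsplit("-", 1)[-1])   # StatefulSet ordinal -> rank

# Every worker points at the same stable rendezvous hostname;
# only rank 0 actually hosts the store behind it.
os.environ["RANK"] = str(rank)
os.environ["MASTER_ADDR"] = "trainer-0.trainer.ml-training.svc.cluster.local"
os.environ["MASTER_PORT"] = "29500"
```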
Code: Python DNS Resolution and Analysis
"""
dns_diagnostics.py - DNS resolution diagnostics for ML infrastructure.
Useful for debugging service discovery issues in distributed training and inference.
"""
import socket
import time
import dns.resolver # pip install dnspython
import dns.query
import dns.name
from typing import List, Dict, Optional
from dataclasses import dataclass
@dataclass
class DNSRecord:
name: str
record_type: str
value: str
ttl: int
def resolve_service(
hostname: str,
record_type: str = "A",
nameserver: Optional[str] = None,
) -> List[DNSRecord]:
"""
Resolve a DNS name and return structured records.
Works for both Kubernetes internal DNS and external DNS.
"""
resolver = dns.resolver.Resolver()
if nameserver:
resolver.nameservers = [nameserver]
records = []
try:
answers = resolver.resolve(hostname, record_type)
for rdata in answers:
records.append(DNSRecord(
name=hostname,
record_type=record_type,
value=str(rdata),
ttl=answers.ttl,
))
except dns.resolver.NXDOMAIN:
print(f"[dns] NXDOMAIN: {hostname} does not exist")
except dns.resolver.NoAnswer:
print(f"[dns] NoAnswer: {hostname} exists but has no {record_type} records")
except dns.resolver.Timeout:
print(f"[dns] Timeout resolving {hostname}")
return records
def measure_dns_latency(
hostname: str,
n_samples: int = 20,
nameserver: str = "10.96.0.10", # Kubernetes CoreDNS
) -> dict:
"""
Measure DNS resolution latency to detect CoreDNS performance issues.
High DNS latency (>10ms) can significantly impact ML inference throughput
when each request requires a DNS lookup.
"""
latencies = []
resolver = dns.resolver.Resolver()
resolver.nameservers = [nameserver]
resolver.cache = None # Disable caching for accurate measurement
for _ in range(n_samples):
start = time.perf_counter()
try:
resolver.resolve(hostname, "A")
except Exception:
pass
latencies.append((time.perf_counter() - start) * 1000)
import statistics
return {
"hostname": hostname,
"nameserver": nameserver,
"n": n_samples,
"mean_ms": round(statistics.mean(latencies), 2),
"median_ms": round(statistics.median(latencies), 2),
"p99_ms": round(sorted(latencies)[int(n_samples * 0.99)], 2),
"max_ms": round(max(latencies), 2),
}
def discover_kubernetes_endpoints(
service_name: str,
namespace: str,
cluster_domain: str = "cluster.local",
) -> dict:
"""
Discover all endpoints for a Kubernetes service via DNS.
Combines ClusterIP lookup (for load-balanced access) and
headless lookup (for direct pod access).
"""
base = f"{service_name}.{namespace}.svc.{cluster_domain}"
result = {
"service_fqdn": base,
"clusterip_records": [],
"pod_records": [], # From headless service (if available)
"srv_records": [],
}
# ClusterIP record
result["clusterip_records"] = resolve_service(base, "A")
# SRV records for port discovery
for proto in ["_http._tcp", "_grpc._tcp", "_metrics._tcp"]:
srv = resolve_service(f"{proto}.{base}", "SRV")
result["srv_records"].extend(srv)
# Try to get individual pod records (headless service)
# StatefulSet pods follow the pattern: pod-N.service.namespace.svc.cluster.local
pod_idx = 0
while pod_idx < 100: # Safety limit
pod_fqdn = f"{service_name}-{pod_idx}.{base}"
records = resolve_service(pod_fqdn, "A")
if not records:
break
result["pod_records"].extend(records)
pod_idx += 1
return result
def watch_dns_changes(
hostname: str,
interval_seconds: float = 2.0,
duration_seconds: float = 60.0,
):
"""
Watch a DNS record for changes over time.
Useful for monitoring rolling deployments or failover events.
"""
print(f"Watching DNS changes for {hostname} every {interval_seconds}s")
print(f"Duration: {duration_seconds}s\n")
previous_ips = set()
start_time = time.time()
while time.time() - start_time < duration_seconds:
records = resolve_service(hostname, "A")
current_ips = {r.value for r in records}
if current_ips != previous_ips:
timestamp = time.strftime("%H:%M:%S")
added = current_ips - previous_ips
removed = previous_ips - current_ips
if added:
print(f"[{timestamp}] NEW endpoints: {added}")
if removed:
print(f"[{timestamp}] REMOVED endpoints: {removed}")
previous_ips = current_ips
time.sleep(interval_seconds)
def simulate_dns_failover(
primary_hostname: str,
fallback_hostname: str,
timeout_ms: float = 100.0,
) -> str:
"""
DNS-based failover: resolve primary, fall back to secondary on failure.
Common pattern for ML serving multi-region deployments.
"""
# Try primary
try:
records = resolve_service(primary_hostname, "A")
if records:
# Test connectivity before committing
ip = records[0].value
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(timeout_ms / 1000)
result = sock.connect_ex((ip, 8000))
sock.close()
if result == 0:
return ip
except Exception as e:
print(f"[failover] Primary {primary_hostname} failed: {e}")
# Fall back to secondary
records = resolve_service(fallback_hostname, "A")
if records:
print(f"[failover] Using fallback: {fallback_hostname} -> {records[0].value}")
return records[0].value
raise RuntimeError(f"Both {primary_hostname} and {fallback_hostname} unavailable")
Consul Architecture
Consul is a service mesh and service discovery tool from HashiCorp. It provides three things that DNS alone cannot: active health checking, multi-datacenter routing, and a key-value store for distributed configuration.
Consul Components
Consul uses two gossip protocols:
- LAN gossip (Serf): Agents in the same datacenter discover each other and share node health
- WAN gossip: Consul servers across datacenters exchange reachability information
Consul Service Registration
"""
consul_service_registration.py - Register and discover ML services with Consul.
Install: pip install python-consul
Consul must be running: consul agent -dev (development mode)
"""
import consul
import time
import socket
import logging
from typing import Optional, List, Dict
from dataclasses import dataclass
@dataclass
class MLServiceEndpoint:
service_id: str
service_name: str
address: str
port: int
tags: List[str]
metadata: Dict[str, str]
class ConsulMLRegistry:
"""
Service registry for ML models using Consul.
Handles registration, health checking, and discovery.
"""
def __init__(self, host: str = "127.0.0.1", port: int = 8500):
self._client = consul.Consul(host=host, port=port)
self._hostname = socket.gethostname()
self._registered_services = []
def register_inference_service(
self,
model_name: str,
model_version: str,
serving_port: int = 8000,
metrics_port: int = 9090,
tags: Optional[List[str]] = None,
) -> str:
"""
Register an ML inference service with Consul.
Health check hits /health endpoint every 10 seconds.
"""
service_id = f"{model_name}-{model_version}-{self._hostname}"
if tags is None:
tags = []
self._client.agent.service.register(
name=f"ml-inference-{model_name}",
service_id=service_id,
address=socket.gethostbyname(self._hostname),
port=serving_port,
tags=[
f"version:{model_version}",
f"model:{model_name}",
] + tags,
# Key-value metadata (Consul 1.6.4+)
meta={
"model_name": model_name,
"model_version": model_version,
"framework": "pytorch",
"gpu_type": "a100",
},
check=consul.Check.http(
url=f"http://{self._hostname}:{serving_port}/health",
interval="10s",
timeout="5s",
deregister="60s", # Auto-deregister if unhealthy for 60s
),
)
# Also register metrics endpoint
self._client.agent.check.register(
name=f"{service_id}-metrics",
check=consul.Check.http(
url=f"http://{self._hostname}:{metrics_port}/metrics",
interval="30s",
),
)
self._registered_services.append(service_id)
logging.info(f"Registered {service_id} with Consul")
return service_id
def discover_healthy_endpoints(
self,
model_name: str,
version_filter: Optional[str] = None,
datacenter: Optional[str] = None,
) -> List[MLServiceEndpoint]:
"""
Discover all healthy instances of a model, optionally filtered by version.
Only returns services that passed their health check.
"""
index, services = self._client.health.service(
service=f"ml-inference-{model_name}",
passing=True, # Only healthy instances
dc=datacenter,
)
endpoints = []
for service_entry in services:
svc = service_entry["Service"]
meta = svc.get("Meta", {})
# Optional version filter
svc_version = meta.get("model_version", "unknown")
if version_filter and svc_version != version_filter:
continue
endpoints.append(MLServiceEndpoint(
service_id=svc["ID"],
service_name=svc["Service"],
address=svc["Address"],
port=svc["Port"],
tags=svc.get("Tags", []),
metadata=meta,
))
return endpoints
def watch_for_changes(
self,
model_name: str,
callback,
timeout: int = 30,
):
"""
Long-poll Consul for endpoint changes. Consul's blocking queries
only return when something changes, avoiding polling overhead.
callback(endpoints) is called whenever the service catalog changes.
"""
index = 0
while True:
try:
# Blocking query: waits up to `timeout` seconds for changes
index, services = self._client.health.service(
service=f"ml-inference-{model_name}",
passing=True,
index=index, # Consul returns immediately if > current index
wait=f"{timeout}s",
)
endpoints = [
MLServiceEndpoint(
service_id=s["Service"]["ID"],
service_name=s["Service"]["Service"],
address=s["Service"]["Address"],
port=s["Service"]["Port"],
tags=s["Service"].get("Tags", []),
metadata=s["Service"].get("Meta", {}),
)
for s in services
]
callback(endpoints)
except Exception as e:
logging.error(f"Consul watch error: {e}")
time.sleep(5)
def store_model_config(self, model_name: str, config: dict):
"""
Store model configuration in Consul KV store.
All serving instances can read this config and react to changes.
"""
import json
key = f"ml/models/{model_name}/config"
self._client.kv.put(key, json.dumps(config))
def get_model_config(self, model_name: str) -> Optional[dict]:
"""Read model configuration from Consul KV."""
import json
key = f"ml/models/{model_name}/config"
index, data = self._client.kv.get(key)
if data is None:
return None
return json.loads(data["Value"])
def deregister_all(self):
"""Deregister all services registered by this instance. Call on shutdown."""
for service_id in self._registered_services:
self._client.agent.service.deregister(service_id)
self._registered_services.clear()
etcd for Distributed Coordination
etcd is a strongly consistent distributed key-value store. Kubernetes uses it as its primary data store for all cluster state. Every Kubernetes object - pods, services, deployments, secrets - is stored in etcd. Service endpoint changes flow through etcd to CoreDNS.
etcd vs ZooKeeper vs Consul
| Feature | etcd | ZooKeeper | Consul |
|---|---|---|---|
| Consensus | Raft | Zab | Raft |
| API | gRPC/HTTP | Custom TCP | HTTP/DNS |
| Watch API | Yes | Yes | Yes (blocking queries) |
| Multi-datacenter | No (single cluster) | No (single cluster) | Yes (native) |
| Health checks | No (external) | No (ephemeral nodes) | Yes (integrated) |
| Service mesh | No | No | Yes |
| Primary use | Kubernetes state | Kafka/Hadoop | Service discovery |
"""
etcd_ml_coordination.py - Using etcd for distributed ML training coordination.
Install: pip install etcd3
Requires: etcd server running (etcd --listen-client-urls http://0.0.0.0:2379)
"""
import etcd3
import json
import time
import threading
from typing import Optional, Callable
class DistributedMLCoordinator:
"""
etcd-based coordinator for distributed ML training jobs.
Used for:
- Rank-to-host mapping (so workers find each other)
- Distributed barrier synchronization
- Leader election for parameter server
- Shared hyperparameter configuration
"""
def __init__(self, etcd_host: str = "localhost", etcd_port: int = 2379):
self._client = etcd3.client(host=etcd_host, port=etcd_port)
self._prefix = "/ml-training"
def register_worker(
self,
job_id: str,
rank: int,
address: str,
port: int,
ttl_seconds: int = 30,
):
"""
Register a training worker with its rank and address.
Uses TTL-based lease so dead workers are automatically cleaned up.
"""
key = f"{self._prefix}/{job_id}/workers/{rank}"
value = json.dumps({"address": address, "port": port, "rank": rank})
# Create lease - key auto-expires if not refreshed
lease = self._client.lease(ttl_seconds)
self._client.put(key, value, lease=lease)
# Start keepalive thread
def keepalive():
while True:
try:
lease.refresh()
time.sleep(ttl_seconds // 3)
except Exception:
break
t = threading.Thread(target=keepalive, daemon=True)
t.start()
return lease
def get_all_workers(self, job_id: str) -> dict:
"""Get address/port for all registered workers in a job."""
prefix = f"{self._prefix}/{job_id}/workers/"
workers = {}
for value, metadata in self._client.get_prefix(prefix):
key = metadata.key.decode()
rank = int(key.split("/")[-1])
worker_info = json.loads(value.decode())
workers[rank] = worker_info
return workers
def wait_for_all_workers(
self,
job_id: str,
expected_count: int,
timeout_seconds: int = 300,
) -> dict:
"""
Block until all `expected_count` workers have registered.
This is the distributed rendezvous - equivalent to torch.distributed's
init_process_group barrier, but implemented over etcd.
"""
deadline = time.time() + timeout_seconds
while time.time() < deadline:
workers = self.get_all_workers(job_id)
if len(workers) == expected_count:
return workers
remaining = deadline - time.time()
print(f"[etcd] Waiting for workers: {len(workers)}/{expected_count}, "
f"{remaining:.0f}s remaining")
time.sleep(2)
raise TimeoutError(
f"Only {len(self.get_all_workers(job_id))}/{expected_count} workers "
f"registered within {timeout_seconds}s"
)
def distributed_barrier(
self,
job_id: str,
barrier_name: str,
rank: int,
world_size: int,
timeout_seconds: int = 60,
):
"""
etcd-based distributed barrier. All ranks must reach this function
before any rank returns. Simpler (slower) than NCCL barrier but
works without GPU communication - useful for checkpoint synchronization.
"""
barrier_key = f"{self._prefix}/{job_id}/barriers/{barrier_name}/{rank}"
self._client.put(barrier_key, "ready")
deadline = time.time() + timeout_seconds
while time.time() < deadline:
count = sum(1 for _ in self._client.get_prefix(
f"{self._prefix}/{job_id}/barriers/{barrier_name}/"
))
if count >= world_size:
return # All ranks reached barrier
time.sleep(0.5)
# Clean up partial barrier
self._client.delete_prefix(
f"{self._prefix}/{job_id}/barriers/{barrier_name}/"
)
raise TimeoutError(f"Barrier {barrier_name} timed out")
def elect_leader(self, job_id: str, candidate_id: str) -> bool:
"""
Simple leader election: first rank to write wins.
Returns True if this candidate became leader, False otherwise.
Used for: selecting which rank saves checkpoints, which aggregates metrics.
"""
leader_key = f"{self._prefix}/{job_id}/leader"
# Conditional put: only write if key doesn't exist
success, _ = self._client.transaction(
compare=[
etcd3.transactions.Version(leader_key) == 0 # Key does not exist
],
success=[
etcd3.transactions.Put(leader_key, candidate_id)
],
failure=[],
)
return success
def watch_config(
self,
job_id: str,
config_key: str,
on_change: Callable[[dict], None],
):
"""
Watch for real-time config changes via etcd watch.
Enables live hyperparameter updates during training.
"""
full_key = f"{self._prefix}/{job_id}/config/{config_key}"
def watch_callback(events):
for event in events:
if isinstance(event, etcd3.events.PutEvent):
config = json.loads(event.value.decode())
on_change(config)
self._client.add_watch_callback(full_key, watch_callback)
Service Discovery Patterns: Client-Side vs Server-Side
Client-side discovery gives the client full control over load balancing strategy (round-robin, least-connections, A/B routing by model version). It requires the client to include service registry logic. Used in: Consul with custom ML routing (route 10% to model v2, 90% to v1).
Server-side discovery is simpler for clients - they just talk to a VIP. The load balancing logic lives in the infrastructure. Used in: Kubernetes ClusterIP services, AWS Application Load Balancer. Most ML serving deployments use server-side discovery.
DNS-Based A/B Routing for ML Models
"""
dns_ab_routing.py - DNS-based A/B model routing using weighted round-robin.
Implements gradual rollout from model-v1 to model-v2 via DNS record weights.
"""
import dns.resolver
import random
from typing import Tuple, Optional
def weighted_endpoint_selection(
model_name: str,
consul_host: str = "127.0.0.1",
consul_dns_port: int = 8600,
) -> Tuple[str, int]:
"""
Use Consul DNS to do weighted service selection.
Consul returns SRV records with weights for A/B routing.
To set up: Register two services with different tags:
- ml-inference-llama3 (tag: stable) -> weight 90
- ml-inference-llama3-v2 (tag: canary) -> weight 10
This implements 90%/10% traffic split at the DNS level.
"""
resolver = dns.resolver.Resolver()
resolver.nameservers = [consul_host]
resolver.port = consul_dns_port
try:
# Consul DNS returns SRV records with weights
answers = resolver.resolve(
f"{model_name}.service.consul",
"SRV"
)
# Weighted random selection
total_weight = sum(a.weight for a in answers)
rand = random.uniform(0, total_weight)
cumulative = 0
for answer in answers:
cumulative += answer.weight
if rand <= cumulative:
# Resolve the target hostname to IP
ip_answers = resolver.resolve(str(answer.target), "A")
ip = str(ip_answers[0])
return ip, answer.port
except Exception as e:
raise RuntimeError(f"Service discovery failed for {model_name}: {e}")
class ABModelRouter:
"""
Route ML inference requests between model versions using DNS-based discovery.
Supports gradual rollout: start at 5%, increase to 100% over time.
"""
def __init__(self, stable_name: str, canary_name: str):
self.stable_name = stable_name
self.canary_name = canary_name
self._canary_pct = 0.0
def set_canary_percentage(self, pct: float):
"""Set what percentage of traffic goes to canary model (0.0-1.0)."""
self._canary_pct = max(0.0, min(1.0, pct))
print(f"[ab-router] Canary traffic: {self._canary_pct*100:.0f}%")
def select_endpoint(self) -> Tuple[str, str]:
"""
Select an endpoint based on canary percentage.
Returns (endpoint_address, model_version) tuple.
"""
if random.random() < self._canary_pct:
ip, port = weighted_endpoint_selection(self.canary_name)
return f"http://{ip}:{port}", "canary"
else:
ip, port = weighted_endpoint_selection(self.stable_name)
return f"http://{ip}:{port}", "stable"
Kubernetes Headless Service for StatefulSet ML Workers
# statefulset-ml-workers.yaml
# Headless service + StatefulSet for distributed training workers
# Each worker gets a stable DNS name: worker-N.ml-workers.training.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: ml-workers
  namespace: training
  labels:
    app: distributed-trainer
spec:
  clusterIP: None  # Headless - DNS returns pod IPs directly
  selector:
    app: distributed-trainer
  ports:
    - name: rdzvport
      port: 29500  # torch.distributed rendezvous
    - name: nccl
      port: 29501  # NCCL communication port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker
  namespace: training
spec:
  serviceName: "ml-workers"  # Must match headless service name
  replicas: 8
  selector:
    matchLabels:
      app: distributed-trainer
  template:
    metadata:
      labels:
        app: distributed-trainer
    spec:
      containers:
        - name: trainer
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
          command:
            - torchrun
            - --nproc_per_node=1
            - --nnodes=8
            - --node_rank=$(WORKER_RANK)
            # DNS-based rendezvous: worker-0 is always rank 0's hostname
            - --rdzv_endpoint=worker-0.ml-workers.training.svc.cluster.local:29500
            - --rdzv_backend=c10d
            - /workspace/train.py
          env:
            - name: WORKER_RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
          resources:
            limits:
              nvidia.com/gpu: "1"
Production Engineering Notes
DNS Caching in Python Applications
Python's socket.getaddrinfo() does not cache DNS results by default. Every connection attempt triggers a fresh DNS lookup, adding 1-5ms per request in a Kubernetes cluster. For high-throughput ML inference clients, implement a DNS cache:
import socket
import time
from threading import Lock


class DNSCache:
    """Simple TTL-based DNS cache to avoid per-request lookups."""

    def __init__(self, ttl_seconds: int = 10):
        self._cache = {}  # hostname -> (ip, expires_at)
        self._lock = Lock()
        self._ttl = ttl_seconds

    def resolve(self, hostname: str) -> str:
        with self._lock:
            if hostname in self._cache:
                ip, expires_at = self._cache[hostname]
                if time.time() < expires_at:
                    return ip
        # Cache miss - do actual DNS lookup
        ip = socket.gethostbyname(hostname)
        with self._lock:
            self._cache[hostname] = (ip, time.time() + self._ttl)
        return ip
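Usage is a one-line substitution at connection time (sketch):

```python
cache = DNSCache(ttl_seconds=10)
ip = cache.resolve("llm-inference.ml-serving.svc.cluster.local")
# Connect to `ip` directly; re-resolution happens at most once per 10s window.
```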
DNS TTL Tuning for ML Failover
For fast failover in ML serving, you need short DNS TTLs but not so short they overload CoreDNS:
- Normal operation: TTL=30s is fine (reduces DNS load)
- During deployment/migration: Temporarily reduce to TTL=5s before the change, then restore
- For critical inference APIs: Use Kubernetes readiness probes aggressively (every 5s) to remove failed pods from DNS before clients see failures
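A representative probe block for an inference container (path and thresholds illustrative, assuming the server exposes /health):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5      # probe every 5s
  timeoutSeconds: 1
  failureThreshold: 2   # pod leaves Endpoints (and DNS) after ~10s of failures
```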
:::danger CoreDNS as a Single Point of Failure
In a default Kubernetes installation, CoreDNS runs as 2 replicas. If both CoreDNS pods are on the same node and that node fails, all DNS resolution in the cluster stops. Every ML service call that requires a DNS lookup will fail, including internal service-to-service calls.
Mitigations:
- Run 3+ CoreDNS replicas with podAntiAffinity to spread them across nodes
- Use ndots:2 in pod DNS config to reduce unnecessary DNS search-path lookups
- Cache DNS responses in your ML application (TTL-aware cache)
- For critical paths, use ClusterIP addresses directly (hardcode the stable VIP, not the DNS name) as a fallback
:::
:::warning DNS Search Domain Amplification
Kubernetes pods have a default ndots:5 DNS search domain configuration. This means that for any hostname with fewer than 5 dots, the resolver tries multiple search domains before attempting a direct lookup:
Query for "redis":
1. redis.ml-training.svc.cluster.local (try)
2. redis.svc.cluster.local (try)
3. redis.cluster.local (try)
4. redis. (try as absolute name)
This multiplies DNS queries by 3-4x for short hostnames. For high-throughput ML serving, use fully-qualified domain names (with trailing dot) in your service configs, or set ndots:2:
# Pod spec
dnsConfig:
  options:
    - name: ndots
      value: "2"
:::
Interview Q&A
Q1: Walk through what happens at the network level when a Kubernetes pod calls socket.connect("llm-inference.ml-serving.svc.cluster.local", 8000).
The pod's stub resolver reads /etc/resolv.conf, which points to CoreDNS at 10.96.0.10. The name has four dots, so with the default ndots:5 it does not qualify as absolute: the resolver first tries the search suffixes from resolv.conf (queries like llm-inference.ml-serving.svc.cluster.local.ml-serving.svc.cluster.local, which return NXDOMAIN) before querying the name as-is - one reason to use trailing dots in configs. CoreDNS receives the UDP DNS query on port 53 and checks its cache; on a miss, it looks up the service name in its Kubernetes data store (backed by an API server watch, not a direct etcd query). CoreDNS returns an A record with the ClusterIP (e.g., 10.96.14.23) and a short TTL (5 seconds by default). The pod's TCP SYN goes to 10.96.14.23:8000. kube-proxy's iptables/eBPF rules on the local node intercept this and DNAT it to one of the healthy pod IPs (e.g., 10.244.3.7:8000). The connection reaches the actual pod.
Q2: What is a Kubernetes headless service, when would you use one for ML, and what DNS records does it create?
A headless service (spec.clusterIP: None) does not allocate a virtual IP. Instead, when clients query the DNS name, they get back the actual pod IPs directly (DNS A records for each pod). For a headless service named "trainer" in namespace "training" with 4 pods, DNS returns all 4 pod IPs for a query to trainer.training.svc.cluster.local. Additionally, StatefulSet pods get individual DNS names: trainer-0.trainer.training.svc.cluster.local through trainer-3.trainer.training.svc.cluster.local. Use headless services for distributed training workers (each worker needs a stable, specific hostname for rendezvous), parameter servers (each shard is a specific pod), and any stateful ML workload where you need to address specific instances.
Q3: Explain the difference between client-side and server-side service discovery, and which pattern is better for A/B testing ML model versions.
Client-side discovery: the client queries the service registry directly, gets a list of healthy endpoints, and makes its own load balancing decision. Consul with a custom client library is an example. Server-side discovery: the client calls a stable address (DNS name or VIP), and the infrastructure (kube-proxy, load balancer) routes to a healthy backend. Kubernetes ClusterIP is an example. For A/B testing ML models, client-side discovery is more powerful: the client can inspect endpoint metadata (model version tag) and implement weighted selection (10% to v2, 90% to v1) with precise control. Server-side discovery requires the load balancer to support weighted routing (Envoy's weighted clusters, AWS ALB weighted target groups) - possible but less flexible. The tradeoff: client-side puts routing logic in application code (more control, more complexity); server-side keeps it in infrastructure (less control, less complexity).
Q4: How does etcd's Raft consensus work, and why does Kubernetes use it instead of ZooKeeper?
Raft elects a leader among a cluster of servers. Only the leader accepts write requests; it replicates each write to a majority of followers before acknowledging success. If the leader fails, a new election chooses a replacement within a few seconds. Kubernetes chose etcd over ZooKeeper primarily for operational simplicity: etcd exposes gRPC/HTTP APIs (easy to debug, a natural fit for the Go ecosystem), ships as a single static binary that is simple to deploy, and came out of CoreOS's container-infrastructure work, making it a ready match when Kubernetes launched in 2014. ZooKeeper uses a custom binary protocol and was designed for the Java ecosystem (Hadoop, Kafka). Raft was also explicitly designed to be easier to reason about than ZooKeeper's Zab protocol.
Q5: A distributed training job uses the hostname trainer-0.trainer.ml-training.svc.cluster.local as the rendezvous endpoint. The trainer-0 pod is rescheduled to a different node with a new IP. Does the training job break?
No - this is the key advantage of headless services with StatefulSets. When trainer-0 is rescheduled, Kubernetes updates the DNS record for trainer-0.trainer.ml-training.svc.cluster.local to point to the new pod IP. The DNS TTL (typically 5-10 seconds) means workers that cached the old IP will get the new IP within 5-10 seconds. If the rendezvous timeout is longer than the TTL (which it should be), the reconnection succeeds automatically. The ordinal suffix (-0) ensures the same logical rank always has the same hostname, regardless of which physical node it runs on. This is why StatefulSets with headless services are the recommended topology for distributed training on Kubernetes.
Q6: What is DNS TTL and how does it affect failover time for ML inference services?
TTL (Time To Live) is the number of seconds a DNS client should cache a record before re-querying. For a Kubernetes service with TTL=30s: if a pod fails, kube-proxy removes it from the load-balancing pool almost immediately. Clients that cached the ClusterIP for 30 seconds keep sending to the VIP, which kube-proxy correctly routes away from the failed pod. For DNS-based routing (multiple services, or external DNS), if the DNS record itself changes (e.g., failover to a different regional endpoint), clients may send to the old address for up to TTL seconds. The maximum observed failover time is: TTL + health check interval + connection timeout. For fast failover in ML serving: use TTL=5-10s for critical endpoints, run health checks every 5s with a 1s timeout, and configure connection timeouts of 1-2s, giving a maximum failover time of roughly 11-17 seconds.
