DNS, Service Discovery, and Consul
You deploy a new version of your LLM inference service. Ten pods in Kubernetes. The deployment rolls out. Within 30 seconds, most traffic has migrated to the new pods. But some clients - specifically, a batch processing pipeline running on a separate VM - are still sending requests to the old pods, which are now terminating. Requests are failing. The batch job was configured with a hardcoded IP address from last week.
Hardcoded IP addresses are the original sin of distributed systems. The moment you write 10.0.4.23 in a config file, you are betting that the service at that address will never move, never be replaced, never fail and be rescheduled elsewhere. In a Kubernetes cluster where pods come and go continuously, this bet fails within hours. In a multi-region ML platform where inference replicas are scaled up and down based on GPU availability and cost, it fails within minutes.
The solution - service discovery - is the mechanism by which services find each other without knowing each other's addresses in advance. Instead of an IP, you have a name: llm-inference.ml-serving.svc.cluster.local. That name resolves to whatever IP addresses are currently healthy and running the service. When a pod is replaced, the name continues to work. When the service scales from 10 to 50 replicas, the name automatically distributes load across all 50. When a replica fails its health check, the name stops returning that replica's IP, steering new lookups away before most clients ever send it a request.
DNS is the oldest and most universal form of service discovery - it is the protocol that translates names to addresses, and it has been doing this for the entire internet since 1983. But DNS was designed for relatively static infrastructure. Modern ML platforms need service discovery that can react to changes in seconds, not minutes, and that can encode rich health information beyond just "this IP exists." Consul, etcd, and Kubernetes' built-in DNS (powered by CoreDNS) are the tools that extend DNS with the dynamism and health-awareness that ML infrastructure requires.
This lesson builds a precise understanding of how DNS works at the protocol level, how Kubernetes uses it for service discovery, how Consul extends service discovery with health checking and multi-datacenter routing, and how ML serving systems use these tools to register model endpoints, perform A/B routing, and fail over gracefully. By the end, you will understand how a request for https://inference.production.svc.cluster.local/v1/predict finds the right pod, and what happens when that pod dies mid-request.
Why This Exists
In the early internet, every computer that needed to communicate with every other computer had a file called /etc/hosts that mapped hostnames to IP addresses. The entire internet's address book was distributed as a single text file called HOSTS.TXT, maintained at Stanford Research Institute, and downloaded periodically. This worked fine when there were 200 hosts on the internet. By 1982, the network had grown to thousands of hosts and the centralized file was updated twice a week - meaning addresses were wrong for days at a time, and the network traffic to distribute the file was becoming problematic.
Paul Mockapetris invented DNS (Domain Name System) in 1983, published as RFC 882 and RFC 883. The key insight was to distribute the address book: instead of one central file, each organization manages its own portion of the namespace (a "zone"), and a hierarchical system of servers distributes the lookup process. The design has scaled from thousands to hundreds of millions of domains without fundamental architectural changes.
Service discovery in microservices and distributed systems faces the same problem DNS solved - but at a different time scale. DNS was designed for infrastructure that changes on the scale of days (new servers added, old ones retired). Kubernetes pods change on the scale of seconds (pod crashes, rolling deployments, autoscaling). The modern service discovery stack extends DNS with lower TTLs, active health checking, and real-time catalog updates to match that pace.
Historical Context
DNS was standardized in RFC 1034 and RFC 1035 in 1987, refining the 1983 RFCs. These documents remain the authoritative specification today - a remarkable 37-year-old standard still in use unchanged at its core.
ZooKeeper, developed at Yahoo in 2007, was the first widely deployed distributed coordination service that enabled dynamic service discovery in large clusters. Hadoop clusters used ZooKeeper for leader election and service registry. The pattern spread to HBase, Kafka (still uses ZooKeeper, though migrating to KRaft), and early microservice deployments at Twitter and Netflix.
etcd was developed by CoreOS in 2013 as a simpler alternative to ZooKeeper, using Raft consensus instead of ZooKeeper's Zab protocol. Kubernetes adopted etcd as its primary data store for all cluster state, including service endpoint information.
Consul was released by HashiCorp in 2014, adding multi-datacenter support, integrated health checking, and a DNS interface on top of a distributed key-value store. It became the dominant service mesh and service discovery tool for non-Kubernetes infrastructure.
CoreDNS, adopted as Kubernetes' default DNS server in 1.13 (December 2018), replaced kube-dns and provides the *.svc.cluster.local DNS resolution that Kubernetes services rely on.
Core Concepts: The DNS Resolution Chain
When a process in a Kubernetes pod resolves llm-inference.ml-serving.svc.cluster.local, here is the sequence of events:
1. The pod's stub resolver reads /etc/resolv.conf, which points at the cluster DNS service - CoreDNS, typically at 10.96.0.10.
2. Because of the default ndots:5 setting, a name with fewer than five dots is first tried with the pod's search suffixes before being queried as an absolute name (more on this in the ndots warning below).
3. CoreDNS receives the UDP query on port 53 and checks its cache; on a miss, it answers from its in-memory view of Services and Endpoints, kept current by a watch on the Kubernetes API server.
4. CoreDNS returns an A record - the ClusterIP for a normal service, or the individual pod IPs for a headless service - with a short TTL (5 seconds by default).
5. The client connects to the returned IP; for a ClusterIP, kube-proxy's iptables/eBPF rules DNAT the connection to a healthy pod IP.
The sections below unpack each of these pieces: record types, TTLs, and the two kinds of Kubernetes services.
DNS Record Types for Service Discovery
Different record types serve different roles in ML infrastructure:
| Record | Purpose | ML Use Case |
|---|---|---|
| A | Maps hostname to IPv4 address | Pod IP for inference replicas |
| AAAA | Maps hostname to IPv6 address | IPv6-native clusters |
| CNAME | Alias: one name to another | inference-v2.example.com -> inference.example.com |
| SRV | Service location: name, port, priority, weight | gRPC service port discovery |
| TXT | Arbitrary text metadata | Model version, capability tags |
SRV records are particularly useful for ML serving because they encode both the hostname and port:
# SRV record format: priority weight port target
_grpc._tcp.triton.ml-serving.svc.cluster.local. 300 IN SRV 0 10 8001 triton-0.triton.ml-serving.svc.cluster.local.
_grpc._tcp.triton.ml-serving.svc.cluster.local. 300 IN SRV 0 10 8001 triton-1.triton.ml-serving.svc.cluster.local.
# Client can discover both host AND port from a single DNS lookup
# Weight field (10) enables weighted load balancing via DNS
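To see these fields from code, a minimal dnspython sketch (assuming the Triton records above exist in your cluster DNS):

```python
import dns.resolver  # pip install dnspython

answers = dns.resolver.resolve(
    "_grpc._tcp.triton.ml-serving.svc.cluster.local", "SRV"
)
for srv in answers:
    # Each SRV rdata carries priority, weight, port, and target hostname
    print(srv.priority, srv.weight, srv.port, str(srv.target))
```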
TTL and Caching
Every DNS record has a TTL (Time To Live) in seconds. The stub resolver on a client caches records for the TTL duration. For service discovery:
- TTL too high: clients cache stale addresses, traffic goes to dead pods after failover
- TTL too low: high DNS query load, CoreDNS becomes a bottleneck
In Kubernetes, CoreDNS serves service records with a default TTL of 5 seconds (tunable in the Corefile). The same default applies to headless services (used with StatefulSets): each pod's individual DNS record is served with a 5-second TTL.
For fast failover in ML serving: TTL=5s + health check=5s + timeout=1s = ~11 seconds maximum failover time.
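Where that TTL is set: CoreDNS's Corefile controls it via the kubernetes plugin's ttl option. A representative Corefile sketch (values illustrative, not a drop-in for every cluster):

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        ttl 5                  # TTL served for cluster records (5s is the default)
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30                   # CoreDNS's own response cache
}
```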
Kubernetes DNS Architecture
ClusterIP Services
A ClusterIP service gets a stable virtual IP address (the ClusterIP) that never changes, even as pods come and go. kube-proxy on each node programs iptables/eBPF rules to load-balance connections to that VIP across healthy pod IPs.
# ClusterIP service for ML inference
apiVersion: v1
kind: Service
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  selector:
    app: llm-inference
  type: ClusterIP
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
DNS name: llm-inference.ml-serving.svc.cluster.local
Resolves to: the stable ClusterIP (e.g., 10.96.14.23)
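A quick way to observe this from inside a pod is a standard-library lookup; the printed address is illustrative:

```python
import socket

# Resolves to the stable ClusterIP, not to individual pod IPs -
# kube-proxy fans the actual connection out to a healthy pod.
infos = socket.getaddrinfo(
    "llm-inference.ml-serving.svc.cluster.local", 8000,
    family=socket.AF_INET, type=socket.SOCK_STREAM,
)
for *_, sockaddr in infos:
    print(sockaddr)  # e.g. ('10.96.14.23', 8000)
```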
Headless Services for StatefulSets
A headless service (clusterIP: None) does not allocate a virtual IP. Instead, DNS returns the actual pod IPs directly. This is essential for stateful ML workloads (distributed training workers, parameter servers) where clients need to connect to specific pods, not arbitrary ones.
# Headless service for distributed training workers
apiVersion: v1
kind: Service
metadata:
  name: trainer
  namespace: ml-training
spec:
  selector:
    app: distributed-trainer
  clusterIP: None  # Headless - no VIP, return pod IPs directly
  ports:
    - port: 29500  # NCCL/gloo rendezvous port
DNS records for a StatefulSet with replicas: 4:
trainer-0.trainer.ml-training.svc.cluster.local -> 10.244.0.5
trainer-1.trainer.ml-training.svc.cluster.local -> 10.244.1.7
trainer-2.trainer.ml-training.svc.cluster.local -> 10.244.2.3
trainer-3.trainer.ml-training.svc.cluster.local -> 10.244.0.9
Each pod has a stable, predictable DNS name based on its ordinal. trainer-0 is always rank 0, so a launch with torchrun --rdzv_endpoint=trainer-0.trainer.ml-training.svc.cluster.local:29500 will always find the rendezvous server.
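A sketch of how a worker can exploit this naming, assuming the headless service above and that the pod hostname carries the ordinal (the env-var names follow torch.distributed conventions):

```python
import os
import socket

hostname = socket.gethostname()           # e.g. "trainer-2"
rank = int(hostname.rsplit("-", 1)[-1])   # StatefulSet ordinal -> rank

# Every worker points at the same stable rendezvous hostname;
# only rank 0 actually hosts the store behind it.
os.environ["RANK"] = str(rank)
os.environ["MASTER_ADDR"] = "trainer-0.trainer.ml-training.svc.cluster.local"
os.environ["MASTER_PORT"] = "29500"
```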
Code: Python DNS Resolution and Analysis
"""
dns_diagnostics.py - DNS resolution diagnostics for ML infrastructure.
Useful for debugging service discovery issues in distributed training and inference.
"""
import socket
import time
import dns.resolver # pip install dnspython
import dns.query
import dns.name
from typing import List, Dict, Optional
from dataclasses import dataclass
@dataclass
class DNSRecord:
name: str
record_type: str
value: str
ttl: int
def resolve_service(
hostname: str,
record_type: str = "A",
nameserver: Optional[str] = None,
) -> List[DNSRecord]:
"""
Resolve a DNS name and return structured records.
Works for both Kubernetes internal DNS and external DNS.
"""
resolver = dns.resolver.Resolver()
if nameserver:
resolver.nameservers = [nameserver]
records = []
try:
answers = resolver.resolve(hostname, record_type)
for rdata in answers:
records.append(DNSRecord(
name=hostname,
record_type=record_type,
value=str(rdata),
ttl=answers.ttl,
))
except dns.resolver.NXDOMAIN:
print(f"[dns] NXDOMAIN: {hostname} does not exist")
except dns.resolver.NoAnswer:
print(f"[dns] NoAnswer: {hostname} exists but has no {record_type} records")
except dns.resolver.Timeout:
print(f"[dns] Timeout resolving {hostname}")
return records
def measure_dns_latency(
hostname: str,
n_samples: int = 20,
nameserver: str = "10.96.0.10", # Kubernetes CoreDNS
) -> dict:
"""
Measure DNS resolution latency to detect CoreDNS performance issues.
High DNS latency (>10ms) can significantly impact ML inference throughput
when each request requires a DNS lookup.
"""
latencies = []
resolver = dns.resolver.Resolver()
resolver.nameservers = [nameserver]
resolver.cache = None # Disable caching for accurate measurement
for _ in range(n_samples):
start = time.perf_counter()
try:
resolver.resolve(hostname, "A")
except Exception:
pass
latencies.append((time.perf_counter() - start) * 1000)
import statistics
return {
"hostname": hostname,
"nameserver": nameserver,
"n": n_samples,
"mean_ms": round(statistics.mean(latencies), 2),
"median_ms": round(statistics.median(latencies), 2),
"p99_ms": round(sorted(latencies)[int(n_samples * 0.99)], 2),
"max_ms": round(max(latencies), 2),
}
def discover_kubernetes_endpoints(
service_name: str,
namespace: str,
cluster_domain: str = "cluster.local",
) -> dict:
"""
Discover all endpoints for a Kubernetes service via DNS.
Combines ClusterIP lookup (for load-balanced access) and
headless lookup (for direct pod access).
"""
base = f"{service_name}.{namespace}.svc.{cluster_domain}"
result = {
"service_fqdn": base,
"clusterip_records": [],
"pod_records": [], # From headless service (if available)
"srv_records": [],
}
# ClusterIP record
result["clusterip_records"] = resolve_service(base, "A")
# SRV records for port discovery
for proto in ["_http._tcp", "_grpc._tcp", "_metrics._tcp"]:
srv = resolve_service(f"{proto}.{base}", "SRV")
result["srv_records"].extend(srv)
# Try to get individual pod records (headless service)
# StatefulSet pods follow the pattern: pod-N.service.namespace.svc.cluster.local
pod_idx = 0
while pod_idx < 100: # Safety limit
pod_fqdn = f"{service_name}-{pod_idx}.{base}"
records = resolve_service(pod_fqdn, "A")
if not records:
break
result["pod_records"].extend(records)
pod_idx += 1
return result
def watch_dns_changes(
hostname: str,
interval_seconds: float = 2.0,
duration_seconds: float = 60.0,
):
"""
Watch a DNS record for changes over time.
Useful for monitoring rolling deployments or failover events.
"""
print(f"Watching DNS changes for {hostname} every {interval_seconds}s")
print(f"Duration: {duration_seconds}s\n")
previous_ips = set()
start_time = time.time()
while time.time() - start_time < duration_seconds:
records = resolve_service(hostname, "A")
current_ips = {r.value for r in records}
if current_ips != previous_ips:
timestamp = time.strftime("%H:%M:%S")
added = current_ips - previous_ips
removed = previous_ips - current_ips
if added:
print(f"[{timestamp}] NEW endpoints: {added}")
if removed:
print(f"[{timestamp}] REMOVED endpoints: {removed}")
previous_ips = current_ips
time.sleep(interval_seconds)
def simulate_dns_failover(
primary_hostname: str,
fallback_hostname: str,
timeout_ms: float = 100.0,
) -> str:
"""
DNS-based failover: resolve primary, fall back to secondary on failure.
Common pattern for ML serving multi-region deployments.
"""
# Try primary
try:
records = resolve_service(primary_hostname, "A")
if records:
# Test connectivity before committing
ip = records[0].value
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(timeout_ms / 1000)
result = sock.connect_ex((ip, 8000))
sock.close()
if result == 0:
return ip
except Exception as e:
print(f"[failover] Primary {primary_hostname} failed: {e}")
# Fall back to secondary
records = resolve_service(fallback_hostname, "A")
if records:
print(f"[failover] Using fallback: {fallback_hostname} -> {records[0].value}")
return records[0].value
raise RuntimeError(f"Both {primary_hostname} and {fallback_hostname} unavailable")
Consul Architecture
Consul is a service mesh and service discovery tool from HashiCorp. It provides three things that DNS alone cannot: active health checking, multi-datacenter routing, and a key-value store for distributed configuration.
Consul Components
Consul uses two gossip protocols:
- LAN gossip (Serf): Agents in the same datacenter discover each other and share node health
- WAN gossip: Consul servers across datacenters exchange reachability information
Consul Service Registration
"""
consul_service_registration.py - Register and discover ML services with Consul.
Install: pip install python-consul
Consul must be running: consul agent -dev (development mode)
"""
import consul
import time
import socket
import logging
from typing import Optional, List, Dict
from dataclasses import dataclass
@dataclass
class MLServiceEndpoint:
service_id: str
service_name: str
address: str
port: int
tags: List[str]
metadata: Dict[str, str]
class ConsulMLRegistry:
"""
Service registry for ML models using Consul.
Handles registration, health checking, and discovery.
"""
def __init__(self, host: str = "127.0.0.1", port: int = 8500):
self._client = consul.Consul(host=host, port=port)
self._hostname = socket.gethostname()
self._registered_services = []
def register_inference_service(
self,
model_name: str,
model_version: str,
serving_port: int = 8000,
metrics_port: int = 9090,
tags: Optional[List[str]] = None,
) -> str:
"""
Register an ML inference service with Consul.
Health check hits /health endpoint every 10 seconds.
"""
service_id = f"{model_name}-{model_version}-{self._hostname}"
if tags is None:
tags = []
self._client.agent.service.register(
name=f"ml-inference-{model_name}",
service_id=service_id,
address=socket.gethostbyname(self._hostname),
port=serving_port,
tags=[
f"version:{model_version}",
f"model:{model_name}",
] + tags,
# Key-value metadata (Consul 1.6.4+)
meta={
"model_name": model_name,
"model_version": model_version,
"framework": "pytorch",
"gpu_type": "a100",
},
check=consul.Check.http(
url=f"http://{self._hostname}:{serving_port}/health",
interval="10s",
timeout="5s",
deregister="60s", # Auto-deregister if unhealthy for 60s
),
)
# Also register metrics endpoint
self._client.agent.check.register(
name=f"{service_id}-metrics",
check=consul.Check.http(
url=f"http://{self._hostname}:{metrics_port}/metrics",
interval="30s",
),
)
self._registered_services.append(service_id)
logging.info(f"Registered {service_id} with Consul")
return service_id
def discover_healthy_endpoints(
self,
model_name: str,
version_filter: Optional[str] = None,
datacenter: Optional[str] = None,
) -> List[MLServiceEndpoint]:
"""
Discover all healthy instances of a model, optionally filtered by version.
Only returns services that passed their health check.
"""
index, services = self._client.health.service(
service=f"ml-inference-{model_name}",
passing=True, # Only healthy instances
dc=datacenter,
)
endpoints = []
for service_entry in services:
svc = service_entry["Service"]
meta = svc.get("Meta", {})
# Optional version filter
svc_version = meta.get("model_version", "unknown")
if version_filter and svc_version != version_filter:
continue
endpoints.append(MLServiceEndpoint(
service_id=svc["ID"],
service_name=svc["Service"],
address=svc["Address"],
port=svc["Port"],
tags=svc.get("Tags", []),
metadata=meta,
))
return endpoints
def watch_for_changes(
self,
model_name: str,
callback,
timeout: int = 30,
):
"""
Long-poll Consul for endpoint changes. Consul's blocking queries
only return when something changes, avoiding polling overhead.
callback(endpoints) is called whenever the service catalog changes.
"""
index = 0
while True:
try:
# Blocking query: waits up to `timeout` seconds for changes
index, services = self._client.health.service(
service=f"ml-inference-{model_name}",
passing=True,
index=index, # Consul returns immediately if > current index
wait=f"{timeout}s",
)
endpoints = [
MLServiceEndpoint(
service_id=s["Service"]["ID"],
service_name=s["Service"]["Service"],
address=s["Service"]["Address"],
port=s["Service"]["Port"],
tags=s["Service"].get("Tags", []),
metadata=s["Service"].get("Meta", {}),
)
for s in services
]
callback(endpoints)
except Exception as e:
logging.error(f"Consul watch error: {e}")
time.sleep(5)
def store_model_config(self, model_name: str, config: dict):
"""
Store model configuration in Consul KV store.
All serving instances can read this config and react to changes.
"""
import json
key = f"ml/models/{model_name}/config"
self._client.kv.put(key, json.dumps(config))
def get_model_config(self, model_name: str) -> Optional[dict]:
"""Read model configuration from Consul KV."""
import json
key = f"ml/models/{model_name}/config"
index, data = self._client.kv.get(key)
if data is None:
return None
return json.loads(data["Value"])
def deregister_all(self):
"""Deregister all services registered by this instance. Call on shutdown."""
for service_id in self._registered_services:
self._client.agent.service.deregister(service_id)
self._registered_services.clear()
etcd for Distributed Coordination
etcd is a strongly consistent distributed key-value store. Kubernetes uses it as its primary data store for all cluster state. Every Kubernetes object - pods, services, deployments, secrets - is stored in etcd. Service endpoint changes flow through etcd to CoreDNS.
etcd vs ZooKeeper vs Consul
| Feature | etcd | ZooKeeper | Consul |
|---|---|---|---|
| Consensus | Raft | Zab | Raft |
| API | gRPC/HTTP | Custom TCP | HTTP/DNS |
| Watch API | Yes | Yes | Yes (blocking queries) |
| Multi-datacenter | No (single cluster) | No (single cluster) | Yes (native) |
| Health checks | No (external) | No (ephemeral nodes) | Yes (integrated) |
| Service mesh | No | No | Yes |
| Primary use | Kubernetes state | Kafka/Hadoop | Service discovery |
"""
etcd_ml_coordination.py - Using etcd for distributed ML training coordination.
Install: pip install etcd3
Requires: etcd server running (etcd --listen-client-urls http://0.0.0.0:2379)
"""
import etcd3
import json
import time
import threading
from typing import Optional, Callable
class DistributedMLCoordinator:
"""
etcd-based coordinator for distributed ML training jobs.
Used for:
- Rank-to-host mapping (so workers find each other)
- Distributed barrier synchronization
- Leader election for parameter server
- Shared hyperparameter configuration
"""
def __init__(self, etcd_host: str = "localhost", etcd_port: int = 2379):
self._client = etcd3.client(host=etcd_host, port=etcd_port)
self._prefix = "/ml-training"
def register_worker(
self,
job_id: str,
rank: int,
address: str,
port: int,
ttl_seconds: int = 30,
):
"""
Register a training worker with its rank and address.
Uses TTL-based lease so dead workers are automatically cleaned up.
"""
key = f"{self._prefix}/{job_id}/workers/{rank}"
value = json.dumps({"address": address, "port": port, "rank": rank})
# Create lease - key auto-expires if not refreshed
lease = self._client.lease(ttl_seconds)
self._client.put(key, value, lease=lease)
# Start keepalive thread
def keepalive():
while True:
try:
lease.refresh()
time.sleep(ttl_seconds // 3)
except Exception:
break
t = threading.Thread(target=keepalive, daemon=True)
t.start()
return lease
def get_all_workers(self, job_id: str) -> dict:
"""Get address/port for all registered workers in a job."""
prefix = f"{self._prefix}/{job_id}/workers/"
workers = {}
for value, metadata in self._client.get_prefix(prefix):
key = metadata.key.decode()
rank = int(key.split("/")[-1])
worker_info = json.loads(value.decode())
workers[rank] = worker_info
return workers
def wait_for_all_workers(
self,
job_id: str,
expected_count: int,
timeout_seconds: int = 300,
) -> dict:
"""
Block until all `expected_count` workers have registered.
This is the distributed rendezvous - equivalent to torch.distributed's
init_process_group barrier, but implemented over etcd.
"""
deadline = time.time() + timeout_seconds
while time.time() < deadline:
workers = self.get_all_workers(job_id)
if len(workers) == expected_count:
return workers
remaining = deadline - time.time()
print(f"[etcd] Waiting for workers: {len(workers)}/{expected_count}, "
f"{remaining:.0f}s remaining")
time.sleep(2)
raise TimeoutError(
f"Only {len(self.get_all_workers(job_id))}/{expected_count} workers "
f"registered within {timeout_seconds}s"
)
def distributed_barrier(
self,
job_id: str,
barrier_name: str,
rank: int,
world_size: int,
timeout_seconds: int = 60,
):
"""
etcd-based distributed barrier. All ranks must reach this function
before any rank returns. Simpler (slower) than NCCL barrier but
works without GPU communication - useful for checkpoint synchronization.
"""
barrier_key = f"{self._prefix}/{job_id}/barriers/{barrier_name}/{rank}"
self._client.put(barrier_key, "ready")
deadline = time.time() + timeout_seconds
while time.time() < deadline:
count = sum(1 for _ in self._client.get_prefix(
f"{self._prefix}/{job_id}/barriers/{barrier_name}/"
))
if count >= world_size:
return # All ranks reached barrier
time.sleep(0.5)
# Clean up partial barrier
self._client.delete_prefix(
f"{self._prefix}/{job_id}/barriers/{barrier_name}/"
)
raise TimeoutError(f"Barrier {barrier_name} timed out")
def elect_leader(self, job_id: str, candidate_id: str) -> bool:
"""
Simple leader election: first rank to write wins.
Returns True if this candidate became leader, False otherwise.
Used for: selecting which rank saves checkpoints, which aggregates metrics.
"""
leader_key = f"{self._prefix}/{job_id}/leader"
# Conditional put: only write if key doesn't exist
success, _ = self._client.transaction(
compare=[
etcd3.transactions.Version(leader_key) == 0 # Key does not exist
],
success=[
etcd3.transactions.Put(leader_key, candidate_id)
],
failure=[],
)
return success
def watch_config(
self,
job_id: str,
config_key: str,
on_change: Callable[[dict], None],
):
"""
Watch for real-time config changes via etcd watch.
Enables live hyperparameter updates during training.
"""
full_key = f"{self._prefix}/{job_id}/config/{config_key}"
def watch_callback(events):
for event in events:
if isinstance(event, etcd3.events.PutEvent):
config = json.loads(event.value.decode())
on_change(config)
self._client.add_watch_callback(full_key, watch_callback)
Service Discovery Patterns: Client-Side vs Server-Side
Client-side discovery gives the client full control over load balancing strategy (round-robin, least-connections, A/B routing by model version). It requires the client to include service registry logic. Used in: Consul with custom ML routing (route 10% to model v2, 90% to v1).
Server-side discovery is simpler for clients - they just talk to a VIP. The load balancing logic lives in the infrastructure. Used in: Kubernetes ClusterIP services, AWS Application Load Balancer. Most ML serving deployments use server-side discovery.
DNS-Based A/B Routing for ML Models
"""
dns_ab_routing.py - DNS-based A/B model routing using weighted round-robin.
Implements gradual rollout from model-v1 to model-v2 via DNS record weights.
"""
import dns.resolver
import random
from typing import Tuple, Optional
def weighted_endpoint_selection(
model_name: str,
consul_host: str = "127.0.0.1",
consul_dns_port: int = 8600,
) -> Tuple[str, int]:
"""
Use Consul DNS to do weighted service selection.
Consul returns SRV records with weights for A/B routing.
To set up: Register two services with different tags:
- ml-inference-llama3 (tag: stable) -> weight 90
- ml-inference-llama3-v2 (tag: canary) -> weight 10
This implements 90%/10% traffic split at the DNS level.
"""
resolver = dns.resolver.Resolver()
resolver.nameservers = [consul_host]
resolver.port = consul_dns_port
try:
# Consul DNS returns SRV records with weights
answers = resolver.resolve(
f"{model_name}.service.consul",
"SRV"
)
# Weighted random selection
total_weight = sum(a.weight for a in answers)
rand = random.uniform(0, total_weight)
cumulative = 0
for answer in answers:
cumulative += answer.weight
if rand <= cumulative:
# Resolve the target hostname to IP
ip_answers = resolver.resolve(str(answer.target), "A")
ip = str(ip_answers[0])
return ip, answer.port
except Exception as e:
raise RuntimeError(f"Service discovery failed for {model_name}: {e}")
class ABModelRouter:
"""
Route ML inference requests between model versions using DNS-based discovery.
Supports gradual rollout: start at 5%, increase to 100% over time.
"""
def __init__(self, stable_name: str, canary_name: str):
self.stable_name = stable_name
self.canary_name = canary_name
self._canary_pct = 0.0
def set_canary_percentage(self, pct: float):
"""Set what percentage of traffic goes to canary model (0.0-1.0)."""
self._canary_pct = max(0.0, min(1.0, pct))
print(f"[ab-router] Canary traffic: {self._canary_pct*100:.0f}%")
def select_endpoint(self) -> Tuple[str, str]:
"""
Select an endpoint based on canary percentage.
Returns (endpoint_address, model_version) tuple.
"""
if random.random() < self._canary_pct:
ip, port = weighted_endpoint_selection(self.canary_name)
return f"http://{ip}:{port}", "canary"
else:
ip, port = weighted_endpoint_selection(self.stable_name)
return f"http://{ip}:{port}", "stable"
Kubernetes Headless Service for StatefulSet ML Workers
# statefulset-ml-workers.yaml
# Headless service + StatefulSet for distributed training workers
# Each worker gets a stable DNS name: worker-N.ml-workers.training.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: ml-workers
  namespace: training
  labels:
    app: distributed-trainer
spec:
  clusterIP: None  # Headless - DNS returns pod IPs directly
  selector:
    app: distributed-trainer
  ports:
    - name: rdzvport
      port: 29500  # torch.distributed rendezvous
    - name: nccl
      port: 29501  # NCCL communication port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker
  namespace: training
spec:
  serviceName: "ml-workers"  # Must match headless service name
  replicas: 8
  selector:
    matchLabels:
      app: distributed-trainer
  template:
    metadata:
      labels:
        app: distributed-trainer
    spec:
      containers:
        - name: trainer
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
          command:
            - torchrun
            - --nproc_per_node=1
            - --nnodes=8
            - --node_rank=$(WORKER_RANK)
            # DNS-based rendezvous: worker-0 is always rank 0's hostname
            - --rdzv_endpoint=worker-0.ml-workers.training.svc.cluster.local:29500
            - --rdzv_backend=c10d
            - /workspace/train.py
          env:
            - name: WORKER_RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
          resources:
            limits:
              nvidia.com/gpu: "1"
Production Engineering Notes
DNS Caching in Python Applications
Python's socket.getaddrinfo() does not cache DNS results by default. Every connection attempt triggers a fresh DNS lookup, adding 1-5ms per request in a Kubernetes cluster. For high-throughput ML inference clients, implement a DNS cache:
import socket
import time
from threading import Lock


class DNSCache:
    """Simple TTL-based DNS cache to avoid per-request lookups."""

    def __init__(self, ttl_seconds: int = 10):
        self._cache = {}  # hostname -> (ip, expires_at)
        self._lock = Lock()
        self._ttl = ttl_seconds

    def resolve(self, hostname: str) -> str:
        with self._lock:
            if hostname in self._cache:
                ip, expires_at = self._cache[hostname]
                if time.time() < expires_at:
                    return ip
        # Cache miss - do actual DNS lookup
        ip = socket.gethostbyname(hostname)
        with self._lock:
            self._cache[hostname] = (ip, time.time() + self._ttl)
        return ip
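Usage is a one-line substitution at connection time (sketch):

```python
cache = DNSCache(ttl_seconds=10)
ip = cache.resolve("llm-inference.ml-serving.svc.cluster.local")
# Connect to `ip` directly; re-resolution happens at most once per 10s window.
```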
DNS TTL Tuning for ML Failover
For fast failover in ML serving, you need short DNS TTLs but not so short they overload CoreDNS:
- Normal operation: TTL=30s is fine (reduces DNS load)
- During deployment/migration: Temporarily reduce to TTL=5s before the change, then restore
- For critical inference APIs: Use Kubernetes readiness probes aggressively (every 5s) to remove failed pods from DNS before clients see failures
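A representative probe block for an inference container (path and thresholds illustrative, assuming the server exposes /health):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5      # probe every 5s
  timeoutSeconds: 1
  failureThreshold: 2   # pod leaves Endpoints (and DNS) after ~10s of failures
```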
:::danger CoreDNS as a Single Point of Failure
In a default Kubernetes installation, CoreDNS runs as 2 replicas. If both CoreDNS pods are on the same node and that node fails, all DNS resolution in the cluster stops. Every ML service call that requires a DNS lookup will fail, including internal service-to-service calls.
Mitigations:
- Run 3+ CoreDNS replicas with podAntiAffinity to spread them across nodes
- Use ndots:2 in pod DNS config to reduce unnecessary DNS search-path lookups
- Cache DNS responses in your ML application (TTL-aware cache)
- For critical paths, use ClusterIP addresses directly (hardcode the stable VIP, not the DNS name) as a fallback
:::
:::warning DNS Search Domain Amplification
Kubernetes pods have a default ndots:5 DNS search domain configuration. This means that for any hostname with fewer than 5 dots, the resolver tries multiple search domains before attempting a direct lookup:
Query for "redis":
1. redis.ml-training.svc.cluster.local (try)
2. redis.svc.cluster.local (try)
3. redis.cluster.local (try)
4. redis. (try as absolute name)
This multiplies DNS queries by 3-4x for short hostnames. For high-throughput ML serving, use fully-qualified domain names (with trailing dot) in your service configs, or set ndots:2:
# Pod spec
dnsConfig:
  options:
    - name: ndots
      value: "2"
:::
Interview Q&A
Q1: Walk through what happens at the network level when a Kubernetes pod calls socket.connect("llm-inference.ml-serving.svc.cluster.local", 8000).
The pod's stub resolver reads /etc/resolv.conf, which points to CoreDNS at 10.96.0.10. The name has four dots, so with the default ndots:5 it does not qualify as absolute: the resolver first tries the search suffixes from resolv.conf (queries like llm-inference.ml-serving.svc.cluster.local.ml-serving.svc.cluster.local, which return NXDOMAIN) before querying the name as-is - one reason to use trailing dots in configs. CoreDNS receives the UDP DNS query on port 53 and checks its cache; on a miss, it looks up the service name in its Kubernetes data store (backed by an API server watch, not a direct etcd query). CoreDNS returns an A record with the ClusterIP (e.g., 10.96.14.23) and a short TTL (5 seconds by default). The pod's TCP SYN goes to 10.96.14.23:8000. kube-proxy's iptables/eBPF rules on the local node intercept this and DNAT it to one of the healthy pod IPs (e.g., 10.244.3.7:8000). The connection reaches the actual pod.
Q2: What is a Kubernetes headless service, when would you use one for ML, and what DNS records does it create?
A headless service (spec.clusterIP: None) does not allocate a virtual IP. Instead, when clients query the DNS name, they get back the actual pod IPs directly (DNS A records for each pod). For a headless service named "trainer" in namespace "training" with 4 pods, DNS returns all 4 pod IPs for a query to trainer.training.svc.cluster.local. Additionally, StatefulSet pods get individual DNS names: trainer-0.trainer.training.svc.cluster.local through trainer-3.trainer.training.svc.cluster.local. Use headless services for distributed training workers (each worker needs a stable, specific hostname for rendezvous), parameter servers (each shard is a specific pod), and any stateful ML workload where you need to address specific instances.
Q3: Explain the difference between client-side and server-side service discovery, and which pattern is better for A/B testing ML model versions.
Client-side discovery: the client queries the service registry directly, gets a list of healthy endpoints, and makes its own load balancing decision. Consul with a custom client library is an example. Server-side discovery: the client calls a stable address (DNS name or VIP), and the infrastructure (kube-proxy, load balancer) routes to a healthy backend. Kubernetes ClusterIP is an example. For A/B testing ML models, client-side discovery is more powerful: the client can inspect endpoint metadata (model version tag) and implement weighted selection (10% to v2, 90% to v1) with precise control. Server-side discovery requires the load balancer to support weighted routing (Envoy's weighted clusters, AWS ALB weighted target groups) - possible but less flexible. The tradeoff: client-side puts routing logic in application code (more control, more complexity); server-side keeps it in infrastructure (less control, less complexity).
Q4: How does etcd's Raft consensus work, and why does Kubernetes use it instead of ZooKeeper?
Raft elects a leader among a cluster of servers. Only the leader accepts write requests; it replicates each write to a majority of followers before acknowledging success. If the leader fails, a new election chooses a replacement within a few seconds. Kubernetes chose etcd over ZooKeeper primarily for operational simplicity: etcd exposes gRPC/HTTP APIs (easy to debug, a natural fit for the Go ecosystem), ships as a single static binary that is simple to deploy, and came out of CoreOS's container-infrastructure work, making it a ready match when Kubernetes launched in 2014. ZooKeeper uses a custom binary protocol and was designed for the Java ecosystem (Hadoop, Kafka). Raft was also explicitly designed to be easier to reason about than ZooKeeper's Zab protocol.
Q5: A distributed training job uses the hostname trainer-0.trainer.ml-training.svc.cluster.local as the rendezvous endpoint. The trainer-0 pod is rescheduled to a different node with a new IP. Does the training job break?
No - this is the key advantage of headless services with StatefulSets. When trainer-0 is rescheduled, Kubernetes updates the DNS record for trainer-0.trainer.ml-training.svc.cluster.local to point to the new pod IP. The DNS TTL (typically 5-10 seconds) means workers that cached the old IP will get the new IP within 5-10 seconds. If the rendezvous timeout is longer than the TTL (which it should be), the reconnection succeeds automatically. The ordinal suffix (-0) ensures the same logical rank always has the same hostname, regardless of which physical node it runs on. This is why StatefulSets with headless services are the recommended topology for distributed training on Kubernetes.
Q6: What is DNS TTL and how does it affect failover time for ML inference services?
TTL (Time To Live) is the number of seconds a DNS client should cache a record before re-querying. For a Kubernetes service with TTL=30s: if a pod fails, kube-proxy removes it from the load-balancing pool almost immediately. Clients that cached the ClusterIP for 30 seconds keep sending to the VIP, which kube-proxy correctly routes away from the failed pod. For DNS-based routing (multiple services, or external DNS), if the DNS record itself changes (e.g., failover to a different regional endpoint), clients may send to the old address for up to TTL seconds. The maximum observed failover time is: TTL + health check interval + connection timeout. For fast failover in ML serving: use TTL=5-10s for critical endpoints, run health checks every 5s with a 1s timeout, and configure connection timeouts of 1-2s, giving a maximum failover time of roughly 11-17 seconds.
