What is data serialization?

Why serialization format is an architectural decision - JSON vs Protocol Buffers vs Avro, schema evolution strategies, and how Confluent Schema Registry prevents breaking production pipelines.

How does protocol buffers work in practice?

Data Serialization and Schemas covers data serialization, protocol buffers, apache avro from first principles with code examples. Free lesson at https://engineersofai.com/docs/data-engineering/foundations/data-serialization-and-schemas

What is the difference between data serialization and apache avro?

See the full breakdown at https://engineersofai.com/docs/data-engineering/foundations/data-serialization-and-schemas

:::tip 🎮 Interactive Playground Visualize this concept: Try the Storage Formats Compared demo on the EngineersOfAI Playground - no code required. :::

Data Serialization and Schemas

Reading time: 35–40 min | Relevance: High for data engineers, ML engineers, platform engineers | Target roles: Data Engineer, ML Engineer, Backend Engineer, AI Platform Engineer

The Incident at 2 AM

It's 2:17 AM and your phone is screaming. You're on the on-call rotation for Spotify's recommendation pipeline - 400 million users, 80 million tracks, recommendations refreshed every 30 minutes. The Kafka consumer processing user listening events has fallen behind. The lag is climbing: 10,000 messages, 100,000, now 500,000. At this rate, recommendations will be stale within the hour and A/B tests will be invalid.

You SSH in, check the logs, and see it immediately: avro.io.AvroTypeException: The datum {"user_id": 19284710, "track_id": "spotify:track:4u7EnebtmKWzUH433cf5Qv", "country_code": "US", "platform": "mobile", "duration_ms": 187000, "skip_at_ms": null} is not an example of schema {"type": "record", "name": "ListenEvent", ...}. Someone on the events team added a platform field to the ListenEvent schema. The producer is publishing events with the new schema. Your consumer has the old schema compiled in. Every message is being rejected and the consumer is crashing in a retry loop.

You look at the Schema Registry dashboard. There it is: schema version 7 was registered three hours ago for the listen-events-value subject. Compatibility was set to BACKWARD - which means new schemas must be readable by old consumers. But BACKWARD compatibility requires new fields to have defaults. The platform field was added without a default value. The Schema Registry should have blocked this registration. You check the team's CI config: they disabled schema compatibility checks two weeks ago to "move faster." Now 500,000 messages are piling up.

This is the serialization and schema problem in its sharpest form. It is not an academic concern about bytes and formats. It is an operational reality that determines whether your data pipeline stays up when a field gets added or renamed, whether your ML training jobs process valid data, and whether your streaming consumers can be deployed independently from producers. The choice of serialization format - and the discipline around schema management - is one of the most consequential architectural decisions in data engineering.

By the time you finish this lesson, you will understand exactly why this incident happened, how schema registries are supposed to prevent it, and how to design systems where schema evolution never causes 2 AM pages.

Why This Exists

In the beginning, there was text. Systems communicated by writing fields to flat files, comma-separated, one record per line. Parsing was manual: split(","), cast to the right type, pray that nothing contained a comma. When systems needed to talk over networks, they sent text. HTTP APIs returned HTML. Later, they returned XML - which at least had structure, though parsing XML required a PhD in ElementTree and two hours of debugging namespace issues.

XML failed at scale. It was verbose (opening and closing tags doubled the payload size), hard to parse, and had no native type system. JSON replaced it in the mid-2000s and became the universal language of web APIs. JSON is human-readable, easy to generate and parse in every language, and maps directly to the data structures engineers already think in. For a long time, JSON was sufficient.

But JSON has a fundamental flaw: it has no schema. A JSON payload is just a bag of key-value pairs. There is nothing stopping a producer from sending {"user_id": "abc123"} when consumers expect {"user_id": 19284710}. There is nothing stopping a producer from renaming track_id to trackId without warning consumers. There is nothing to say which fields are required vs optional, or what their types are. Every consumer has to do defensive programming - checking for missing fields, handling type coercions, dealing with nulls where they did not expect nulls.

At Spotify scale - billions of events per day - the cost of JSON's verbosity and type-unsafety became critical. Each field name is repeated in every record ("user_id", "track_id", etc.). For 1 billion events, you are storing the string "user_id" 1 billion times. Binary formats like Protocol Buffers and Avro solve this: they encode the schema once (or store it separately) and use compact binary representations for the data itself. The result is payloads 5–10x smaller, serialization 5–10x faster, and type safety enforced at compile time or read time.

The Kafka incident above happened because the team chose Avro (correctly) but then bypassed the schema registry's safety checks (incorrectly). Schema management is the discipline that makes schema evolution - adding fields, removing fields, changing types - safe and predictable.

Historical Context

JSON emerged from Douglas Crockford's work in 2001, formalized in RFC 4627 in 2006. It won the format wars against XML because it was simpler and mapped naturally to JavaScript object literals. By 2010, JSON was the default for REST APIs everywhere.

Protocol Buffers (protobuf) was developed internally at Google starting around 2001, open-sourced in 2008. Google needed a way to serialize structured data for RPC communication between services. The key insight was: if you have a schema, you do not need to include field names in the serialized data. Instead, use small integer field numbers. Field 1 is user_id, field 2 is track_id. The schema is shared between producer and consumer, and the wire format only contains field numbers and values. This made protobuf payloads dramatically smaller than XML or JSON.

Apache Avro was created by Doug Cutting (creator of Hadoop) in 2009 as part of the Hadoop ecosystem. Avro's key insight was different from protobuf: rather than using field numbers, store the full schema alongside the data. This makes Avro self-describing - you can decode any Avro file if you have the schema embedded in it. For Kafka specifically, Avro became dominant because Jay Kreps and the Confluent team built the Schema Registry around it, solving the schema-distribution problem that protobuf leaves to you.

Apache Thrift was developed at Facebook in 2007, open-sourced in 2008. It predates protobuf's open-source release by months and took a similar approach: IDL-defined schemas, binary serialization, code generation. Thrift is still used internally at Facebook/Meta and in parts of the Hadoop ecosystem (HBase uses Thrift for its REST gateway). It has fallen behind protobuf in ecosystem breadth.

MessagePack emerged around 2010 as "JSON but binary" - same data model as JSON (no schema required), but encoded in a compact binary format. It is popular in gaming backends and high-frequency messaging systems where JSON semantics are needed but JSON's verbosity is a problem.

Apache Parquet and ORC are columnar storage formats designed for analytical workloads - reading a subset of columns without scanning entire rows. They are not wire formats for streaming but rather file formats for data at rest. They integrate deeply with Avro and protobuf schemas.

Core Concepts

What Serialization Actually Does

Serialization is the process of converting an in-memory data structure into a sequence of bytes that can be stored or transmitted. Deserialization is the reverse. Every time data crosses a boundary - network, disk, process - it must be serialized.

The choice of serialization format determines:

Wire size: How many bytes does a record take? Smaller = cheaper storage, faster network transfer
Serialization speed: How fast can you encode/decode? Slow serialization becomes a bottleneck in high-throughput pipelines
Type safety: Are types enforced at write time, read time, or not at all?
Schema evolution: Can you safely add or remove fields without breaking existing producers/consumers?
Human readability: Can you inspect a record without special tools?
Language support: Does the format have mature libraries in your tech stack?

JSON: The Ubiquitous Default

JSON encodes data as text. A record looks like:

{
  "user_id": 19284710,
  "track_id": "spotify:track:4u7EnebtmKWzUH433cf5Qv",
  "duration_ms": 187000,
  "skip_at_ms": null,
  "timestamp": "2024-01-15T02:17:00Z"
}

The entire field name is included in every record. For 1 billion such records, you are storing the string "user_id" 1 billion times. JSON also has a limited type system: strings, numbers, booleans, null, arrays, objects. There is no distinction between int32 and int64, no native datetime type, no bytes type. This creates ambiguity - is "2024-01-15T02:17:00Z" a timestamp or just a string?

When JSON is appropriate:

REST API responses consumed by browsers or mobile clients
Config files that humans need to read and edit
Debugging and logging (human-readable is valuable)
Low-volume messaging where developer ergonomics matter more than performance
Systems with no shared schema discipline (external APIs, third-party integrations)

When JSON is not appropriate:

High-throughput Kafka topics (millions of events/second)
Large-scale batch processing (billions of records)
Any system where schema drift would cause silent data corruption
ML training pipelines where type correctness is critical

Protocol Buffers: Binary with Strong Types

Protobuf uses an Interface Definition Language (IDL) to define schemas, then generates code from those schemas:

syntax = "proto3";

package spotify.events;

message ListenEvent {
  int64 user_id = 1;
  string track_id = 2;
  int32 duration_ms = 3;
  int32 skip_at_ms = 4;  // 0 means no skip
  int64 timestamp_ms = 5;
  string country_code = 6;
}

The numbers after each field (= 1, = 2, etc.) are field numbers - the key to protobuf's wire format and schema evolution. The wire format does not include field names at all. Instead, each field is encoded as a (field_number, wire_type, value) tuple.

Wire types in protobuf:

Wire Type	ID	Used For
Varint	0	int32, int64, bool, enum
64-bit	1	fixed64, double
Length-delimited	2	string, bytes, embedded messages
32-bit	5	fixed32, float

A varint-encoded user_id = 19284710 takes 4 bytes. The same value as a JSON string "user_id": 19284710 takes 21 bytes. For a 10-field record with mostly integer and string data, protobuf is typically 3–5x smaller than JSON.

Schema evolution with protobuf:

Adding a new field: Always safe. Old consumers that don't know field 7 will ignore it. New consumers that expect field 7 will get the zero value if it is absent.
Removing a field: Safe if you reserve the field number. reserved 4; prevents future use of field number 4, avoiding conflicts with old data.
Renaming a field: Safe - names are not in the wire format. Old data is still readable.
Changing a field type: Dangerous. Changing int32 to int64 is wire-compatible (both use varint) but other type changes are not.

// Adding a new field safely
message ListenEvent {
  int64 user_id = 1;
  string track_id = 2;
  int32 duration_ms = 3;
  reserved 4;  // was skip_at_ms, removed in v2
  int64 timestamp_ms = 5;
  string country_code = 6;
  string platform = 7;  // new field, safe to add
}

Apache Avro: Schema-in-Band

Avro takes a different approach. Instead of field numbers, Avro uses the full schema to encode and decode data. The schema is written in JSON:

{
  "type": "record",
  "name": "ListenEvent",
  "namespace": "com.spotify.events",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "track_id", "type": "string"},
    {"name": "duration_ms", "type": "int"},
    {"name": "skip_at_ms", "type": ["null", "int"], "default": null},
    {"name": "timestamp_ms", "type": "long"},
    {"name": "country_code", "type": "string"},
    {"name": "platform", "type": "string", "default": "unknown"}
  ]
}

The ["null", "int"] union type is how Avro handles nullable fields - the type is a union of null and int. The "default": null is required for nullable fields (the default must match the first type in the union).

In Avro, the wire format encodes field values in schema order. There are no field numbers or field names in the binary payload - just values, back to back, in the order defined by the schema. This means:

Both producer and consumer must have the schema to encode/decode
Avro files typically embed the schema in the file header
For streaming (Kafka), schemas are distributed via a Schema Registry

Avro schema evolution rules:

Adding fields is safe only if the new field has a default value. Removing fields is safe only if the removed field had a default value (so old consumers can still use it). Changing field types is generally not safe. Renaming fields can be done safely using aliases.

{
  "name": "user_identifier",
  "type": "long",
  "aliases": ["user_id"]
}

The alias tells Avro: when reading data written with schema version N (which used user_id), map it to the new field name user_identifier.

The Confluent Schema Registry

The Schema Registry is a service that stores and manages Avro (and protobuf and JSON Schema) schemas. Every schema has a subject (typically {topic-name}-value or {topic-name}-key) and a version. Producers register schemas before publishing; consumers fetch schemas by ID to decode messages.

The Kafka message format with Schema Registry uses a 5-byte prefix on every message:

1 magic byte (0x00)
4 bytes: schema ID (integer)

The consumer reads the schema ID, fetches the schema from the registry (once, then caches it), and uses that schema to decode the payload.

Compatibility modes control what schema changes are allowed:

Mode	Rule	Use Case
`BACKWARD`	New schema must be readable by old consumer	Rolling consumer deployments
`FORWARD`	Old schema must be readable by new consumer	Rolling producer deployments
`FULL`	Both BACKWARD and FORWARD	Strict schema governance
`NONE`	No compatibility checking	Development/testing only

BACKWARD compatibility (most common): If you deploy a new producer before updating consumers, existing consumers must still be able to read new messages. This means:

New fields must have defaults (so old consumers can skip them)
Required fields cannot be removed (old consumers expect them)

FORWARD compatibility: If you deploy new consumers before updating producers, new consumers must still be able to read old messages. This means:

New consumers can handle missing new fields
Old producers' fields cannot be removed from the schema

FULL compatibility is the gold standard: you can deploy producers and consumers independently in either order.

Code: Protocol Buffers in Python

Defining and Using a Protobuf Schema

First, install the protobuf compiler and Python library:

pip install grpcio-tools protobuf

Define the schema in listen_event.proto:

syntax = "proto3";

package spotify.events;

message ListenEvent {
  int64 user_id = 1;
  string track_id = 2;
  int32 duration_ms = 3;
  int32 skip_at_ms = 4;
  int64 timestamp_ms = 5;
  string country_code = 6;
  string platform = 7;
}

message ListenEventBatch {
  repeated ListenEvent events = 1;
}

Generate Python code:

python -m grpc_tools.protoc -I. --python_out=. listen_event.proto

This generates listen_event_pb2.py. Now serialize and deserialize:

import time
from listen_event_pb2 import ListenEvent, ListenEventBatch

# Serialize a single event
event = ListenEvent(
    user_id=19284710,
    track_id="spotify:track:4u7EnebtmKWzUH433cf5Qv",
    duration_ms=187000,
    skip_at_ms=0,  # 0 means no skip in proto3 (no null)
    timestamp_ms=int(time.time() * 1000),
    country_code="US",
    platform="mobile"
)

# Serialize to bytes
payload = event.SerializeToString()
print(f"Protobuf size: {len(payload)} bytes")

# Deserialize from bytes
decoded = ListenEvent()
decoded.ParseFromString(payload)
print(f"user_id: {decoded.user_id}")
print(f"track_id: {decoded.track_id}")
print(f"platform: {decoded.platform}")

# Batch serialization - much more efficient than one-by-one
batch = ListenEventBatch()
for i in range(1000):
    e = batch.events.add()
    e.user_id = 19284710 + i
    e.track_id = f"spotify:track:{i:040d}"
    e.duration_ms = 180000
    e.timestamp_ms = int(time.time() * 1000)
    e.country_code = "US"
    e.platform = "mobile"

batch_payload = batch.SerializeToString()
print(f"Batch of 1000 events: {len(batch_payload):,} bytes")
print(f"Per-event average: {len(batch_payload) / 1000:.1f} bytes")

Handling Schema Evolution Safely

# Simulating backward compatibility: deserialize v1 data with v2 schema
# v1 schema had no 'platform' field
# v2 schema adds 'platform' with default ""

# This is the v1 payload (serialized without platform field)
v1_event = ListenEvent(
    user_id=19284710,
    track_id="spotify:track:abc",
    duration_ms=120000,
    timestamp_ms=1705276620000,
    country_code="DE"
    # No platform field
)
v1_bytes = v1_event.SerializeToString()

# Deserialize v1 bytes with v2 schema - platform will be "" (default)
decoded_with_v2 = ListenEvent()
decoded_with_v2.ParseFromString(v1_bytes)
print(f"platform from v1 data: '{decoded_with_v2.platform}'")  # ""
# Proto3 default for string is ""

# Safe field removal: always reserve the field number
# In the .proto file, add: reserved 4;
# This prevents future fields from reusing field number 4
# and causing silent data corruption when old data is read

Code: Avro with Confluent Schema Registry

from confluent_kafka import Producer, Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer, AvroDeserializer
from confluent_kafka.serialization import StringSerializer, SerializationContext, MessageField
import json

# Schema Registry configuration
schema_registry_conf = {'url': 'http://localhost:8081'}
schema_registry_client = SchemaRegistryClient(schema_registry_conf)

# Avro schema definition
listen_event_schema_str = json.dumps({
    "type": "record",
    "name": "ListenEvent",
    "namespace": "com.spotify.events",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "track_id", "type": "string"},
        {"name": "duration_ms", "type": "int"},
        {
            "name": "skip_at_ms",
            "type": ["null", "int"],
            "default": None  # Python None = Avro null
        },
        {"name": "timestamp_ms", "type": "long"},
        {"name": "country_code", "type": "string"},
        {
            "name": "platform",
            "type": "string",
            "default": "unknown"  # Safe addition: has a default
        }
    ]
})

# Create serializer - this registers the schema if not already registered
# and checks compatibility before registering
avro_serializer = AvroSerializer(
    schema_registry_client,
    listen_event_schema_str,
    lambda event, ctx: event  # dict passthrough
)

avro_deserializer = AvroDeserializer(
    schema_registry_client,
    listen_event_schema_str,
    lambda event, ctx: event
)

# --- PRODUCER ---
producer_conf = {
    'bootstrap.servers': 'localhost:9092',
    'key.serializer': StringSerializer('utf_8'),
    'value.serializer': avro_serializer
}

from confluent_kafka.serialization import SerializingProducer
producer = SerializingProducer(producer_conf)

event = {
    "user_id": 19284710,
    "track_id": "spotify:track:4u7EnebtmKWzUH433cf5Qv",
    "duration_ms": 187000,
    "skip_at_ms": None,
    "timestamp_ms": 1705276620000,
    "country_code": "US",
    "platform": "mobile"
}

producer.produce(
    topic='listen-events',
    key=str(event['user_id']),
    value=event,
    on_delivery=lambda err, msg: print(f"Delivered: {msg.partition()}:{msg.offset()}")
)
producer.flush()

# --- CONSUMER ---
consumer_conf = {
    'bootstrap.servers': 'localhost:9092',
    'key.deserializer': StringDeserializer('utf_8'),
    'value.deserializer': avro_deserializer,
    'group.id': 'recommendation-processor',
    'auto.offset.reset': 'earliest'
}

from confluent_kafka.serialization import DeserializingConsumer
consumer = DeserializingConsumer(consumer_conf)
consumer.subscribe(['listen-events'])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue

    decoded_event = msg.value()
    print(f"user_id={decoded_event['user_id']}, "
          f"track={decoded_event['track_id']}, "
          f"platform={decoded_event['platform']}")

consumer.close()

Checking Schema Compatibility Programmatically

from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

client = SchemaRegistryClient({'url': 'http://localhost:8081'})

# Check if a new schema is compatible before registering it
new_schema_str = json.dumps({
    "type": "record",
    "name": "ListenEvent",
    "namespace": "com.spotify.events",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "track_id", "type": "string"},
        {"name": "duration_ms", "type": "int"},
        {"name": "skip_at_ms", "type": ["null", "int"], "default": None},
        {"name": "timestamp_ms", "type": "long"},
        {"name": "country_code", "type": "string"},
        {"name": "platform", "type": "string", "default": "unknown"},
        # Adding a new field - has a default, should be BACKWARD compatible
        {"name": "audio_quality", "type": "string", "default": "normal"}
    ]
})

subject = "listen-events-value"
new_schema = Schema(new_schema_str, schema_type="AVRO")

is_compatible = client.test_compatibility(subject, new_schema)
print(f"Schema is compatible: {is_compatible}")

if is_compatible:
    schema_id = client.register_schema(subject, new_schema)
    print(f"Registered schema ID: {schema_id}")
else:
    print("Schema registration blocked - would break existing consumers")
    # This is what should have caught the Spotify incident!

Code: Serialization Throughput Benchmark

import json
import time
import struct
from typing import Dict, Any, List

# We'll use fastavro for the Avro benchmark (no Schema Registry needed)
import fastavro
import io

# For protobuf, we generate a simple schema
# (In practice, generate from .proto file)

# ---- Data Setup ----
def generate_events(n: int) -> List[Dict[str, Any]]:
    """Generate n sample listen events as Python dicts."""
    return [
        {
            "user_id": 10000000 + i,
            "track_id": f"spotify:track:{i:040d}",
            "duration_ms": 180000 + (i % 60000),
            "timestamp_ms": 1705276620000 + i * 1000,
            "country_code": ["US", "GB", "DE", "JP", "BR"][i % 5],
            "platform": ["mobile", "desktop", "web"][i % 3]
        }
        for i in range(n)
    ]

# ---- JSON Benchmark ----
def benchmark_json(events: List[Dict], n_runs: int = 3) -> Dict:
    times = []
    sizes = []

    for _ in range(n_runs):
        start = time.perf_counter()
        serialized = [json.dumps(e).encode() for e in events]
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        sizes.append(sum(len(s) for s in serialized))

    return {
        "format": "JSON",
        "avg_time_ms": sum(times) / n_runs * 1000,
        "total_bytes": sum(sizes) // n_runs,
        "bytes_per_record": (sum(sizes) // n_runs) / len(events)
    }

# ---- Avro Benchmark (fastavro) ----
def benchmark_avro(events: List[Dict], n_runs: int = 3) -> Dict:
    schema = fastavro.parse_schema({
        "type": "record",
        "name": "ListenEvent",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "track_id", "type": "string"},
            {"name": "duration_ms", "type": "int"},
            {"name": "timestamp_ms", "type": "long"},
            {"name": "country_code", "type": "string"},
            {"name": "platform", "type": "string"}
        ]
    })

    times = []
    sizes = []

    for _ in range(n_runs):
        start = time.perf_counter()
        buf = io.BytesIO()
        fastavro.writer(buf, schema, events)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        sizes.append(buf.tell())

    return {
        "format": "Avro",
        "avg_time_ms": sum(times) / n_runs * 1000,
        "total_bytes": sum(sizes) // n_runs,
        "bytes_per_record": (sum(sizes) // n_runs) / len(events)
    }

# ---- Run Benchmarks ----
N = 100_000
print(f"Benchmarking serialization of {N:,} events...\n")
events = generate_events(N)

results = [
    benchmark_json(events),
    benchmark_avro(events),
]

print(f"{'Format':<12} {'Avg Time (ms)':<16} {'Total Size (MB)':<18} {'Bytes/Record':<14}")
print("-" * 62)
for r in results:
    print(
        f"{r['format']:<12} "
        f"{r['avg_time_ms']:>12.1f}   "
        f"{r['total_bytes']/1024/1024:>14.2f}   "
        f"{r['bytes_per_record']:>12.1f}"
    )

# Typical results for 100K events:
# Format       Avg Time (ms)    Total Size (MB)    Bytes/Record
# JSON              890.2             22.40             224.0
# Avro              142.6              8.71              87.1
# Protobuf*          95.3              6.42              64.2
# (* protobuf requires compiled .proto - use google.protobuf if you have it)

Format Comparison Table

Feature	JSON	Protobuf	Avro	Parquet
Type system	Weak (no int32/64 distinction)	Strong	Strong	Strong
Wire size	Large (~200B/record)	Small (~60B/record)	Small (~85B/record)	Very small (columnar)
Serialization speed	Slow	Fast	Medium	Batch only
Schema required	No	Yes (.proto file)	Yes (JSON schema)	Yes (embedded)
Schema evolution	None enforced	Good (field numbers)	Good (with registry)	Good
Human readable	Yes	No	No	No
Streaming	Yes	Yes	Yes (with registry)	No
Columnar reads	No	No	No	Yes
Primary use	REST APIs, config	gRPC, microservices	Kafka topics	Data lake files
Language support	Universal	Very broad	Broad	Broad

Architecture Diagram: Kafka + Schema Registry

YouTube Resources

Title	Channel	Why Watch
Protobuf or JSON? Choose Wisely	ByteByteGo	Crisp comparison of wire formats with real performance numbers
Apache Kafka Schema Registry Deep Dive	Confluent	Official deep dive from the team that built Schema Registry
Avro Schema Evolution in Production	DataCouncil	Engineering war stories from teams managing evolving schemas
Protocol Buffers Tutorial	TechWorld with Nana	Hands-on protobuf tutorial with Docker and gRPC
Data Serialization Formats Explained	Arjan Codes	Python-focused walkthrough of JSON, protobuf, and Avro

Production Engineering Notes

The Schema Registry Is Not Optional

Teams routinely try to use Avro without a Schema Registry by embedding schemas in messages or sharing schema files via Git. This works until it does not. When a schema changes, the race condition begins: some producers are on the new schema, some consumers are on the old schema. Without a registry enforcing compatibility, you have no single source of truth and no enforcement layer. Use the Schema Registry. Run it in HA mode (at least 3 nodes backed by Kafka internal topics) in production.

Field Number Discipline in Protobuf

In protobuf, field numbers 1–15 are encoded in 1 byte; field numbers 16–2047 take 2 bytes. Put your most frequently populated fields in the 1–15 range. This is especially important for message types that are sent billions of times per day. A 1-byte vs 2-byte field tag difference sounds trivial until it is multiplied by 5 billion messages.

The Null Handling Trap

Avro unions for nullable fields have a subtle ordering rule: the default value must match the first type in the union. ["null", "int"] with "default": null is correct. ["int", "null"] with "default": null will fail - the default must match int. This is a common source of schema registration errors.

// CORRECT: null first, default is null
{"name": "skip_at_ms", "type": ["null", "int"], "default": null}

// WRONG: default null but int is first type
{"name": "skip_at_ms", "type": ["int", "null"], "default": null}
// Error: default value null is not an "int"

Protobuf in Proto3: No Required Fields

Proto3 eliminated the concept of required fields. Every field is optional and has a default value (0 for numbers, "" for strings, false for bools). This makes evolution easier but means you cannot enforce required fields at the serialization layer. Validate required fields in application code - do not assume the protobuf layer handles it.

Confluent Schema Registry Capacity Planning

The Schema Registry stores schemas in an internal Kafka topic. Schemas are small (typically 1–10 KB each), so storage is not a concern. The concern is read latency: every message decode needs the schema. The registry client caches schemas in memory, so after warmup, schema lookups are local. The warm-up period matters: a new consumer instance with an empty cache will hit the registry for every distinct schema ID it sees. With high schema ID turnover, this can cause a thundering herd. Set a generous cache size (max.schemas.per.subject or schema.registry.cache.capacity).

Common Mistakes

:::danger Adding a field without a default in Avro If you add a field without a default value to an Avro schema and your compatibility mode is BACKWARD, the Schema Registry will reject the registration. If compatibility checking is disabled (or set to NONE), the registration will succeed - and existing consumers will crash when they try to deserialize messages that include the new required field without having an updated schema.

This is exactly what caused the Spotify incident in the opening scenario. Always add default values for new fields. Always enable compatibility checking in production. :::

:::danger Reusing protobuf field numbers If you delete field 4 (skip_at_ms) and later add a new field but accidentally give it field number 4, any old data containing the original field 4 will be silently decoded into the wrong field. This can corrupt your data pipeline without any errors. Always reserved 4; in your proto file after removing a field. :::

:::warning Using NONE compatibility mode in Kafka Schema Registry supports NONE compatibility mode, which turns off all compatibility checking. Teams often set this during development and forget to change it before going to production. In production, BACKWARD or FULL should be the default. Use NONE only in development/test environments. :::

:::warning Evolving schemas under an existing Avro topic without Schema Registry If you are writing Avro files directly (not through Kafka/Schema Registry) and you update the schema without updating all readers, you will get read failures. The Avro file header contains the writer's schema. If you change the schema but the reader has an old compiled schema, deserialization will fail. Use fastavro.reader() which reads the writer's schema from the file header and performs schema resolution against the reader's schema automatically. :::

:::warning Large Avro schemas in high-throughput topics The 5-byte Schema Registry prefix (magic byte + schema ID) is negligible overhead. But if you have deep nested Avro schemas with many optional fields, the binary encoding can actually be larger than expected. Avro union encoding adds overhead for nullable types (a union discriminator byte). For very high-throughput simple messages, protobuf with varint encoding may be more efficient. :::

Interview Q&A

Q1: Your team is running a Kafka pipeline with Avro and Schema Registry. A developer wants to rename the field user_id to userId (camelCase) to match the new API convention. How do you do this safely, and what are the risks?

In Avro, field names are not in the binary payload - field ordering is. So renaming is not automatically safe. The safe approach is to use Avro aliases: add the new field name as the primary name and the old name as an alias.

{
  "name": "userId",
  "type": "long",
  "aliases": ["user_id"]
}

With aliases, Avro schema resolution will map data written with user_id to the reader field userId. You register this new schema (it is BACKWARD compatible with aliases), deploy updated consumers, then later update producers. The risk is that any consumers using reflection or code generation may have hardcoded event.user_id and need to be updated to event.userId. You must coordinate this carefully - test with both the old and new schema before deploying widely.

Q2: Explain backward vs forward compatibility in Schema Registry. Which should you default to and why?

Backward compatibility means the new schema can read data written with the old schema. This is the safest for rolling deployments: you deploy new consumers first (which can read both old and new messages), then deploy new producers. New fields must have defaults. Old fields that are removed must have had defaults.

Forward compatibility means the old schema can read data written with the new schema. You deploy new producers first. New fields must be ignorable by old consumers.

Full compatibility is both simultaneously - the safest, most restrictive mode.

For most production Kafka systems, BACKWARD is the right default because it allows you to update consumers before producers. In practice, consumers need to handle new message fields gracefully. If you set FULL, you get the strongest guarantees but it is harder to evolve schemas (you cannot remove fields that did not have defaults). Start with BACKWARD, upgrade to FULL when your team has the discipline to always add defaults.

Q3: Why does protobuf use field numbers instead of field names? What is the practical implication?

Protobuf uses field numbers because it makes the wire format independent of names. This has three practical implications:

You can rename fields without breaking wire compatibility - old serialized data is still readable with the new schema because field numbers are unchanged.
The wire format is much more compact - a 1-byte field tag (for fields 1–15) vs a full field name string.
Field numbers must be treated as permanent identifiers. You cannot reuse a field number, even after removing the field. If you remove field 4 and add a new field 4 later, old data containing the original field 4 will be decoded as the new field's type, silently corrupting data.

The practical implication is that protobuf schema evolution requires discipline around field number management. Use reserved to mark removed field numbers.

Q4: Your Kafka consumer is failing because the schema changed. The producer deployed a new schema (v3) but your consumer was compiled against v2. Walk through exactly what happens and how you fix it without downtime.

What happens: the producer serializes with schema ID 42 (v3). The consumer reads the 5-byte prefix, gets schema ID 42, fetches v3 from the Schema Registry. The consumer was compiled against the v2 schema definition. If v3 added a field with a default value (BACKWARD compatible), the consumer can still deserialize - the new field is simply absent from the deserialized object. The consumer works fine.

If v3 was not BACKWARD compatible (no default on new field, or field removed without a previous default), the consumer throws a schema mismatch error. To fix without downtime:

Immediately pause the producer deployment (roll back producer to use v2 schema)
The consumer lag will drain as backlogged v3 messages are eventually re-attempted or sent to the DLQ
Update the consumer to handle v3
Deploy the updated consumer
Re-deploy the producer with v3

To prevent this in the future: enforce compatibility checking in Schema Registry, make schema registration a required step in the producer's CI pipeline, and test schema evolution in a staging environment.

Q5: When would you choose JSON over Avro or Protobuf for a Kafka topic?

JSON over Avro/Protobuf makes sense in three scenarios:

Debuggability during development: Being able to read Kafka messages with kafka-console-consumer without any schema tools is genuinely valuable during early development. Some teams use JSON in development and Avro in production.
External producer integration: When third-party systems or partners are sending data to your Kafka topic and they cannot or will not use your schema format. JSON is the lowest common denominator.
Low-volume, high-variety data: If you have hundreds of different event types with very low volume each (e.g., audit events, system alerts), the operational overhead of maintaining Avro schemas for each may outweigh the benefits. JSON with application-level validation is simpler.

In all other cases - high-volume topics, ML training pipelines, systems requiring schema governance - use Avro with Schema Registry or protobuf. The performance and reliability benefits are not theoretical.

Q6: A new ML team wants to add three new fields to your existing Kafka topic's Avro schema for model training features. The topic has 20 consumer groups. What process do you follow?

This is a real governance challenge at large organizations. The process:

Propose the schema change via a PR to the schema repository. The PR includes the new field definitions, all with default values (mandatory for BACKWARD compatibility), and justification.
Run compatibility check against all existing schema versions: client.test_compatibility("topic-value", new_schema). This must pass before merging.
Register the schema in staging Schema Registry, deploy one consumer group against it to verify end-to-end.
Communicate to all 20 consumer groups that a schema change is coming. Give a deprecation window for any consumers that may be sensitive to new fields appearing.
Deploy the producer with the new schema. Existing consumers using BACKWARD-compatible schemas will see the new fields with their defaults - no breakage.
Update consumers that want to use the new fields over the following sprint.

The key invariant: never register a schema without defaults for new fields. Never bypass compatibility checking. These two rules prevent 95% of schema-related incidents.

Summary

Serialization format is not a minor implementation detail - it is an architectural constraint that affects throughput, schema evolution, and the coupling between your producers and consumers. JSON is appropriate for external APIs and low-volume systems where human readability matters. Protobuf is the right choice for gRPC microservices and any system where compact binary encoding with strong types is needed. Avro with the Confluent Schema Registry is the gold standard for Kafka-based event streaming: centralized schema governance with enforced compatibility rules that prevent the 2 AM incidents.

The Schema Registry's compatibility modes are your enforcement layer. BACKWARD compatibility means new fields must have defaults. FULL compatibility is the highest safety standard. Enabling and enforcing compatibility checks in your CI pipeline is the single most impactful operational practice for preventing schema-related outages.

The Incident at 2 AM​

Why This Exists​

Historical Context​

Core Concepts​

What Serialization Actually Does​

JSON: The Ubiquitous Default​

Protocol Buffers: Binary with Strong Types​

Apache Avro: Schema-in-Band​

The Confluent Schema Registry​

Code: Protocol Buffers in Python​

Defining and Using a Protobuf Schema​

Handling Schema Evolution Safely​

Code: Avro with Confluent Schema Registry​

Checking Schema Compatibility Programmatically​

Code: Serialization Throughput Benchmark​

Format Comparison Table​

Architecture Diagram: Kafka + Schema Registry​

YouTube Resources​

Production Engineering Notes​

The Schema Registry Is Not Optional​

Field Number Discipline in Protobuf​

The Null Handling Trap​

Protobuf in Proto3: No Required Fields​

Confluent Schema Registry Capacity Planning​

Common Mistakes​

Interview Q&A​

Summary​