
Serialization and Data Formats

The Model That Could Not Cross the Wire

Your ML team has spent six weeks training a production recommendation model. Embeddings for 50 million items, 14 transformer layers, a custom attention mask. It achieves 7% lift over the existing model in A/B testing. The deploy date arrives. The serving team opens the checkpoint file and gets a Python error: AttributeError: Can't get attribute 'CustomAttentionLayer' on module '__main__'.

The model was saved with torch.save(model, path). That serializes the entire Python object graph using pickle, which encodes class references by name. The training environment had CustomAttentionLayer defined in __main__. The serving container has a different entrypoint. The class path does not match. The pickle deserializer cannot find the class and refuses to load. Six weeks of training. One pickle detail. Production blocked for three days while the teams figure out how to reconstruct the class in the right namespace.

This is not an edge case. It happens every time an organization's model training environment and serving environment have the slightest divergence. And it is completely avoidable with the right serialization approach - one that separates tensor data from code, stores structure explicitly, and does not embed runtime class references.

The problem goes deeper than loading errors. Consider a feature serving system that must serialize a feature vector for 200,000 users per second into a Kafka topic. JSON is the obvious choice - human-readable, universally parseable - but at that throughput, JSON encoding becomes the bottleneck. The same data in MessagePack is 40% smaller and 5x faster to encode. In Protocol Buffers with a defined schema, it is 60% smaller and 10x faster. The choice of serialization format is a performance decision that compounds at scale.

Or consider a 10 TB training dataset of image features stored in CSV files. Reading those files for a single training epoch takes 4 hours just for I/O - and the CSV parser is spending most of that time converting ASCII digits back to floats. The same data in Apache Parquet with Snappy compression fits in 1.8 TB and reads in 22 minutes. The ML training code did not change. Only the format did.

Serialization is invisible infrastructure until it becomes the bottleneck or the failure mode. Every ML system - training data storage, feature pipelines, model checkpoints, inter-service communication, experiment tracking - makes serialization choices, usually implicitly and often poorly. This lesson makes those choices explicit, gives you the tools to evaluate them quantitatively, and shows the production patterns that avoid the common failure modes.

Why This Exists

Before specialized serialization formats existed, engineers stored structured data in two ways: plain text (CSV, XML, JSON) or language-native binary formats (Python pickle, Java serialization, MATLAB .mat files). Both approaches have fundamental problems that become severe at ML scale.

Text formats are human-readable but computationally expensive. A float stored as the full-precision decimal string 3.141592653589793 uses 17 bytes; the same value in binary float32 uses 4 bytes. For a neural network with 1 billion parameters, that is the difference between 4 GB and 17 GB of storage. More importantly, parsing text back into floats is 10-100x slower than reading binary data - the parser must scan bytes, handle whitespace, convert character sequences to integers, and apply decimal-point scaling. At TB scale, this overhead dominates training pipeline latency.

Language-native formats solve the performance problem but create a different one: they are tightly coupled to a specific language runtime and version. Python pickle encodes not just data but object identity: class names, module paths, and Python-version-specific internal structures. A pickle file created with Python 3.8 and PyTorch 1.13 may fail to load in Python 3.11 and PyTorch 2.1. A pickle file that includes a user-defined class embeds the fully-qualified class name; if that class is renamed or moved, the file becomes unloadable. For long-term model storage, this is unacceptable.

The solution is a set of purpose-built serialization formats that separate concerns:

  • Schema-aware binary formats (Protocol Buffers, FlatBuffers) for structured records where backward compatibility matters
  • Columnar formats (Apache Parquet, Arrow IPC) for analytical data where you read specific columns but not all rows
  • Tensor-specific formats (safetensors, HDF5) for numerical arrays where memory mapping and partial loading matter
  • Generic binary formats (MessagePack, CBOR) for high-throughput streaming where JSON overhead is measurable

Each format exists because a specific class of problem - throughput, safety, compatibility, query efficiency, memory mapping - was not solved by the alternatives.

Historical Context

The history of serialization formats is a history of engineers repeatedly solving the same problem at a larger scale.

CSV (comma-separated values) dates to the 1970s. It is simple, universally supported, and completely ambiguous: no standard encoding for null values, no schema, no nested structures, no distinction between strings and numbers. It persists because it is the lowest common denominator.

Protocol Buffers were developed at Google around 2001 and open-sourced in 2008. The motivation was internal: Google's services were exchanging data in ad-hoc text formats, and as the number of services grew, schema mismatches became a constant source of bugs. Protocol Buffers introduced the key ideas that every subsequent format has borrowed: field numbers instead of field names (enabling backward-compatible schema evolution), explicit typing, and a schema-first approach where the .proto definition is the contract.

Apache Parquet was developed jointly by Twitter and Cloudera in 2013, based on ideas in the Dremel paper from Google (2010). The insight was that analytical queries typically access a few columns across many rows. Row-oriented formats (CSV, JSON, database rows) store all columns of a row together, forcing you to read irrelevant columns. Columnar formats store all values of a column together, so a query for one column only reads that column's data.

Apache Arrow was created in 2016 as a language-independent specification for columnar in-memory data. The key innovation was not the format but the ecosystem: if multiple languages (Python, R, Java, C++) all represent columnar data in the same in-memory layout, you can share data between them without serialization - just pass a pointer. Arrow IPC (inter-process communication) format is what you use when you need to send Arrow data across process boundaries.

HDF5 (Hierarchical Data Format) was developed at NCSA in the 1990s for scientific computing. It supports hierarchical groups of named datasets with metadata, partial I/O (reading a slice of a dataset without loading the whole file), and compression. h5py made it accessible from Python and it became the default format for storing large numerical arrays in scientific ML before safetensors existed.

Safetensors was created by Hugging Face in 2022, specifically to address the security and performance problems of pickle-based model checkpoints. It stores tensors in a simple binary format with a JSON header, supports memory mapping (loading only the tensors you need without reading the full file), and executes no code during deserialization.

Core Concepts

The Binary vs Text Tradeoff

The fundamental choice in any serialization format is binary vs text. Text formats (JSON, CSV, XML) store data as human-readable character strings. Binary formats store data in its native in-memory representation or a compact binary encoding.

For a float32 value like $\pi \approx 3.14159$:

| Format | Bytes | Relative size |
|---|---|---|
| JSON string "3.14159265358979" | 18 | 4.5x |
| JSON string "3.14" (precision loss) | 6 | 1.5x |
| float32 binary (IEEE 754) | 4 | 1.0x |
| float16 binary | 2 | 0.5x |
| bfloat16 binary | 2 | 0.5x |

For a model with $N = 10^9$ parameters stored as float32:

$$\text{JSON size} \approx 18N \text{ bytes} = 18 \text{ GB}$$

$$\text{Binary size} = 4N \text{ bytes} = 4 \text{ GB}$$

The parse time difference is larger than the size difference. A modern CPU can memory-copy binary float32 data at 20-40 GB/s. It can parse the same data from ASCII at roughly 0.2-0.4 GB/s - a 100x difference.
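A quick way to feel the difference on your own machine - a rough sketch, noting that a pure-Python float() loop is even slower than the optimized C parsers the numbers above refer to:

import time
import numpy as np

n = 1_000_000
values = np.random.randn(n).astype(np.float32)
as_text = ",".join(f"{v:.9g}" for v in values)   # ASCII decimal strings
as_binary = values.tobytes()                     # raw IEEE 754 bytes

start = time.perf_counter()
parsed = np.array([float(x) for x in as_text.split(",")], dtype=np.float32)
text_time = time.perf_counter() - start

start = time.perf_counter()
copied = np.frombuffer(as_binary, dtype=np.float32).copy()   # an actual memory copy
binary_time = time.perf_counter() - start

print(f"ASCII parse: {text_time * 1e3:8.1f} ms")
print(f"Binary copy: {binary_time * 1e3:8.1f} ms")
print(f"Ratio:       {text_time / binary_time:8.0f}x")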

The case for text formats is readability, debuggability, and universality. If you need to inspect a single record in a feature store or examine a checkpoint header, JSON is vastly easier than a binary blob. The production pattern is to use binary formats for data at rest and for transport, with text format support as a debugging escape hatch.

Pickle: How It Works and Why It Is Dangerous

Python's pickle module implements a serialization protocol that can serialize and deserialize arbitrary Python objects. The key mechanism is that pickle stores object construction as a sequence of opcodes that the pickle interpreter executes on load. The pickle opcodes include: GLOBAL (look up a class by module and name), REDUCE (call a callable with arguments), BUILD (update an object's state), and STACK_GLOBAL (push class reference for later REDUCE).

import pickle
import pickletools
import io

# What torch.save does internally (simplified)
def torch_save_simplified(obj, f):
    # pickle protocol 2 by default in older PyTorch,
    # protocol 4 or 5 in recent versions
    pickle.dump(obj, f, protocol=4)

# What pickle actually emits for a simple object
class MyConfig:
    def __init__(self, lr):
        self.lr = lr

config = MyConfig(0.001)
data = pickle.dumps(config, protocol=2)   # protocol 2 emits the classic GLOBAL opcode

# Inspect the opcodes (output abridged)
pickletools.dis(io.BytesIO(data))
# GLOBAL '__main__ MyConfig'   <- embeds class reference
# EMPTY_DICT
# MARK
# SHORT_BINUNICODE 'lr'
# FLOAT 0.001
# SETITEMS
# BUILD
# STOP

The GLOBAL '__main__ MyConfig' opcode is the problem. When this pickle is loaded, Python executes: import __main__; getattr(__main__, 'MyConfig'). If MyConfig is not in __main__ in the loading environment, you get AttributeError. If an attacker instead supplies a GLOBAL referencing os.system and a REDUCE with the argument 'rm -rf /', the pickle executes it.
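The same mechanism makes pickle exploitable. A benign sketch of the attack shape - __reduce__ lets any object dictate what callable runs at load time:

import pickle
import os

class Malicious:
    def __reduce__(self):
        # pickle emits a class reference plus REDUCE: the callable runs on load
        return (os.getcwd, ())   # an attacker would use os.system instead

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)   # executes os.getcwd() during deserialization
print(result)                    # the cwd string - not a Malicious instance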

:::danger Pickle Is Executable Code - Never Load Untrusted Pickle Files pickle.load() and torch.load() with default settings execute arbitrary Python code embedded in the data. Loading a .pt checkpoint file from an untrusted source (including some public Hugging Face repositories) can compromise your entire machine. Use torch.load(path, weights_only=True) in PyTorch 2.0+ or use safetensors for any checkpoint you did not create yourself. :::

The security vector is real and has been exploited. The ML community has demonstrated proof-of-concept malicious .pt files that establish reverse shells on load. Any ML serving system that loads user-provided model files without using weights_only=True or safetensors is vulnerable.

Safetensors: Safe, Fast, Memory-Mappable

Safetensors was designed with three requirements: no code execution during load, memory-mappable (load only what you need), and fast (competitive with raw binary copy). It achieves all three.

The format is simple:

  1. An 8-byte little-endian uint64 specifying the header length
  2. A JSON header containing tensor metadata (name, dtype, shape, data offsets)
  3. Padding (whitespace) to align the tensor data to a multiple of 8 bytes
  4. Raw tensor data, concatenated
import safetensors.torch as st
import torch
from safetensors import safe_open

# Save a model's state dict
model = torch.nn.Linear(1024, 1024)
state_dict = model.state_dict()

st.save_file(state_dict, "model.safetensors")

# Load everything
loaded = st.load_file("model.safetensors")

# Load only specific tensors (memory-mapped - doesn't load the rest)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    # Only the weight tensor is loaded into memory
    weight = f.get_tensor("weight")
    print(f.metadata())   # JSON header inspection without loading tensors

# Compare with pickle: torch.load loads the ENTIRE file
state_dict_pickle = torch.load("model.pt")   # loads everything (assumes a pickle checkpoint exists)

# Memory mapping: safetensors maps the file into virtual memory.
# Only the pages you actually access are fetched from disk.
# For a 7B model, you can inspect tensor shapes without loading 14 GB.
with safe_open("model.safetensors", framework="pt") as f:
    for key in f.keys():
        tensor_info = f.get_slice(key)
        print(f"{key}: {tensor_info.get_shape()}")
        # No tensor data loaded yet

The memory-mapping property is critical for large models. A 70B parameter model in float16 is 140 GB. With safetensors and memory mapping, you can inspect the metadata, load only the layers you need for a specific inference task, or load a model onto multiple GPUs by mapping different tensor slices to different devices - without ever holding the entire 140 GB in RAM simultaneously.
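The format is simple enough to inspect with the standard library alone. A minimal sketch, assuming the model.safetensors file written above:

import json
import struct

with open("model.safetensors", "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian uint64
    header = json.loads(f.read(header_len))          # tensor name -> dtype/shape/offsets

for name, info in header.items():
    if name == "__metadata__":                       # optional free-form metadata block
        continue
    print(name, info["dtype"], info["shape"], info["data_offsets"])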

Protocol Buffers: Schema-First Binary Serialization

Protocol Buffers (protobuf) encode structured data as a sequence of field-value pairs. Every field has a field number (not a name - names are only in the .proto schema file), a wire type (varint, 64-bit, length-delimited, 32-bit), and a value. The separation of field number from field name is what enables backward-compatible schema evolution.

// ml_features.proto
syntax = "proto3";

package ml.features;

message UserFeatures {
  uint64 user_id = 1;
  float session_duration_sec = 2;
  repeated float embedding = 3 [packed = true];
  int32 num_clicks = 4;
  repeated string recent_categories = 5;
  // Adding a new field in v2 - old readers ignore it, new readers read it
  float purchase_probability = 6;
}

message FeatureBatch {
  repeated UserFeatures features = 1;
  int64 timestamp_ms = 2;
  string model_version = 3;
}
# Generated from: protoc --python_out=. ml_features.proto
from ml_features_pb2 import UserFeatures, FeatureBatch
import time

# Serialize
user = UserFeatures(
    user_id=12345678,
    session_duration_sec=142.5,
    embedding=[0.1, 0.2, -0.3, 0.8] * 64,   # 256-dim embedding
    num_clicks=7,
    recent_categories=["electronics", "books"],
)

batch = FeatureBatch(
    features=[user],
    timestamp_ms=int(time.time() * 1000),
    model_version="rec-v3.2.1",
)

serialized = batch.SerializeToString()
print(f"Protobuf size: {len(serialized)} bytes")

# Deserialize
loaded = FeatureBatch()
loaded.ParseFromString(serialized)
print(loaded.features[0].user_id)   # 12345678

# Compare with JSON
import json

json_repr = {
    "user_id": 12345678,
    "embedding": [0.1, 0.2, -0.3, 0.8] * 64,
    # ...
}
json_bytes = json.dumps(json_repr).encode()
print(f"JSON size: {len(json_bytes)} bytes")
# Typical ratio: protobuf ~40% of JSON size for numerical data

On the wire, each field is a tag (a varint encoding the field number and wire type) followed by its value. Field 3 (embedding) with packed = true stores all the floats as a contiguous binary blob with a single length prefix, avoiding per-element overhead. This is the key optimization for repeated numerical fields.

Wire encoding of UserFeatures:
Field 1 (user_id), wire type 0 (varint): 0x08 0xCE 0xC2 0xF1 0x05
Field 2 (session_duration_sec), wire type 5 (32-bit): 0x15 <4 bytes IEEE 754>
Field 3 (embedding), wire type 2 (length-delimited, packed): 0x1A <len> <256 floats>
Field 4 (num_clicks), wire type 0 (varint): 0x20 0x07
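You can verify these bytes by hand: the tag is (field_number << 3) | wire_type, and a varint carries 7 bits per byte, least-significant group first. A minimal decoder:

def read_varint(buf, pos):
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift   # low 7 bits carry data
        if not (b & 0x80):              # high bit clear = last byte
            return result, pos
        shift += 7

raw = bytes([0x08, 0xCE, 0xC2, 0xF1, 0x05])
tag, pos = read_varint(raw, 0)
print(tag >> 3, tag & 0x7)   # field 1, wire type 0 (varint)
value, pos = read_varint(raw, pos)
print(value)                 # 12345678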

Schema evolution rules: you can add new fields (old readers ignore unknown fields, new readers use default values for missing fields), you can remove fields (but never reuse field numbers - mark as reserved), and you cannot change the type of an existing field. This makes protobuf suitable for long-lived data contracts where producers and consumers evolve independently.

Apache Arrow: Zero-Copy Columnar In-Memory

Apache Arrow's core insight is that the bottleneck in data pipelines is often not computation but data copying. When Python calls a Pandas function, then passes the result to a C++ library, then sends it to a Rust service, the default pattern involves multiple copies of the data: Python list -> NumPy array -> Pandas DataFrame -> C++ std::vector -> Rust Vec. Each conversion allocates new memory and copies bytes.

Arrow defines a language-independent in-memory format for columnar data. If Python, C++, Java, and Rust all represent a column of float32s as: a contiguous buffer of IEEE 754 floats, a validity bitmap (null mask), with metadata (type, length, null count) in a fixed layout - then passing data between them is just passing a pointer. Zero copies.

import pyarrow as pa
import pyarrow.ipc as ipc
import numpy as np

# Create an Arrow table from NumPy arrays
n_rows = 1_000_000
table = pa.table({
    "user_id": pa.array(np.random.randint(0, 10**8, n_rows), type=pa.int64()),
    "feature_1": pa.array(np.random.randn(n_rows), type=pa.float32()),
    "feature_2": pa.array(np.random.randn(n_rows), type=pa.float32()),
    "label": pa.array(np.random.randint(0, 2, n_rows), type=pa.int8()),
})

print(f"Table size: {table.nbytes / 1e6:.1f} MB")

# Write Arrow IPC file format (random access, footer index)
with pa.OSFile("features.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table, max_chunksize=65536)

# Write Arrow IPC stream format (sequential, for network streaming)
buf = pa.BufferOutputStream()
with ipc.new_stream(buf, table.schema) as writer:
    writer.write_table(table)
stream_bytes = buf.getvalue()
print(f"IPC stream size: {stream_bytes.size / 1e6:.1f} MB")

# Memory-mapped read - zero-copy if the data is already in the right format
with pa.memory_map("features.arrow", "r") as source:
    loaded = ipc.open_file(source).read_all()
    # loaded.column("feature_1") is a view into the mmap'd pages:
    # no copy, no allocation, just a pointer.

# Convert to NumPy
feat1_list = loaded.column("feature_1").to_pylist()   # slow: builds a Python list
feat1_np = loaded.column("feature_1").to_numpy()      # fast: buffer view
# to_numpy is zero-copy only for a single-chunk column with no nulls;
# a multi-chunk column (one chunk per record batch here) is concatenated first.

# Zero-copy interop: Arrow -> NumPy -> PyTorch
import torch

feat_tensor = torch.from_numpy(loaded.column("feature_1").to_numpy())
# from_numpy allocates nothing: the tensor and the NumPy array share memory.

The Arrow IPC format has two variants: the File format (random access, stores a schema at the start and an index at the end) and the Stream format (sequential, no seeking required, suitable for network streaming). For a Kafka-based feature pipeline, the Stream format is appropriate. For a local feature store with random batch access, the File format with memory mapping is faster than any other option.
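Reading the stream format back is symmetric: consume record batches in arrival order. A short sketch using the stream_bytes buffer written above:

# Consume the stream batch by batch, as a network consumer would
reader = ipc.open_stream(stream_bytes)
print(reader.schema)
total = 0
for batch in reader:
    total += batch.num_rows   # each batch is a zero-copy view into the buffer
print(f"Rows streamed: {total}")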

# Arrow compute: filter and project without touching unneeded data
import pyarrow.compute as pc

# Predicate: only rows with label == 1
mask = pc.equal(loaded.column("label"), 1)
active_users = loaded.filter(mask)

# Column selection: only the features, not the label
features_only = loaded.select(["user_id", "feature_1", "feature_2"])

# select is a metadata-only operation (zero-copy);
# filter materializes just the matching rows.

Apache Parquet: Columnar Storage for ML Datasets

Parquet is Arrow's on-disk sibling. Where Arrow defines an in-memory format, Parquet defines a compressed columnar file format optimized for storage and analytical queries. The two formats interoperate: Parquet files can be read directly into Arrow tables.

The Parquet file structure:

  1. File magic bytes: PAR1
  2. Row groups: the file is divided into row groups (default 128 MB each)
  3. Within each row group, each column is stored as a column chunk
  4. Column chunks contain pages of encoded, compressed data
  5. Footer: statistics (min/max/null count) per column chunk, schema, row group offsets
  6. File magic bytes: PAR1
import pyarrow.parquet as pq
import pyarrow as pa
import numpy as np

n_rows = 5_000_000
table = pa.table({
    "user_id": pa.array(np.random.randint(0, 10**8, n_rows)),
    "age": pa.array(np.random.randint(18, 80, n_rows).astype(np.int8)),
    # list-of-arrays column; slow to build at this size - illustration only
    "embedding": pa.array(list(np.random.randn(n_rows, 128).astype(np.float32))),
    "label": pa.array(np.random.randint(0, 2, n_rows).astype(np.int8)),
})

# Write Parquet with compression and column-level statistics
pq.write_table(
    table,
    "dataset.parquet",
    compression="snappy",        # fast, decent ratio
    # compression="zstd",        # better ratio, slightly slower
    row_group_size=128 * 1024,   # rows per row group (not bytes)
    write_statistics=True,       # enables predicate pushdown
    use_dictionary=True,         # dictionary encoding for low-cardinality columns
)

# Read only the columns you need (predicates push down to the C++ layer)
dataset = pq.read_table(
    "dataset.parquet",
    columns=["user_id", "embedding"],   # only read these two columns
    filters=[("label", "==", 1)],       # row group filtering using statistics
)
print(f"Rows after filter: {len(dataset)}")

# Partition large datasets on a key you filter by (date, label, region).
# partition_cols=["label"] writes label=0/ and label=1/ subdirectories;
# partitioning by date would give dataset/date=2024-01-15/part-0.parquet.
pq.write_to_dataset(
    table,
    root_path="partitioned_dataset",
    partition_cols=["label"],
)

# Read a partition directly - only reads the relevant files
pos_data = pq.read_table("partitioned_dataset/label=1")

The statistics stored in the Parquet footer enable predicate pushdown: if you filter on label == 1 and a row group has min=0, max=0, the reader skips the entire row group without reading its data. This only pays off if the data is sorted or partitioned so that labels cluster into row groups - on randomly ordered data every row group contains both classes and nothing can be skipped. For ML training datasets where you select only positive examples, clustering by label can reduce I/O by 50% or more depending on class balance.
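You can check whether your files are pushdown-friendly by reading the footer statistics directly; a sketch against the dataset.parquet file written above:

pf = pq.ParquetFile("dataset.parquet")
meta = pf.metadata
print(f"Row groups: {meta.num_row_groups}, rows: {meta.num_rows}")

label_idx = meta.schema.to_arrow_schema().get_field_index("label")
for rg in range(meta.num_row_groups):
    stats = meta.row_group(rg).column(label_idx).statistics
    print(f"row group {rg}: min={stats.min}, max={stats.max}, nulls={stats.null_count}")
    # A reader filtering on label == 1 skips any row group whose max is 0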

MessagePack: Binary JSON for High-Throughput Streaming

MessagePack is a binary serialization format that supports the same data model as JSON (maps, arrays, strings, integers, floats, booleans, null) but encodes values in a compact binary form. The encoding is self-describing (no schema required) and significantly faster to encode/decode than JSON.

import msgpack
import json
import time

# Sample feature record
record = {
    "user_id": 12345678,
    "features": [0.1, -0.3, 0.8, 0.2] * 64,   # 256 floats
    "categories": ["electronics", "books"],
    "timestamp": 1706025600,
}

# Benchmark
N = 100_000

start = time.perf_counter()
for _ in range(N):
    json_bytes = json.dumps(record).encode()
json_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(N):
    msgpack_bytes = msgpack.packb(record, use_bin_type=True)
msgpack_time = time.perf_counter() - start

print(f"JSON:    {len(json_bytes):5d} bytes, {json_time:.3f}s")
print(f"MsgPack: {len(msgpack_bytes):5d} bytes, {msgpack_time:.3f}s")
print(f"Size ratio:  {len(json_bytes)/len(msgpack_bytes):.2f}x")
print(f"Speed ratio: {json_time/msgpack_time:.2f}x")
# Typical: JSON ~1200 bytes, MsgPack ~700 bytes, 1.7x size, 3-5x speed

# Round-trip
decoded = msgpack.unpackb(msgpack_bytes, raw=False)
assert decoded["user_id"] == record["user_id"]

# NumPy arrays: use msgpack-numpy for efficient array encoding
import msgpack_numpy as m
import numpy as np

m.patch()   # patches msgpack to handle numpy arrays

arr_record = {"embedding": np.random.randn(256).astype(np.float32)}
packed = msgpack.packb(arr_record, default=m.encode)
unpacked = msgpack.unpackb(packed, object_hook=m.decode)
print(f"NumPy array round-trip: {unpacked['embedding'].dtype}")

MessagePack is the right choice when: you need a schema-less format (features can have variable structure), you are streaming data through Kafka or Redis and JSON throughput is measurable as a bottleneck, and you need language independence (MessagePack libraries exist for 50+ languages).
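For the streaming case, msgpack.Unpacker decodes records incrementally as bytes arrive, which is the shape of a Kafka or socket consumer; a minimal sketch:

import msgpack

unpacker = msgpack.Unpacker(raw=False)

# Simulate chunks arriving from a socket or Kafka consumer
stream = b"".join(msgpack.packb({"user_id": i, "score": i * 0.5}) for i in range(3))
for chunk_start in range(0, len(stream), 10):          # feed 10 bytes at a time
    unpacker.feed(stream[chunk_start:chunk_start + 10])
    for record in unpacker:                             # yields each complete record
        print(record)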

HDF5: Hierarchical Scientific Data

HDF5 organizes data as a filesystem-like hierarchy of groups (directories) and datasets (files). Each dataset is a named multi-dimensional array with arbitrary metadata attributes. This structure naturally fits the way ML experiments generate data: multiple runs, each with training curves, model weights, and hyperparameters.

import h5py
import numpy as np

# Writing an experiment archive
with h5py.File("experiment.h5", "w") as f:
    # Create groups like directories
    run1 = f.create_group("run_001")
    run1.attrs["learning_rate"] = 0.001
    run1.attrs["model_type"] = "transformer"
    run1.attrs["timestamp"] = "2024-01-15T10:30:00Z"

    # Store training metrics as datasets
    run1.create_dataset("train_loss", data=np.array([2.4, 2.1, 1.8, 1.5]))
    run1.create_dataset("val_loss", data=np.array([2.5, 2.2, 1.9, 1.7]))

    # Store model weights with compression
    weights = np.random.randn(4096, 4096).astype(np.float32)
    run1.create_dataset(
        "weights/layer_0",
        data=weights,
        compression="gzip",
        compression_opts=4,    # compression level 0-9
        chunks=(256, 256),     # chunk shape for partial I/O
    )

    # Chunked datasets enable partial loading:
    # reading run1["weights/layer_0"][0:256, 0:256] only reads one chunk

# Reading: partial I/O without loading the whole file
with h5py.File("experiment.h5", "r") as f:
    # List all runs
    print(list(f.keys()))    # ['run_001']

    run = f["run_001"]
    print(dict(run.attrs))   # hyperparameters

    # Load only a slice of a large weight matrix
    first_block = run["weights/layer_0"][0:256, 0:256]
    print(f"Block shape: {first_block.shape}")

    # Lazy dataset handle: metadata without data
    dset = f["run_001/weights/layer_0"]
    print(f"Full shape: {dset.shape}, dtype: {dset.dtype}")
    # The 64 MB of floats are not in RAM yet

# HDF5 is well-suited for:
# - Experiment archives with heterogeneous data
# - Large numerical arrays with partial-read patterns
# - Self-describing data (metadata as attributes)
# - Scientific data interchange (MATLAB .mat v7.3 is HDF5)

The chunking system is HDF5's key performance mechanism. With the default contiguous layout, a dataset is stored as one block: compression is impossible, and reading a 2-D sub-block means gathering scattered row fragments from across the file. With chunking, the dataset is divided into fixed-size chunks that are stored (and compressed) independently. Reading a 64x64 block from a 4096x4096 array with 256x256 chunks only reads the overlapping chunks, not the full 64 MB array.
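The arithmetic is worth making concrete; a sketch of how many chunks the read in the example above actually touches:

shape = (4096, 4096)                    # dataset shape
chunk = (256, 256)                      # chunk shape
block = (slice(0, 64), slice(0, 64))    # region being read

chunks_touched = 1
for sl, c in zip(block, chunk):
    first, last = sl.start // c, (sl.stop - 1) // c
    chunks_touched *= last - first + 1

chunk_bytes = chunk[0] * chunk[1] * 4   # float32
print(f"Chunks read: {chunks_touched} ({chunks_touched * chunk_bytes:,} bytes)")
print(f"Full dataset: {shape[0] * shape[1] * 4:,} bytes")
# 1 chunk (262,144 bytes) read instead of the full 67,108,864-byte array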

FlatBuffers: Zero-Copy Deserialization

FlatBuffers, developed by Google, takes a different approach than Protocol Buffers. Where protobuf parses the wire format into an in-memory object tree (requiring allocation and copying), FlatBuffers stores data in a format that is directly usable in memory without parsing. A FlatBuffer can be memory-mapped and accessed field-by-field without any deserialization step.

# FlatBuffers Python example (using generated code from the schema)
# Generate with: flatc --python model_config.fbs

# The FlatBuffer wire format IS the in-memory format:
# reading field X does a single pointer dereference into the buffer,
# with no intermediate object allocation.

import flatbuffers
import ModelConfig   # generated module (helper names vary slightly by flatc version)

# Build a FlatBuffer
builder = flatbuffers.Builder(1024)
model_name = builder.CreateString("transformer-xl")   # strings created before Start

ModelConfig.Start(builder)
ModelConfig.AddNumLayers(builder, 12)
ModelConfig.AddHiddenSize(builder, 768)
ModelConfig.AddModelName(builder, model_name)
config = ModelConfig.End(builder)
builder.Finish(config)

buf = bytes(builder.Output())
print(f"FlatBuffer size: {len(buf)} bytes")

# Zero-copy read: no parsing, no allocation
loaded_config = ModelConfig.ModelConfig.GetRootAs(buf, 0)
print(loaded_config.NumLayers())    # direct memory read
print(loaded_config.HiddenSize())   # direct memory read
print(loaded_config.ModelName())    # direct memory read (returns bytes)

FlatBuffers trade schema complexity for maximum read performance. They are appropriate when: read throughput is the bottleneck, data is written rarely and read many times, and you can tolerate the schema-first workflow. TensorFlow Lite uses FlatBuffers for its .tflite model format, which is why TFLite inference can start without any model parsing step.

ONNX: Protobuf for Model Interchange

ONNX (Open Neural Network Exchange) is a protobuf-based format for representing computational graphs. A .onnx file contains the model's graph structure (nodes, edges, tensor shapes, operators) and its weights, all encoded as Protocol Buffers.

import onnx
import onnxruntime as ort
import numpy as np
import torch

# Export a PyTorch model to ONNX
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
model.eval()

dummy_input = torch.randn(1, 128)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    opset_version=17,
)

# Inspect the ONNX protobuf
onnx_model = onnx.load("model.onnx")
print(f"ONNX IR version: {onnx_model.ir_version}")
print(f"Opset: {onnx_model.opset_import[0].version}")
for node in onnx_model.graph.node:
    print(f"  {node.op_type}: {list(node.input)} -> {list(node.output)}")

# Run with ONNX Runtime
session = ort.InferenceSession("model.onnx")
input_data = np.random.randn(32, 128).astype(np.float32)
outputs = session.run(None, {"input": input_data})
print(f"Output shape: {outputs[0].shape}")

The ONNX protobuf schema is publicly defined and versioned. An onnx.proto3 file specifies the exact wire format. This means any language with a protobuf library can read an ONNX model - you do not need the ONNX Python package to inspect a .onnx file. This is what enables ONNX Runtime to run on embedded devices and mobile platforms where a full Python stack is unavailable.
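To make the point concrete: a .onnx file parses with the generated protobuf message class alone, no runtime involved (using the ModelProto class shipped with the Python onnx package here for convenience):

from onnx import ModelProto   # generated protobuf message class

with open("model.onnx", "rb") as f:
    proto = ModelProto()
    proto.ParseFromString(f.read())   # plain protobuf deserialization

print(f"Producer: {proto.producer_name}, IR version: {proto.ir_version}")
print(f"Graph nodes: {len(proto.graph.node)}")
print(f"Embedded weight tensors: {len(proto.graph.initializer)}")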

Serialization Format Comparison

import time
import numpy as np
import json
import pickle
import msgpack
import safetensors.numpy as sfn

# Benchmark: 10,000 float32 values (typical small feature vector batch)
data = {"features": np.random.randn(10000).astype(np.float32)}
N = 10_000

def benchmark(serialize_fn, deserialize_fn, data):
    # Warmup
    for _ in range(100):
        buf = serialize_fn(data)
        _ = deserialize_fn(buf)

    # Measure serialization
    start = time.perf_counter()
    for _ in range(N):
        buf = serialize_fn(data)
    write_time = (time.perf_counter() - start) / N * 1000   # ms

    # Measure deserialization
    start = time.perf_counter()
    for _ in range(N):
        _ = deserialize_fn(buf)
    read_time = (time.perf_counter() - start) / N * 1000    # ms

    return len(buf), write_time, read_time

# JSON (baseline)
size_j, w_j, r_j = benchmark(
    lambda d: json.dumps({"features": d["features"].tolist()}).encode(),
    lambda b: json.loads(b),
    data,
)

# Pickle
size_p, w_p, r_p = benchmark(pickle.dumps, pickle.loads, data)

# MessagePack (ships the raw float32 buffer, not per-element values)
size_m, w_m, r_m = benchmark(
    lambda d: msgpack.packb({"features": d["features"].tobytes()}),
    lambda b: msgpack.unpackb(b),
    data,
)

# Safetensors (in-memory save/load round-trip)
size_s, w_s, r_s = benchmark(sfn.save, sfn.load, data)

# NumPy raw binary (theoretical best case)
size_n, w_n, r_n = benchmark(
    lambda d: d["features"].tobytes(),
    lambda b: np.frombuffer(b, dtype=np.float32),
    data,
)

print(f"{'Format':<12} {'Size':>8} {'Write ms':>10} {'Read ms':>10}")
print("-" * 44)
for name, size, w, r in [
    ("JSON",        size_j, w_j, r_j),
    ("Pickle",      size_p, w_p, r_p),
    ("MsgPack",     size_m, w_m, r_m),
    ("Safetensors", size_s, w_s, r_s),
    ("NumPy raw",   size_n, w_n, r_n),
]:
    print(f"{name:<12} {size:>8,} {w:>10.4f} {r:>10.4f}")

Typical results on modern hardware:

| Format | Size (bytes) | Write (ms) | Read (ms) |
|---|---|---|---|
| JSON | 312,450 | 1.8 | 2.1 |
| Pickle | 40,128 | 0.08 | 0.06 |
| MessagePack | 40,012 | 0.04 | 0.03 |
| NumPy raw | 40,000 | 0.002 | 0.001 |
| Safetensors | 40,200 | 0.005 | 0.003 |

JSON is 7.8x larger and 60-700x slower for pure numerical data. Pickle and MessagePack are comparable in size (both close to raw binary). Safetensors is near-raw-binary performance with the safety and memory-mapping benefits. The raw NumPy binary is the theoretical floor - every format adds some overhead above it.

Schema Evolution: Versioning Serialized Data

Schema evolution is the problem of changing the structure of serialized data while maintaining compatibility with older producers and consumers. This is one of the most practically important aspects of serialization for production ML systems.

# Problem: you trained with this feature schema (v1):
#   user_id, age, num_clicks
#
# Six months later, you add (v2):
#   user_id, age, num_clicks, purchase_history_len, device_type
#
# If you use JSON or pickle without explicit versioning,
# consumers built for v1 will crash on v2 records.

# Strategy 1: always-optional fields with a version tag
import json

def make_feature_record_v1(user_id, age, num_clicks):
    return {
        "schema_version": 1,
        "user_id": user_id,
        "age": age,
        "num_clicks": num_clicks,
    }

def make_feature_record_v2(user_id, age, num_clicks,
                           purchase_history_len, device_type):
    return {
        "schema_version": 2,
        "user_id": user_id,
        "age": age,
        "num_clicks": num_clicks,
        "purchase_history_len": purchase_history_len,   # new in v2
        "device_type": device_type,                     # new in v2
    }

# Consumer that handles both versions
def parse_feature_record(record_bytes):
    record = json.loads(record_bytes)
    version = record.get("schema_version", 1)   # default to v1 for legacy records
    if version == 1:
        return {
            "user_id": record["user_id"],
            "age": record["age"],
            "num_clicks": record["num_clicks"],
            "purchase_history_len": 0,          # default for missing field
            "device_type": "unknown",           # default for missing field
        }
    elif version == 2:
        return record
    else:
        raise ValueError(f"Unknown schema version: {version}")

Protocol Buffers handle this automatically via field numbers and default values:

// v1
message UserFeatures {
  uint64 user_id = 1;
  int32 age = 2;
  int32 num_clicks = 3;
}

// v2: add fields 4 and 5 - backward compatible.
// Old readers ignore fields 4 and 5; new readers get default values (0/"") when missing.
message UserFeatures {
  uint64 user_id = 1;
  int32 age = 2;
  int32 num_clicks = 3;
  int32 purchase_history_len = 4;   // new - old readers ignore
  string device_type = 5;           // new - old readers ignore

  // NEVER reuse field number 1, 2, or 3 for a different field.
  // NEVER change the type of field 1, 2, or 3.
  // To remove a field: mark it reserved, don't just delete it.
  reserved 6;                  // field 6 was removed in v2.1
  reserved "old_field_name";   // prevent future reuse of the name
}
# Arrow/Parquet schema evolution
import pyarrow as pa
import pyarrow.parquet as pq

schema_v1 = pa.schema([
    pa.field("user_id", pa.int64()),
    pa.field("age", pa.int8()),
    pa.field("num_clicks", pa.int32()),
])

schema_v2 = pa.schema([
    pa.field("user_id", pa.int64()),
    pa.field("age", pa.int8()),
    pa.field("num_clicks", pa.int32()),
    pa.field("purchase_history_len", pa.int32()),   # new
    pa.field("device_type", pa.string()),           # new
])

# Reading a v1 Parquet file with the v2 schema: missing columns become null
table_v1 = pq.read_table(
    "data_v1.parquet",
    schema=schema_v2,   # tells the reader to fill missing columns with null
)
# New columns are all null in this table - handle appropriately in transforms
# New columns are all null in this table - handle appropriately in transforms

The general schema evolution rules that apply across formats:

  1. Adding optional fields is safe (old producers omit them, old consumers ignore them)
  2. Removing fields requires care - use tombstoning/reserved markers, never reuse IDs
  3. Renaming fields breaks compatibility in most formats (only protobuf field numbers are wire-stable)
  4. Changing types is almost always breaking - migrate data explicitly
  5. Always include a version tag even in "flexible" formats like JSON

Mermaid Diagram: Format Selection Decision Tree

Mermaid Diagram: Parquet File Structure

Mermaid Diagram: Arrow Zero-Copy Pipeline

Production Engineering Notes

Checkpoint Strategy for Large Models

import safetensors.torch as st
import torch
import json
import pickle
import hashlib
from pathlib import Path

class SafeCheckpointer:
    """
    Production checkpoint strategy:
    - Safetensors for weights (safe, fast, mmappable)
    - Pickle for optimizer state (trusted: only our own files are loaded)
    - JSON for run metadata
    - Atomic write with rename (no partial reads)
    - SHA256 manifest for integrity checking
    """

    def __init__(self, checkpoint_dir: str):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)

    def save(self, model: torch.nn.Module, optimizer: torch.optim.Optimizer,
             step: int, metrics: dict):
        # Write to a temp dir first, then atomic rename
        tmp_dir = self.checkpoint_dir / f"step_{step}_tmp"
        tmp_dir.mkdir(exist_ok=True)

        # 1. Save model weights as safetensors
        weights_path = tmp_dir / "model.safetensors"
        st.save_file(model.state_dict(), str(weights_path))

        # 2. Save optimizer state with pickle. The state dict mixes Python
        #    objects with tensors, so safetensors cannot hold it directly;
        #    pickle is acceptable here because we only load our own files.
        with open(tmp_dir / "optimizer.pkl", "wb") as f:
            pickle.dump(optimizer.state_dict(), f)

        # 3. Save metadata as JSON
        meta = {
            "step": step,
            "metrics": metrics,
            "model_class": type(model).__name__,
            "optimizer_class": type(optimizer).__name__,
        }
        with open(tmp_dir / "meta.json", "w") as f:
            json.dump(meta, f, indent=2)

        # 4. Compute SHA256 for integrity
        manifest = {}
        for fpath in tmp_dir.iterdir():
            manifest[fpath.name] = hashlib.sha256(fpath.read_bytes()).hexdigest()
        with open(tmp_dir / "manifest.json", "w") as f:
            json.dump(manifest, f)

        # 5. Atomic rename (on the same filesystem this is O(1), not a copy)
        final_dir = self.checkpoint_dir / f"step_{step}"
        tmp_dir.rename(final_dir)

    def load(self, step: int, device: str = "cpu") -> dict:
        checkpoint_dir = self.checkpoint_dir / f"step_{step}"

        # Verify integrity
        manifest_path = checkpoint_dir / "manifest.json"
        if manifest_path.exists():
            manifest = json.loads(manifest_path.read_text())
            for fname, expected_sha in manifest.items():
                if fname == "manifest.json":
                    continue
                actual_sha = hashlib.sha256(
                    (checkpoint_dir / fname).read_bytes()
                ).hexdigest()
                if actual_sha != expected_sha:
                    raise RuntimeError(f"Checksum mismatch for {fname}")

        # Load weights (safe - no code execution)
        weights = st.load_file(
            str(checkpoint_dir / "model.safetensors"),
            device=device,
        )

        meta = json.loads((checkpoint_dir / "meta.json").read_text())
        return {"weights": weights, "meta": meta}

Parquet for Training Data at Scale

import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.compute as pc
import numpy as np
from typing import Iterator

class ParquetMLDataset:
    """
    Efficient ML training data loading from Parquet.
    - Column pruning: only read needed columns
    - Predicate pushdown: skip irrelevant row groups
    - Batch iterator: yield NumPy arrays built from Arrow record batches
    """

    def __init__(self, data_dir: str, feature_cols: list,
                 label_col: str, batch_size: int = 65536):
        # pyarrow.dataset gives batch iteration with filter/column pushdown
        self.dataset = ds.dataset(data_dir, format="parquet")
        self.feature_cols = feature_cols
        self.label_col = label_col
        self.batch_size = batch_size
        self.columns = feature_cols + [label_col]

    def batches(self, filters=None) -> Iterator:
        """Yield (features, labels) as NumPy arrays."""
        for batch in self.dataset.to_batches(
            columns=self.columns,
            batch_size=self.batch_size,
            filter=filters,
        ):
            # Arrow RecordBatch to NumPy: zero-copy for numeric columns
            features = np.stack([
                batch.column(col).to_numpy()
                for col in self.feature_cols
            ], axis=1)
            labels = batch.column(self.label_col).to_numpy()
            yield features, labels

    def class_counts(self) -> dict:
        """Count labels by scanning only the label column."""
        label_data = pq.read_table(
            self.dataset.files[0],   # simplified - should scan all files
            columns=[self.label_col],
        )
        counts = pc.value_counts(label_data.column(self.label_col))
        return {
            item["values"].as_py(): item["counts"].as_py()
            for item in counts.to_pylist()
        }


# Usage in a PyTorch training loop
import torch

dataset = ParquetMLDataset(
    "training_data/",
    feature_cols=["f1", "f2", "f3", "f4", "f5"],
    label_col="label",
    batch_size=4096,
)

for features_np, labels_np in dataset.batches():
    # Convert to tensors - from_numpy avoids a copy (shares the buffer)
    x = torch.from_numpy(features_np).float()
    y = torch.from_numpy(labels_np).long()
    # ... training step ...

orjson: Drop-In JSON Replacement

import orjson
import json
import numpy as np
import time

# orjson is a near-drop-in replacement for json with:
# - Native numpy/datetime support (opt-in via OPT_SERIALIZE_NUMPY)
# - 5-10x faster serialization
# - Stricter output: non-finite floats become null (valid JSON)

data = {
    "embeddings": np.random.randn(512).astype(np.float32),
    "metadata": {"model": "bert-base", "version": 3},
    "scores": [0.91, 0.87, 0.73],
}

# Standard json fails on numpy arrays without a manual .tolist()
# json.dumps(data)  # TypeError: Object of type ndarray is not JSON serializable

# orjson handles numpy arrays with the OPT_SERIALIZE_NUMPY option
serialized = orjson.dumps(data, option=orjson.OPT_SERIALIZE_NUMPY)   # returns bytes, not str
deserialized = orjson.loads(serialized)

# Benchmark
N = 50_000
payload = {"scores": np.random.randn(1000).astype(np.float32).tolist()}

start = time.perf_counter()
for _ in range(N):
    _ = json.dumps(payload).encode()
json_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(N):
    _ = orjson.dumps(payload)
orjson_time = time.perf_counter() - start

print(f"json:   {json_time:.3f}s")
print(f"orjson: {orjson_time:.3f}s")
print(f"Speedup: {json_time / orjson_time:.1f}x")
# Typical: 5-8x speedup

# orjson options
result = orjson.dumps(
    data,
    option=orjson.OPT_INDENT_2          # pretty-print
    | orjson.OPT_NON_STR_KEYS           # allow non-string dict keys
    | orjson.OPT_SERIALIZE_NUMPY        # numpy array handling
    | orjson.OPT_SORT_KEYS,             # deterministic output
)

Common Mistakes

:::danger torch.load Without weights_only=True Executes Arbitrary Code Any code in the pickle stream of a .pt checkpoint file is executed during torch.load(). This is not a theoretical risk - proof-of-concept exploits exist for public model repositories. Always use torch.load(path, weights_only=True) for any checkpoint you did not create yourself. For new code, use safetensors entirely. The weights_only=True parameter raises an error if the checkpoint contains non-tensor objects, which is the correct behavior. :::

:::danger Pickle Encodes Class References by Module Path torch.save(model, path) pickles the entire model including class definitions by name. If the class is defined in __main__, it can only be loaded in an environment where __main__ contains that class. This breaks across containers, across training vs serving environments, and across Python version upgrades. Always save model.state_dict(), not the model object itself. :::

:::warning Parquet Row Group Size Affects Query Performance Dramatically Too-large row groups (e.g. an entire 10 GB file as one row group) defeat predicate pushdown - the statistics cover the entire file, so the reader always has to scan everything. Too-small row groups add per-row-group overhead. The default of 128 MB is a reasonable starting point. For time-series or event data where you frequently filter by a time range, sort by the time column before writing so each row group covers a narrow time window. :::

:::warning JSON NaN and Infinity Are Not Valid JSON Python's json.dumps will serialize float('nan') and float('inf') as NaN and Infinity - which are not valid JSON. Most parsers reject these values. ML models frequently produce NaN in loss values, gradient norms, and embedding similarity scores. Always validate and handle NaN/Inf before JSON serialization. orjson serializes non-finite floats as null by default - at least valid JSON, but the signal is silently dropped, so validating beforehand is still required. :::
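A quick demonstration of the failure and the guard to apply before serializing:

import json
import math

record = {"loss": float("nan"), "acc": 0.93}
print(json.dumps(record))   # '{"loss": NaN, "acc": 0.93}' - NaN is not valid JSON

# Guard: replace non-finite floats before serializing
clean = {k: (v if not isinstance(v, float) or math.isfinite(v) else None)
         for k, v in record.items()}
print(json.dumps(clean))    # '{"loss": null, "acc": 0.93}'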

:::warning HDF5 Files Are Not Concurrent-Write Safe Multiple processes writing to the same HDF5 file concurrently will corrupt the file. HDF5's parallel I/O (via MPI) is the correct solution for multi-process write workloads, but it requires building h5py with MPI support. The common mistake is using HDF5 as a shared experiment log from multiple training processes. Use separate files per process and merge after training, or use a database (SQLite with WAL mode, or a proper experiment tracker like MLflow). :::

:::warning Protocol Buffer repeated float Fields Must Use packed = true Without packed = true, each element in a repeated float field gets its own field-number tag prefix (1-3 bytes per element). A 256-dimensional embedding with per-element tags uses 256 * (1 tag byte + 4 float bytes) = 1280 bytes instead of 1 tag byte + a 2-byte length prefix + 256 * 4 = 1027 bytes. For embedding features processed at millions of records per second, this 25% overhead is measurable. Always use [packed = true] for repeated numeric fields (it is the default in proto3, but explicit is clearer). :::

Interview Questions and Answers

Q1: What is the security problem with torch.load() and how do you fix it?

torch.load() uses Python's pickle module to deserialize model checkpoints. Pickle is a serialization protocol that encodes object construction as a sequence of opcodes including GLOBAL (import a class by module name) and REDUCE (call a callable with arguments). A malicious actor can craft a .pt file with opcodes that call os.system("rm -rf /") or establish a reverse shell. This code executes during the load() call, before any validation is possible.

The fix has two tiers. For loading existing .pt files from untrusted sources: use torch.load(path, weights_only=True). This restricts the deserializer to only reconstruct tensors, arrays, and a whitelist of safe types, raising an error if the pickle stream contains anything else. For new model saving workflows: switch to safetensors entirely. Safetensors stores a JSON header with tensor metadata and raw binary data. There is no execution during load - the loader reads the JSON, validates offsets, and returns pointers into the memory-mapped file. It is also 5-10x faster than pickle for loading large models because it skips the opcode interpretation step and directly maps tensor data.

Q2: What is the difference between Arrow IPC file format and stream format, and when do you use each?

Both formats use the same Arrow record batch encoding, but differ in how batches are framed and indexed.

The stream format writes record batches sequentially with a schema message at the start and an end-of-stream marker at the end. There is no index. You read from start to end without seeking. This format is appropriate for network streaming (Kafka messages, gRPC streams, Unix pipes) where the consumer processes batches in order and seeking is not possible or not needed.

The file format (also called the IPC file format or Feather v2) writes record batches sequentially and then writes a footer containing the schema and the file offsets of every record batch. This enables random access: you can seek directly to batch N without reading batches 0 through N-1. It also enables memory mapping: you can mmap() the file and each batch is just a region within the file, requiring no copy. This format is appropriate for local storage where random access and memory mapping matter.

For ML pipelines: use stream format for Kafka-based feature pipelines, use file format with memory mapping for feature stores on local or NFS storage.

Q3: Explain columnar storage in Parquet and why it benefits ML workloads.

Row-oriented storage stores all fields of a record together: [user_id, age, clicks, embedding, label] for record 1, then the same for record 2, and so on. To query a single column across 1 million rows, you must read every row's worth of bytes and skip the unwanted fields.

Columnar storage stores all values of a single field together: all 1 million user_id values in a contiguous buffer, then all 1 million age values, and so on. To read one column across 1 million rows, you read exactly that column's data - no other bytes are touched.

For ML workloads, this is critical because training pipelines typically access every row but only a few columns. A feature store with 200 columns might train a model using 15 features and a label. Row-oriented: read 200 columns, use 16. Columnar: read 16 columns. For a 10 TB dataset, that is the difference between reading 10 TB and reading 800 GB.

Parquet adds two more benefits. Compression works significantly better on columnar data because values of the same type and meaning are stored together - they have similar distributions, enabling higher compression ratios. And Parquet stores per-column-chunk statistics (min, max, null count), enabling predicate pushdown where the reader skips entire row groups whose statistics prove they cannot contain matching rows.

Q4: How does Protocol Buffers schema evolution work and what constraints must you follow?

Protobuf's backward compatibility is achieved through field numbers. On the wire, every field is encoded as (field_number << 3) | wire_type followed by the value. Field names are not on the wire - only numbers. This means you can rename a field in the .proto file without breaking existing serialized data.

The rules for safe evolution:

  • Adding a new field is always safe. Old readers see an unknown field number, check their unknown field preservation rules (proto3 preserves unknown fields by default), and continue. New readers get the default value (0/empty/"") if the field is missing from old data.
  • Removing a field is safe if you mark it as reserved. The field number and name are reserved to prevent future reuse. Reusing a field number for a different type or semantics is the most dangerous mistake - old data with the old field's encoding will be misinterpreted as the new field's type.
  • Changing a field's type is generally breaking unless the types share the same wire type and the values are compatible (e.g., int32 to int64 is safe because both are varint; float to double is breaking because they use different wire types).
  • Changing between singular and repeated is wire-compatible only for string, bytes, and message fields (a singular reader keeps the last repeated value); for numeric fields it is breaking, because packed repeated encoding uses a different wire type.

The practical rule for production systems: treat field numbers as immutable. Never reuse a field number. Add new fields at the end. Remove fields by marking reserved, never by deleting. Document field numbers in code review.

Q5: When would you choose MessagePack over Protocol Buffers for a production ML feature pipeline?

MessagePack is appropriate when: the schema is dynamic (features vary per user or event type, making a fixed protobuf schema impractical), the pipeline is polyglot but you do not want the overhead of schema compilation and code generation, and you are optimizing for wire size vs JSON without the operational complexity of managing .proto files.

Protocol Buffers are appropriate when: the schema is stable and well-defined, you need strict backward compatibility with automatic default handling, you need language-agnostic schema documentation (the .proto file is the contract), and you benefit from protobuf's code generation (typed accessors, serialization, parsing all generated from the schema).

In practice at ML companies: protobuf dominates for service-to-service feature passing where the schema is owned by a team and evolves formally (like the ONNX model format or a feature store's API). MessagePack dominates for Kafka-based event streaming where the schema is semi-structured and producers change faster than a formal schema review process allows. JSON is used for configuration, experiment metadata, and any context where human readability outweighs throughput.

Q6: What is memory mapping and how does safetensors exploit it for large model loading?

Memory mapping (mmap) is an operating-system feature (the mmap system call on POSIX systems) that maps a file's contents into the process's virtual address space. The file's bytes appear at a range of virtual memory addresses, but the kernel does not actually read the file into RAM until those addresses are accessed. When a page of virtual memory is accessed for the first time, the CPU triggers a page fault, the kernel reads the corresponding page (typically 4 KB) from the file, and execution continues. Subsequent accesses to the same page are instant because the data is in the page cache.

For a 70B parameter safetensors file (140 GB in float16), memory mapping means: the file is "opened" in microseconds, only the pages that are actually accessed are loaded into RAM, pages that are accessed once and not touched again can be evicted to make room for other data, and multiple processes can share the same physical pages if they memory-map the same file.

Safetensors is designed to exploit this. The JSON header at the beginning of the file specifies each tensor's byte offset within the file. To load only the first transformer layer's weights, you access only the virtual addresses corresponding to that tensor's byte range - the kernel loads only those pages. This is why safe_open() and f.get_tensor(key) can selectively load individual tensors from a 140 GB file without loading the entire file into RAM.

The contrast with pickle: pickle stores tensors as a series of opcodes interspersed with binary data. There is no way to seek directly to a specific tensor without parsing the entire opcode stream from the beginning. Pickle cannot be memory-mapped for random tensor access.

Q7: What happens to JSON serialization performance at high throughput and how do you address it?

At 200,000 requests per second with a 1 KB JSON payload per request, the serialization pipeline must process 200 MB/s of JSON encoding. Python's standard json module is implemented in Python (with a C accelerator for some operations) and typically achieves 50-150 MB/s for mixed payloads. At 200 MB/s throughput requirements, JSON encoding is on the critical path.

The first remedy is orjson, a Rust-implemented JSON library that achieves 500 MB/s - 1 GB/s for typical payloads. It is a near-drop-in replacement for json with native numpy support and stricter correctness. For most high-throughput Python services, switching from json to orjson with no other changes achieves a 5-8x speedup.

The second remedy is switching to a binary format. MessagePack with msgpack-python achieves 1-3 GB/s and produces 40-60% smaller payloads than JSON. For a feature serving system, this means 2-3x less network bandwidth and 5-10x less encoding CPU overhead. The cost is loss of human readability and the need for MessagePack decoders in all consumers.

The third remedy, applicable to columnar feature batches, is Arrow IPC. If features are naturally batched (serving a batch of 256 users in one request), Arrow IPC can represent that batch in near-raw-binary format and zero-copy it to a PyTorch tensor for inference. The encoding overhead approaches zero because the in-memory representation is already the wire format.

Summary

Every ML system makes serialization choices - often implicitly and often poorly. The choice of format at each layer compounds into performance or failure at scale.

The key decisions:

  • Model weights: use safetensors. Never use torch.save(model, path). Use torch.save(model.state_dict(), path) at minimum, safetensors for anything that crosses a trust boundary.
  • Training datasets: Parquet with Snappy compression and column statistics. Arrow IPC for in-memory pipelines.
  • Feature streaming: Protocol Buffers if you have a stable schema and need evolution guarantees. MessagePack for dynamic or high-throughput pipelines. orjson if you must use JSON.
  • Scientific/hierarchical data: HDF5 with chunking.
  • Schema evolution: always use field numbers (protobuf) or explicit version tags (JSON/msgpack). Never reuse field identifiers.

The performance differences are not marginal. JSON vs Parquet for a 10 TB training dataset is the difference between a 4-hour and a 22-minute data loading step. pickle vs safetensors for a 70B model is the difference between a 3-minute load (with security risk) and a 28-second memory-mapped load (safe). These choices belong in design reviews, not as afterthoughts.
