
Storage Hierarchy: SSD and NVMe

Reading time: ~35 min · Interview relevance: High · Target roles: ML Engineer, MLOps Engineer, Data Engineer, Systems Engineer

Training a 70B parameter model generates 140GB of checkpoint data every 30 minutes. The training run is bottlenecked at the checkpoint save, not the forward pass. The team is paying for A100 GPU time while those GPUs sit idle waiting for the storage layer. This is a $40,000/hour problem, and it is solved by understanding the storage hierarchy.

The Checkpoint Bottleneck That Costs Millions

A machine learning team at a well-known AI lab is training a large language model on 512 GPUs. Forward passes are fast. Gradient synchronization is fast. But every 30 minutes, the entire training loop pauses for 4 minutes while the model saves a checkpoint to network storage. The GPUs sit idle. The cluster costs $80,000 per hour. That 4-minute pause represents over $5,000 of wasted compute, repeated 48 times per day: $240,000 per day burned on storage waits.

The root cause is not a code bug. It is a fundamental misunderstanding of the storage hierarchy. The team is writing 140GB of checkpoint data synchronously from the main training process through the Linux kernel's buffered I/O path, over NFS to a SAN (Storage Area Network), using default network-mounted filesystem settings tuned for general purpose use, not for large sequential writes from dozens of parallel GPU workers.

The fix combines three ideas from this lesson. First, async checkpoint saving that does not block the training loop: hand the save off to a separate thread that serializes and writes while training continues. Second, local NVMe drives as a fast intermediate layer: write locally first (multiple GB/s burst, hundreds of MB/s or more sustained), then replicate to network storage in the background. Third, direct I/O that bypasses the kernel page cache for large writes (since the data will never be read again on this machine, caching it wastes RAM).

This example illustrates why understanding the storage hierarchy matters for ML engineers. Storage is not "someone else's problem." It is a first-class architectural concern that directly affects GPU utilization, training cost, and the speed at which experiments can iterate.

The storage hierarchy in a modern ML training cluster spans roughly seven orders of magnitude in latency: L1 cache at 1ns, DRAM at 100ns, NVMe at 100 microseconds, SATA SSD at 500 microseconds, spinning disk at 10 milliseconds, and network-attached storage at tens of milliseconds. Understanding where each dataset, checkpoint, and intermediate result should live in this hierarchy - and how to move data between layers efficiently - is a core systems skill for ML practitioners.
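To make these numbers concrete, the short sketch below estimates how long streaming a 140GB checkpoint takes at each tier. The NVMe, SATA SSD, spinning disk, and network figures follow the numbers used in this lesson; the cache and DRAM bandwidths are rough assumptions for illustration only.

# Approximate per-tier latency and sequential bandwidth (illustrative values)
STORAGE_TIERS = {
    "L1 cache":               (1e-9,   1000.0),  # ~1 ns,   ~1 TB/s (assumed)
    "DRAM":                   (100e-9,  100.0),  # ~100 ns, ~100 GB/s (assumed)
    "NVMe SSD":               (100e-6,    7.0),  # ~100 us, ~7 GB/s
    "SATA SSD":               (500e-6,    0.55), # ~500 us, ~550 MB/s
    "Spinning disk":          (10e-3,     0.2),  # ~10 ms,  ~200 MB/s
    "Network storage (NFS)":  (20e-3,     1.0),  # tens of ms, ~1 GB/s with good tuning
}

def time_to_stream(size_gb: float = 140.0) -> None:
    """Rough time to move one 140GB checkpoint sequentially through each tier."""
    for tier, (latency_s, bw_gbps) in STORAGE_TIERS.items():
        seconds = latency_s + size_gb / bw_gbps
        print(f"{tier:24s} {seconds:9.1f} s")

time_to_stream()   # NVMe ~20 s, NFS ~140 s, SATA SSD ~255 s, spinning disk ~700 s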

In this lesson, we will cover the physics of NAND flash, why NVMe is so much faster than SATA, how Linux's io_uring gives you kernel-bypassing async I/O, and how to design storage architectures for ML training at scale.

Why This Exists - The Physics of Persistent Storage

For decades, "persistent storage" meant spinning magnetic disks. A disk has a physical read head that must physically move to the right track (seek time: 8-15ms), wait for the right sector to rotate under the head (rotational latency: 0-6ms), then read or write data at the speed the platter spins (50-200 MB/s). The total latency for a random access is 10-20ms. For sequential reads, throughput is reasonable. For random access patterns - typical of database lookups, key-value stores, and embedding table accesses - spinning disks are catastrophically slow.

Flash memory changes this completely. Flash is a semiconductor memory technology where charge is stored in floating-gate transistors. There are no moving parts. Random access latency drops to microseconds. But flash has its own physics that engineers must understand to use it well.

Flash cells wear out. Every program/erase (P/E) cycle degrades the oxide layer surrounding the floating gate, eventually causing charge leakage and data errors. The number of P/E cycles a cell can survive depends on how many bits it stores: Single-Level Cell (SLC) stores 1 bit and survives 100,000 cycles; Multi-Level Cell (MLC) stores 2 bits and survives 10,000 cycles; Triple-Level Cell (TLC) stores 3 bits and survives 1,000-3,000 cycles; Quad-Level Cell (QLC) stores 4 bits and survives 100-1,000 cycles. Consumer drives use TLC/QLC (higher density, lower cost, lower endurance). Enterprise drives use MLC or SLC cache layers for hot data.

This wear-out problem drives the need for a Flash Translation Layer (FTL) - firmware running inside every SSD that remaps logical block addresses to physical flash locations, distributes writes evenly across cells (wear leveling), and manages bad blocks. The FTL is invisible to the OS but dramatically affects SSD performance and endurance characteristics.

Historical Context - From SATA to NVMe

Solid-state drives first shipped with SATA interfaces because SATA was the universal storage bus present in every PC and server. SATA 3.0 (2009) maxes out at 600 MB/s bandwidth. This was a hard protocol ceiling: even as NAND flash became faster, the SATA interface could not transfer data any faster.

The deeper problem was the AHCI (Advanced Host Controller Interface) command protocol designed for spinning disks. AHCI assumes high latency and supports only one command queue per device with 32 entries. For SSDs with sub-100 microsecond latency, this single queue became a serialization bottleneck.

NVMe (Non-Volatile Memory Express) was designed from scratch for flash and standardized in 2011. It connects SSDs directly to the PCIe bus, bypassing the SATA controller entirely. PCIe 4.0 x4 provides 8 GB/s of bandwidth. More importantly, NVMe supports up to 65,535 queues with up to 65,535 entries each - allowing massively parallel I/O from multiple CPU cores simultaneously. NVMe also reduces software overhead: submitting a command takes 2 register writes for NVMe versus 4+ uncacheable register accesses for AHCI.

The result: where SATA SSDs plateau at ~550 MB/s sequential read and ~100,000 IOPS random read, consumer NVMe drives reach 7,000 MB/s sequential read and 1,000,000+ IOPS random read. Enterprise NVMe drives (used in ML training clusters) push even further.

NAND Flash Architecture and Endurance

Understanding NAND flash internals helps explain SSD behavior that otherwise seems puzzling.

Flash can only be erased at block granularity (256KB to 4MB) but written at page granularity (4-16KB), and a page cannot be overwritten in place. This mismatch creates the "write amplification" problem: to update a single 4KB page, the FTL must either write the new version to a fresh page and later garbage-collect the old block, or, in the worst case, read the entire block, erase it, and write it back with the modified page. The amplification factor (how many physical bytes are written per logical byte) directly impacts both performance and endurance.
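A back-of-the-envelope sketch of what the amplification factor means for drive lifetime. The endurance rating, write amplification, and checkpoint figures below match the numbers used elsewhere in this lesson; treat it as an estimate, not a vendor calculation.

def drive_lifetime_days(
    endurance_tbw: float = 700,         # consumer NVMe TLC rating (terabytes written)
    checkpoint_gb: float = 140,         # one checkpoint
    checkpoints_per_day: int = 48,      # one every 30 minutes
    write_amplification: float = 1.5,   # typical for large sequential writes
) -> float:
    """Days until checkpoint traffic alone consumes the rated endurance."""
    physical_tb_per_day = checkpoint_gb * checkpoints_per_day * write_amplification / 1000
    return endurance_tbw / physical_tb_per_day

print(f"{drive_lifetime_days():.0f} days")   # ~69 days on a 700 TBW consumer drive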

Write Cliff and SLC Cache

Most consumer and enterprise SSDs ship with an SLC cache: a portion of the NAND configured in single-bit mode (fast, durable) that absorbs incoming writes. Data lands in the SLC cache at 2-4 GB/s and is flushed to TLC/QLC in the background at a lower rate. When the SLC cache fills up (which happens if you write to the drive faster than it can flush), writes slow dramatically - from 2-4 GB/s to 200-400 MB/s. This is the "write cliff."

For ML checkpointing, this matters: a 140GB checkpoint write will almost certainly exceed the SLC cache capacity (typically 10-50GB on enterprise drives) and hit the write cliff. Knowing this leads you to design checkpointing systems that either: (1) stream writes slowly enough to stay below the SLC flush threshold, or (2) explicitly account for the write cliff by measuring actual sustained write throughput.
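The write cliff is easy to model. A minimal sketch, using the cache size and throughput ranges quoted above (and ignoring background flushing during the burst):

def burst_write_seconds(
    write_gb: float,
    slc_cache_gb: float = 30,      # within the 10-50GB range quoted above
    burst_gbps: float = 3.0,       # SLC-cache write speed (2-4 GB/s)
    sustained_gbps: float = 0.4,   # post-cliff TLC/QLC write speed (200-400 MB/s)
) -> float:
    """Time to write `write_gb` GB: the SLC cache absorbs the first part at burst
    speed; everything beyond the cache lands at the post-cliff sustained rate."""
    in_cache = min(write_gb, slc_cache_gb)
    over_cache = max(0.0, write_gb - slc_cache_gb)
    return in_cache / burst_gbps + over_cache / sustained_gbps

print(f"{burst_write_seconds(140):.0f} s")   # ~285 s for a 140GB checkpoint vs ~47 s with no cliff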

NVMe Protocol and PCIe Bandwidth

PCIe bandwidth per generation:

| PCIe Gen | x4 Bandwidth | x16 Bandwidth | Year |
|----------|--------------|---------------|------|
| Gen 3    | 4 GB/s       | 16 GB/s       | 2010 |
| Gen 4    | 8 GB/s       | 32 GB/s       | 2017 |
| Gen 5    | 16 GB/s      | 64 GB/s       | 2022 |

Modern ML training servers connect NVMe drives in PCIe Gen 4 or Gen 5 x4 slots. Multiple drives in a RAID-0 stripe can aggregate bandwidth: 8 NVMe Gen 4 drives in RAID-0 theoretically provide 56 GB/s sequential read throughput.

Linux I/O Stack and Direct I/O

The Linux I/O stack has several layers, each adding latency and complexity. Understanding which layer to bypass is essential for high-performance storage.

Buffered I/O (default): data passes through the kernel page cache. Reads that hit the page cache are served from RAM at memory speed. Writes are buffered in RAM and flushed to disk asynchronously. The downside: for large sequential writes (checkpoints) that will never be re-read on this machine, the page cache wastes RAM and adds cache management overhead.

Direct I/O (O_DIRECT): bypasses the page cache entirely. Data transfers directly between user-space buffers and the storage device. Requirements: buffer must be 512-byte or 4096-byte aligned, transfer sizes must be multiples of the block size. Ideal for: checkpoint writes (no re-read benefit from caching), large model file reads into GPU memory (model will be loaded into CUDA buffers, not RAM), database-style workloads managing their own cache.

io_uring - Kernel-Bypassing Async I/O

Traditional async I/O in Linux (POSIX aio, libaio) has significant limitations: no support for buffered I/O, high syscall overhead, poor integration with modern kernels. io_uring, introduced by Jens Axboe in Linux 5.1 (2019), fixes all of this.

The key insight: io_uring uses two shared memory ring buffers between kernel and userspace - a submission queue (SQ) and a completion queue (CQ). The application writes I/O requests to the SQ without a syscall, the kernel drains the SQ and posts completions to the CQ. For high-throughput I/O, this eliminates the syscall overhead entirely (using SQPOLL mode, a kernel thread polls the SQ without any application syscall).

"""
io_uring interface via ctypes.
Demonstrates async I/O submission and completion without blocking.

In production, use the liburing Python binding or the aioboto3/aiofiles libraries
which wrap io_uring under the hood on modern kernels.
"""
import ctypes
import ctypes.util
import os
import mmap
import struct
from typing import List, Tuple


# io_uring structures (simplified for illustration)
# Full structures are in /usr/include/linux/io_uring.h

IORING_OP_READ = 22
IORING_OP_WRITE = 23
IORING_OP_FSYNC = 20

class IoUringSqe(ctypes.Structure):
"""Submission Queue Entry - describes one I/O operation."""
_fields_ = [
("opcode", ctypes.c_uint8), # operation type (read/write/fsync)
("flags", ctypes.c_uint8),
("ioprio", ctypes.c_uint16),
("fd", ctypes.c_int32), # file descriptor
("off", ctypes.c_uint64), # file offset
("addr", ctypes.c_uint64), # buffer address
("len", ctypes.c_uint32), # transfer size in bytes
("op_flags", ctypes.c_uint32),
("user_data",ctypes.c_uint64), # user-defined tag returned in CQE
("pad", ctypes.c_uint64 * 3),
]


class IoUringCqe(ctypes.Structure):
"""Completion Queue Entry - result of one completed I/O operation."""
_fields_ = [
("user_data", ctypes.c_uint64), # matches user_data from SQE
("res", ctypes.c_int32), # result: bytes transferred or -errno
("flags", ctypes.c_uint32),
]


# High-level async I/O demonstration using Python's asyncio + aiofiles.
# Note: aiofiles runs file operations in a thread pool rather than submitting
# them through io_uring; for true io_uring submission use the liburing bindings.
import asyncio
import aiofiles
from pathlib import Path


async def async_checkpoint_save(
    model_state: bytes,
    checkpoint_path: str,
    use_direct_io: bool = True,
) -> None:
    """
    Save a model checkpoint asynchronously.

    With use_direct_io=True the write bypasses the page cache via O_DIRECT
    (delegated to direct_io_write below, which handles the buffer alignment
    O_DIRECT requires). The calling coroutine can continue with the next
    training step while the write completes in the background.

    Args:
        model_state: serialized model bytes (from torch.save to a BytesIO)
        checkpoint_path: destination path
        use_direct_io: bypass page cache (recommended for large checkpoints)
    """
    if use_direct_io:
        # O_DIRECT needs block-aligned buffers and transfer sizes, which the
        # buffered aiofiles path does not guarantee - run the aligned helper
        # in a worker thread instead.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, direct_io_write, model_state, checkpoint_path)
    else:
        async with aiofiles.open(checkpoint_path, mode="wb") as f:
            await f.write(model_state)


async def async_load_dataset_shard(
    shard_paths: List[str],
) -> List[bytes]:
    """
    Load multiple dataset shards concurrently using async I/O.
    All shard reads are submitted at once - total time is roughly
    max(individual shard read times), not the sum.
    """
    async def read_shard(path: str) -> bytes:
        async with aiofiles.open(path, "rb") as f:
            return await f.read()

    # Submit all reads concurrently
    results = await asyncio.gather(*[read_shard(p) for p in shard_paths])
    return list(results)


# Low-level O_DIRECT write using os + mmap (no asyncio required)
def direct_io_write(data: bytes, path: str, block_size: int = 4096) -> int:
    """
    Write data using O_DIRECT, bypassing the kernel page cache.
    The buffer and transfer size must be aligned to block_size
    (4096 for most filesystems). Returns bytes written, including padding.
    """
    # Allocate a page-aligned buffer with an anonymous mmap (always page-aligned)
    aligned_size = ((len(data) + block_size - 1) // block_size) * block_size
    aligned_buf = mmap.mmap(-1, aligned_size)
    aligned_buf.write(data)

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
    view = memoryview(aligned_buf)
    try:
        total_written = 0
        while total_written < aligned_size:
            # Pass slices of the aligned mmap directly to write(). Copying into
            # an intermediate bytes object would lose the alignment O_DIRECT needs.
            end = min(total_written + 64 * 1024 * 1024, aligned_size)
            written = os.write(fd, view[total_written:end])
            total_written += written
        os.fsync(fd)  # O_DIRECT bypasses the data cache, not filesystem metadata
    finally:
        view.release()
        os.close(fd)
        aligned_buf.close()

    return total_written


def measure_storage_bandwidth(path: str = "/tmp/bw_test",
                              size_gb: float = 1.0) -> dict:
    """
    Measure sequential write and read bandwidth through the buffered
    (page cache) I/O path.

    Args:
        path: path on the filesystem/device to test
        size_gb: test file size in GB

    Returns:
        dict with buffered_write_gbps and buffered_read_gbps (GB/s)
    """
    import time

    size_bytes = int(size_gb * 1024 * 1024 * 1024)
    chunk_size = 64 * 1024 * 1024  # 64MB chunks
    data_chunk = b'\x00' * chunk_size

    results = {}

    # Test 1: Buffered sequential write (fsync included so we time the device,
    # not just the page cache)
    start = time.perf_counter()
    with open(path + "_buf", "wb") as f:
        bytes_written = 0
        while bytes_written < size_bytes:
            write_size = min(chunk_size, size_bytes - bytes_written)
            f.write(data_chunk[:write_size])
            bytes_written += write_size
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    results["buffered_write_gbps"] = size_gb / elapsed

    # Test 2: Buffered sequential read. Drop the page cache first for an honest
    # result (requires root; if it fails, the read may be partly served from RAM).
    os.system("sync && echo 3 > /proc/sys/vm/drop_caches 2>/dev/null")

    start = time.perf_counter()
    with open(path + "_buf", "rb") as f:
        while f.read(chunk_size):
            pass
    elapsed = time.perf_counter() - start
    results["buffered_read_gbps"] = size_gb / elapsed

    # Cleanup
    try:
        os.unlink(path + "_buf")
    except FileNotFoundError:
        pass

    return results
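A usage sketch for the helpers above. The paths are illustrative - run the benchmark on the mount you actually plan to use for checkpoints.

if __name__ == "__main__":
    # Sequential bandwidth through the buffered path on a local NVMe mount
    stats = measure_storage_bandwidth(path="/mnt/nvme/bw_test", size_gb=2.0)
    print(stats)

    # O_DIRECT write of 256MB of incompressible data
    payload = os.urandom(256 * 1024 * 1024)
    written = direct_io_write(payload, "/mnt/nvme/direct_test.bin")
    print(f"O_DIRECT wrote {written / 1e6:.0f} MB (padded to a 4096-byte multiple)")

    # Concurrent shard reads (uncomment with real shard paths)
    # shards = [f"/mnt/nvme/shard_{i}.bin" for i in range(8)]
    # data = asyncio.run(async_load_dataset_shard(shards))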

Memory-Mapped Files for Large Datasets

For datasets too large to fit in RAM, memory-mapped files provide a powerful abstraction: the OS maps the file into the virtual address space, and individual records are loaded from disk on demand (demand paging). This is especially useful for large embedding tables and tokenized training corpora.

"""
Memory-mapped file access for large ML datasets.
Enables random access to datasets larger than available RAM.
"""
import mmap
import numpy as np
import struct
import os
from pathlib import Path
import torch
from torch.utils.data import Dataset


class MmapDataset(Dataset):
"""
PyTorch Dataset backed by memory-mapped NumPy arrays.
Ideal for large tokenized corpora that do not fit in RAM.

File format:
Header: [n_samples: uint64, seq_len: uint64, dtype_str: 8 bytes]
Data: [n_samples * seq_len * dtype_itemsize bytes]
"""
HEADER_SIZE = 8 + 8 + 8 # 3 x uint64

def __init__(self, path: str, dtype=np.int32):
self.path = path
self.dtype = dtype

with open(path, "rb") as f:
header = f.read(self.HEADER_SIZE)

self.n_samples, self.seq_len, _ = struct.unpack("QQQ", header)
self.itemsize = np.dtype(dtype).itemsize

# Open the memory map - does NOT read the file into RAM
self._file = open(path, "rb")
self._mmap = mmap.mmap(
self._file.fileno(), 0,
access=mmap.ACCESS_READ
)

    def __len__(self) -> int:
        return self.n_samples

    def __getitem__(self, idx: int) -> torch.Tensor:
        """
        Fetch one sample. Only the requested bytes are loaded from disk
        (via OS demand paging). First access may trigger a page fault (slow);
        subsequent accesses to the same page are served from RAM (fast).
        """
        if idx >= self.n_samples:
            raise IndexError(f"Index {idx} out of range ({self.n_samples} samples)")

        offset = self.HEADER_SIZE + idx * self.seq_len * self.itemsize
        end = offset + self.seq_len * self.itemsize

        raw = self._mmap[offset:end]
        arr = np.frombuffer(raw, dtype=self.dtype).copy()
        return torch.from_numpy(arr)

    def _madvise_range(self, advice: int, start_idx: int, count: int):
        """madvise() needs a page-aligned start offset, so round down to a page."""
        offset = self.HEADER_SIZE + start_idx * self.seq_len * self.itemsize
        length = count * self.seq_len * self.itemsize
        aligned_offset = (offset // mmap.PAGESIZE) * mmap.PAGESIZE
        self._mmap.madvise(advice, aligned_offset, length + (offset - aligned_offset))

    def prefetch_sequential(self, start_idx: int, count: int):
        """
        Advise the OS to prefetch a range of samples.
        Uses madvise(MADV_SEQUENTIAL) to trigger read-ahead.
        """
        self._madvise_range(mmap.MADV_SEQUENTIAL, start_idx, count)

    def prefetch_random(self, start_idx: int, count: int):
        """
        Advise the OS that access will be random - disable read-ahead.
        Use this for random-access embedding table lookups.
        """
        self._madvise_range(mmap.MADV_RANDOM, start_idx, count)

    def __del__(self):
        if hasattr(self, '_mmap') and self._mmap:
            self._mmap.close()
        if hasattr(self, '_file') and self._file:
            self._file.close()


def write_mmap_dataset(data: np.ndarray, path: str):
    """
    Write a 2D array to the MmapDataset format.
    data shape: [n_samples, seq_len]
    """
    n_samples, seq_len = data.shape
    header = struct.pack("QQQ", n_samples, seq_len, 0)  # third field is reserved

    with open(path, "wb") as f:
        f.write(header)
        data.tofile(f)

    print(f"Written {n_samples:,} samples x {seq_len} tokens to {path}")
    print(f"File size: {os.path.getsize(path) / 1e9:.2f} GB")

Async Checkpoint Saving with PyTorch

"""
Async checkpoint saving for PyTorch training loops.
Key idea: serialization and disk write happen in a background thread,
so the training loop does not stall waiting for storage.
"""
import torch
import threading
import queue
import io
import os
import time
from typing import Dict, Optional
from pathlib import Path
import logging

logger = logging.getLogger(__name__)


class AsyncCheckpointSaver:
"""
Non-blocking checkpoint saver for PyTorch training.

Usage:
saver = AsyncCheckpointSaver(checkpoint_dir="/mnt/nvme/checkpoints")
saver.start()

for step in range(total_steps):
loss = train_step(model, optimizer)
if step % 1000 == 0:
# This returns immediately - does not block training
saver.save_async(
state=model.state_dict(),
optimizer_state=optimizer.state_dict(),
step=step,
)

saver.wait_and_stop() # wait for any in-flight writes before exiting
"""

def __init__(
self,
checkpoint_dir: str,
keep_last_n: int = 3,
use_direct_io: bool = True,
buffer_size_mb: int = 128,
):
self.checkpoint_dir = Path(checkpoint_dir)
self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
self.keep_last_n = keep_last_n
self.use_direct_io = use_direct_io
self.buffer_size = buffer_size_mb * 1024 * 1024

self._queue = queue.Queue(maxsize=2) # back-pressure if write is slow
self._thread: Optional[threading.Thread] = None
self._active = False
self._saved_steps = []

def start(self):
"""Start the background writer thread."""
self._active = True
self._thread = threading.Thread(
target=self._writer_loop,
daemon=True,
name="checkpoint-writer"
)
self._thread.start()
logger.info("AsyncCheckpointSaver started")

    def save_async(self, state: dict, optimizer_state: dict, step: int):
        """
        Enqueue a checkpoint for async saving.
        If the queue is full (the writer is busy), this blocks until space is
        available. This provides back-pressure: no more than 2 checkpoints can
        be queued at once.
        """
        # Serialize to bytes in the calling thread (CPU work only, no disk wait)
        buf = io.BytesIO()
        torch.save({
            "step": step,
            "model_state": state,
            "optimizer_state": optimizer_state,
        }, buf)
        checkpoint_bytes = buf.getvalue()

        checkpoint_path = self.checkpoint_dir / f"checkpoint_step_{step:08d}.pt"

        try:
            self._queue.put(
                (checkpoint_bytes, checkpoint_path, step),
                timeout=30.0  # warn if the writer is more than 30s behind
            )
        except queue.Full:
            logger.warning("Checkpoint queue full - training stalled waiting for writer")
            self._queue.put((checkpoint_bytes, checkpoint_path, step))

        logger.debug(f"Enqueued checkpoint for step {step} ({len(checkpoint_bytes)/1e6:.1f} MB)")

    def _writer_loop(self):
        """Background thread: drain the queue and write checkpoints to disk."""
        while self._active:
            try:
                item = self._queue.get(timeout=1.0)
            except queue.Empty:
                continue

            checkpoint_bytes, path, step = item
            t_start = time.perf_counter()

            try:
                self._write_checkpoint(checkpoint_bytes, path)
            except Exception:
                logger.exception(f"Checkpoint write failed for step {step}")
            else:
                elapsed = time.perf_counter() - t_start
                size_mb = len(checkpoint_bytes) / 1e6

                logger.info(
                    f"Checkpoint saved: step={step}, "
                    f"size={size_mb:.0f}MB, "
                    f"time={elapsed:.1f}s, "
                    f"bandwidth={size_mb / elapsed:.0f}MB/s"
                )

                self._saved_steps.append((step, str(path)))
                self._prune_old_checkpoints()
            finally:
                # Always mark the item done so wait_and_stop() cannot hang
                self._queue.task_done()

    def _write_checkpoint(self, data: bytes, path: Path):
        """Write checkpoint bytes to disk. Uses O_DIRECT if configured."""
        if self.use_direct_io:
            # O_DIRECT needs block-aligned sizes AND an aligned user buffer, so
            # copy into an anonymous mmap (page-aligned) padded to 4096 bytes.
            aligned_size = ((len(data) + 4095) // 4096) * 4096
            aligned_buf = mmap.mmap(-1, aligned_size)
            aligned_buf.write(data)
            view = memoryview(aligned_buf)

            fd = os.open(str(path), os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
            try:
                offset = 0
                while offset < aligned_size:
                    end = min(offset + self.buffer_size, aligned_size)
                    written = os.write(fd, view[offset:end])
                    offset += written
                os.fsync(fd)  # O_DIRECT skips the data cache, not filesystem metadata
            finally:
                view.release()
                os.close(fd)
                aligned_buf.close()
        else:
            # Buffered write with explicit fsync to ensure durability
            path.write_bytes(data)
            with open(path, "rb+") as f:
                os.fsync(f.fileno())

    def _prune_old_checkpoints(self):
        """Remove checkpoints beyond keep_last_n."""
        if len(self._saved_steps) > self.keep_last_n:
            oldest_step, oldest_path = self._saved_steps.pop(0)
            try:
                os.unlink(oldest_path)
                logger.debug(f"Pruned old checkpoint: step={oldest_step}")
            except FileNotFoundError:
                pass

    def wait_and_stop(self):
        """Wait for all queued checkpoints to finish, then stop the writer."""
        self._queue.join()
        self._active = False
        if self._thread:
            self._thread.join(timeout=60)
        logger.info("AsyncCheckpointSaver stopped")

Storage Class Memory and Optane

Intel Optane (3D XPoint technology, discontinued 2022 but still deployed) represented a new tier between DRAM and NAND SSD. Optane DIMM provided byte-addressable persistent memory at 300-400 ns access latency (vs 100ns DRAM, vs 100,000ns NVMe), with DRAM-like sequential bandwidth (40-50 GB/s) and persistence across power cycles.

For ML workloads, Optane DIMMs (in "App Direct" mode) were used as:

  1. An ultra-fast checkpoint destination - write 140GB checkpoint in 3-4 seconds instead of 30-60 seconds
  2. A dataset staging area - the hot portion of a dataset stored in persistent memory, surviving GPU crashes without reloading from object storage
  3. A key-value store backing - recommendation system embedding tables stored in persistent memory for sub-microsecond lookup

Optane is discontinued, but the concept lives on in emerging CXL (Compute Express Link) memory expansion products that provide similar characteristics.

RAID and Parallel Storage for ML

"""
Storage bandwidth estimation and RAID configuration for ML training.
"""
from dataclasses import dataclass
from typing import List


@dataclass
class StorageDevice:
name: str
read_bw_gbps: float # sequential read bandwidth
write_bw_gbps: float # sequential write bandwidth
read_iops: int # random read IOPS (4KB blocks)
write_iops: int # random write IOPS
capacity_tb: float
cost_per_tb_usd: float
endurance_tbw: float # total bytes written before wear-out


STORAGE_CATALOG = {
"sata_hdd": StorageDevice(
name="SATA HDD (7200 RPM)",
read_bw_gbps=0.2, write_bw_gbps=0.2,
read_iops=200, write_iops=200,
capacity_tb=20.0, cost_per_tb_usd=20,
endurance_tbw=999_999,
),
"sata_ssd_tlc": StorageDevice(
name="SATA SSD TLC",
read_bw_gbps=0.55, write_bw_gbps=0.52,
read_iops=95_000, write_iops=85_000,
capacity_tb=4.0, cost_per_tb_usd=80,
endurance_tbw=600,
),
"nvme_gen4_tlc": StorageDevice(
name="NVMe PCIe Gen4 TLC",
read_bw_gbps=7.0, write_bw_gbps=6.5,
read_iops=1_000_000, write_iops=700_000,
capacity_tb=4.0, cost_per_tb_usd=120,
endurance_tbw=700,
),
"nvme_gen4_enterprise": StorageDevice(
name="NVMe PCIe Gen4 Enterprise MLC",
read_bw_gbps=7.0, write_bw_gbps=4.0,
read_iops=1_500_000, write_iops=500_000,
capacity_tb=6.4, cost_per_tb_usd=350,
endurance_tbw=12_000, # 17x more durable
),
}


def estimate_raid0_bandwidth(
    n_drives: int,
    device_key: str = "nvme_gen4_tlc",
) -> dict:
    """
    Estimate RAID-0 aggregate bandwidth.
    RAID-0 stripes data across all drives - read and write bandwidth scale
    roughly linearly until the PCIe lanes or CPU become the bottleneck.
    No redundancy: failure of any drive loses all data.
    """
    dev = STORAGE_CATALOG[device_key]
    return {
        "drives": n_drives,
        "device": dev.name,
        "read_bw_gbps": dev.read_bw_gbps * n_drives,
        "write_bw_gbps": dev.write_bw_gbps * n_drives,
        "capacity_tb": dev.capacity_tb * n_drives,
        "total_cost_usd": dev.capacity_tb * dev.cost_per_tb_usd * n_drives,
    }


def checkpoint_time_analysis(
    model_size_gb: float,
    n_drives: int = 8,
    device_key: str = "nvme_gen4_tlc",
    checkpoint_freq_steps: int = 1000,
    training_step_sec: float = 0.5,
) -> dict:
    """
    Analyze the impact of checkpoint saving on GPU utilization.
    """
    config = estimate_raid0_bandwidth(n_drives, device_key)

    write_bw_gbps = config["write_bw_gbps"]
    checkpoint_time_sec = model_size_gb / write_bw_gbps

    # Training time between two consecutive checkpoints
    checkpoint_interval_sec = checkpoint_freq_steps * training_step_sec
    checkpoint_overhead_pct = (checkpoint_time_sec / checkpoint_interval_sec) * 100

    print("\nCheckpoint Analysis:")
    print(f"  Model size: {model_size_gb:.0f} GB")
    print(f"  Storage: {n_drives}x {config['device']}")
    print(f"  Write bandwidth: {write_bw_gbps:.1f} GB/s")
    print(f"  Checkpoint save time: {checkpoint_time_sec:.1f} seconds")
    print(f"  Checkpoint frequency: every {checkpoint_freq_steps} steps")
    print(f"  Step time: {training_step_sec:.2f} seconds")
    print(f"  Overhead (sync): {checkpoint_overhead_pct:.1f}% of training time")

    # With async saving, overhead only appears if a save outlasts the interval
    async_overhead_pct = max(0, checkpoint_time_sec - checkpoint_interval_sec) / checkpoint_interval_sec * 100

    print(f"  Overhead (async): {async_overhead_pct:.1f}% "
          f"(checkpoint {'hides behind training' if async_overhead_pct == 0 else 'takes longer than interval'})")

    return {
        "checkpoint_time_sec": checkpoint_time_sec,
        "sync_overhead_pct": checkpoint_overhead_pct,
        "async_overhead_pct": async_overhead_pct,
    }
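A quick usage sketch for the 140GB checkpoint from the opening scenario, using the catalog entries defined above (the expected numbers follow directly from those figures):

if __name__ == "__main__":
    stats = checkpoint_time_analysis(
        model_size_gb=140,
        n_drives=8,
        device_key="nvme_gen4_tlc",
        checkpoint_freq_steps=1000,
        training_step_sec=0.5,
    )
    # 8 x 6.5 GB/s = 52 GB/s of write bandwidth, so the save takes ~2.7 s:
    # ~0.5% overhead if saved synchronously, ~0% asynchronously, because 2.7 s
    # hides easily inside the 500 s between checkpoints.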

Filesystem Choices for ML Workloads

Not all filesystems are equal for ML workload patterns. The choice affects throughput, latency, and crash recovery behavior.

ext4 - the Linux default. Good all-around performance, mature, well-tested. Default journal mode (data=ordered) is safe. For ML storage use data=writeback for lower write latency (if you can tolerate brief inconsistency after crash). Mount with noatime to eliminate inode updates on every read.

XFS - better than ext4 for very large files and high-concurrency workloads. Parallel allocation groups allow concurrent directory operations. Preferred for checkpoint storage and large model file repositories.

ZFS - adds checksums, transparent compression, and copy-on-write snapshots. Compression with lz4 is nearly free (CPU cost is lower than disk bandwidth gain) and provides 1.3-2.0x effective capacity improvement on model weights and tokenized datasets. Snapshots make it trivial to roll back dataset preprocessing or model storage to any previous state.

Lustre / GPFS / WEKA - distributed parallel filesystems used in HPC clusters. Lustre separates metadata servers (MDS) from object storage servers (OSS), allowing hundreds of clients to read/write simultaneously. A well-configured Lustre filesystem with many OSS nodes can provide 1 TB/s aggregate bandwidth. The critical tuning parameters for ML workloads: stripe_count (spread each file across N OSS nodes), stripe_size (chunk size per OSS, use 1-4MB for sequential large reads).

# Lustre tuning for ML dataset shards
# Set high stripe count for large dataset files (read by many workers)
lfs setstripe --stripe-count 16 --stripe-size 4M /mnt/lustre/datasets/

# For checkpoint files (written by single process, smaller)
lfs setstripe --stripe-count 4 --stripe-size 1M /mnt/lustre/checkpoints/

# Check stripe layout of a file
lfs getstripe /mnt/lustre/datasets/train_shard_0000.bin

# Mount the filesystem on each client, then tune client RPC size and concurrency
mount -t lustre storage-mds@tcp:/ml_fs /mnt/lustre
lctl set_param osc.*.max_pages_per_rpc=1024   # 4MB RPCs (1024 x 4KB pages)
lctl set_param osc.*.max_rpcs_in_flight=16

Production Engineering Notes

NFS and Network Storage Pitfalls

Network-attached storage (NFS, SMB) introduces latency that compounds badly for small random I/O. The critical mistake: running a PyTorch DataLoader with num_workers=8 against an NFS mount where each worker opens, reads, and closes small files. Each file operation has a network round trip at 0.5-2ms. Eight workers generating 100 file operations per second each means 800 network transactions per second per client - this saturates NFS servers that serve dozens of training nodes simultaneously.

The fix: pre-shard datasets into large files (1-10 GB shards), have each DataLoader worker read a complete shard sequentially, and stage shards to local NVMe before training begins. Local NVMe reads at 6-7 GB/s; NFS reads at 1-2 GB/s with optimal tuning and 50-200 MB/s with poor access patterns.
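A minimal staging sketch under those assumptions: copy the shards a node will need from shared storage to local NVMe with a thread pool before the training loop starts. The paths and shard-assignment scheme are illustrative.

import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List

def stage_shards_to_local_nvme(
    shard_paths: List[str],
    local_dir: str = "/mnt/nvme/dataset_cache",
    max_workers: int = 8,
) -> List[str]:
    """Copy large dataset shards from shared storage (NFS/Lustre) to local NVMe.

    Each copy is one large sequential read, which network filesystems handle well;
    the DataLoader then reads only from local NVMe during training.
    """
    Path(local_dir).mkdir(parents=True, exist_ok=True)

    def _stage(src: str) -> str:
        dst = Path(local_dir) / Path(src).name
        if not dst.exists():               # skip shards already staged
            shutil.copyfile(src, dst)
        return str(dst)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_stage, shard_paths))

# Example: each node stages only the shards assigned to its rank
# local_shards = stage_shards_to_local_nvme(
#     [f"/mnt/lustre/datasets/train_shard_{i:04d}.bin" for i in range(rank, 1024, world_size)]
# )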

Storage Bandwidth Measurement for Training Pipeline Design

Before committing to a storage architecture, measure actual sustained bandwidth on representative workloads:

# fio: the standard storage benchmarking tool
# Sequential write (checkpoint simulation)
fio --name=seq_write --rw=write --bs=4M --size=100G \
--numjobs=1 --ioengine=io_uring --direct=1 \
--filename=/mnt/nvme/test_file --group_reporting

# Random read (embedding table simulation)
fio --name=rand_read --rw=randread --bs=4K --size=100G \
--numjobs=8 --ioengine=io_uring --direct=1 \
--filename=/mnt/nvme/test_file --iodepth=64 --group_reporting

# Parallel reads from multiple workers (DataLoader simulation)
fio --name=parallel_read --rw=read --bs=64K --size=50G \
--numjobs=16 --ioengine=io_uring --direct=1 \
--filename=/mnt/nvme/test_file --group_reporting

:::warning Sequential Benchmark Results Are Misleading for ML Workloads

A drive showing 7 GB/s in a sequential benchmark does not mean your DataLoader will see 7 GB/s. PyTorch DataLoader workers typically read many small files in parallel with random access patterns. Always benchmark with a workload that matches your actual access pattern (random 4K reads for index lookups, sequential 64K reads for shard streaming).

:::

:::danger Never Store Training Checkpoints Only on the Training Node's Local Drive

Local NVMe drives fail. If your training run's only checkpoint copy is on a local NVMe drive and the drive fails - or the node is preempted in a cloud environment - you lose the entire training run. Always sync checkpoints to at least one additional location: an object store (S3/GCS), a distributed filesystem (Lustre/GPFS), or a RAID array. The cost of the additional write is trivial compared to the cost of losing a multi-week training run.

:::

Interview Questions and Answers

Q1: Explain the difference between NVMe and SATA SSD at the protocol level. Why is NVMe 10x faster for random I/O?

The speed difference comes from two sources: the physical interconnect and the command protocol.

For the interconnect: SATA uses the SATA III interface, which maxes out at 600 MB/s regardless of how fast the underlying NAND is. NVMe uses PCIe lanes directly - PCIe 4.0 x4 provides 8 GB/s, PCIe 5.0 x4 provides 16 GB/s. There is no intermediate SATA host bus adapter translating commands in the I/O path.

For the command protocol: SATA uses AHCI, designed in 2004 for spinning disks. AHCI supports a single command queue with 32 entries. When multiple CPU cores issue I/O simultaneously, they all queue into this single bottleneck. NVMe supports up to 65,535 independent queues with 65,535 entries each. Each CPU core gets its own queue, eliminating contention entirely. For workloads with many concurrent I/O requests (8 DataLoader workers reading simultaneously), NVMe's parallel queues are the critical advantage.

For sequential read bandwidth, the difference is 550 MB/s (SATA) vs 7,000 MB/s (NVMe Gen 4). For random 4K reads, the difference is 100,000 IOPS (SATA) vs 1,000,000+ IOPS (NVMe). The IOPS difference matters for embedding table lookups and dataset index scanning.

Q2: What is write amplification in SSDs and how does it affect ML checkpointing workloads?

Write amplification (WA) is the ratio of physical bytes written to the flash versus logical bytes written by the host. It exists because NAND flash can only erase at the block granularity (256KB - 4MB) but writes at the page granularity (4-16KB). To update a single 4KB page, the FTL must read the entire block, erase it, write back all unchanged pages plus the modified page. This turns one 4KB write into potentially MBs of physical NAND operations.

WA is typically 1.5-3x for sequential writes (fairly efficient) and 5-20x for random small writes (severe). For a 70B parameter model checkpoint of 140GB: with 1.5x WA, the SSD physically writes 210GB per checkpoint. On a drive rated for 700 TBW endurance and a checkpoint frequency of twice per hour for 90 days, that is 2 * 24 * 90 * 210 = 907 TB of physical NAND writes - already exceeding the drive's endurance rating.

The implication: ML training workloads are write-intensive and require enterprise-grade NVMe drives (rated 5,000-25,000 TBW vs consumer drives at 100-700 TBW), or careful management where checkpoints are written to a fast local tier and immediately replicated off to a more durable destination.

Q3: How does io_uring improve on the traditional Linux async I/O model?

Traditional async I/O in Linux (POSIX aio_read/aio_write) has fundamental problems: it only works with O_DIRECT (no buffered I/O support), has high per-operation syscall overhead, uses signals or polling to handle completions (both are inefficient), and cannot express compound operations.

io_uring solves all of these. It uses two shared-memory ring buffers between the kernel and userspace: a Submission Queue (SQ) where the application writes I/O requests, and a Completion Queue (CQ) where the kernel posts results. In SQPOLL mode, a kernel thread continuously polls the SQ - the application never needs to make a syscall to submit I/O. This reduces latency for high-IOPS workloads from microseconds (syscall overhead) to nanoseconds (ring buffer write).

io_uring also supports: buffered and direct I/O (unlike old aio), fixed buffers registered once for zero-copy (avoids repeated page table updates), linked requests (chain operations: write then fsync atomically), and multishot requests (one submission that generates multiple completions). For a DataLoader that needs to issue hundreds of concurrent reads, io_uring with a depth-64 queue and fixed buffers can sustain full NVMe bandwidth with a single thread.

Q4: When should you use O_DIRECT versus buffered I/O for ML workloads?

Use O_DIRECT when:

  • Writing large checkpoints that will not be re-read on this machine. The page cache provides no benefit (the data will not be read from cache) and wastes RAM that could serve other purposes.
  • Reading large model files that will be loaded into GPU memory via CUDA. The data goes from disk through the page cache to user buffer to GPU - the page cache step is wasted.
  • Any workload managing its own caching (custom memory-mapped buffers, embedding tables with their own eviction policy). Double-caching in both your code and the kernel page cache wastes RAM and reduces effective cache capacity.

Use buffered I/O when:

  • Reading dataset shards that DataLoader workers will access multiple times per epoch. The page cache effectively extends RAM, allowing hot shards to be served at memory speed after the first read.
  • Writing small checkpoint metadata files (optimizer state, step count) where access patterns are unpredictable.
  • Any workload on machines with abundant RAM where page cache will actually be hit.

The O_DIRECT requirements are constraining: buffers must be aligned to the block size (4096 bytes for most filesystems), and transfer sizes must be multiples of the block size. This requires careful buffer management in production code.

Q5: How would you design the storage architecture for a 1000-GPU training cluster for large language models?

The design has four tiers, each serving a different access pattern.

Tier 1 - Object storage (S3/GCS): The source of truth for raw datasets, finished model artifacts, and long-term checkpoint archiving. Unlimited capacity, durable, geographically replicated. Latency is hundreds of milliseconds. Only accessed for initial data staging and final artifact storage.

Tier 2 - Parallel distributed filesystem (Lustre or WEKA): Shared storage accessible to all 1000 GPUs simultaneously. Optimized for the training data pipeline. Key design: high stripe count (16-64 OSS nodes per file) for large dataset shards to maximize concurrent read bandwidth. Target: 500+ GB/s aggregate read throughput. Checkpoints written here for shared access and durability.

Tier 3 - Local NVMe per training node: 4-8 NVMe drives per node in RAID-0, providing 20-50 GB/s local bandwidth. Used as: (1) local checkpoint staging (write here first, then replicate to Lustre asynchronously), (2) pre-fetched dataset shard cache (stage shards locally before the training loop starts), (3) fast scratch space for intermediate activations in gradient checkpointing scenarios.

Tier 4 - DRAM on each node: The PyTorch DataLoader prefetch buffer. 32-128GB per node reserved for pre-fetched batches. The pipeline keeps Tier 3 fully utilized by reading ahead while the GPU is busy with the current batch.

The coordination layer is a data pipeline manager that: pre-stages upcoming dataset shards from Lustre to local NVMe 5-10 minutes ahead of when each node will need them, manages checkpoint replication from local NVMe to Lustre asynchronously, and monitors bandwidth utilization across all tiers to detect bottlenecks before they stall training.

Q6: What is the write cliff in SSDs and how do you work around it in a sustained training workload?

The write cliff is the dramatic drop in write throughput that occurs when an SSD's SLC cache fills up. Consumer and enterprise SSDs allocate a portion of NAND in single-level cell (SLC) mode - where each cell stores only 1 bit. SLC writes are fast (2-4 GB/s) and the SLC region absorbs incoming writes immediately. The SSD then flushes SLC to the main TLC/QLC storage in the background at a lower rate (typically 500-800 MB/s sustained for TLC).

If writes arrive faster than the background flush rate, the SLC cache fills. At that point, incoming writes must go directly to TLC/QLC at the native program/erase rate: 200-400 MB/s on consumer drives, 400-800 MB/s on enterprise drives. This is the "cliff" - from 3 GB/s to 300 MB/s.

For ML training checkpointing, three workarounds: (1) Throttle checkpoint writes explicitly to stay below the SLC flush rate - use aiofiles with a bandwidth-limiting wrapper. (2) Use enterprise NVMe drives with a larger SLC cache (up to 200GB on some models) and higher sustained write endurance. (3) Write each checkpoint as a fast burst at full SLC-cache speed, accepting the cliff if the burst exceeds the cache, then pause long enough for the background flush to drain the SLC cache before the next checkpoint. This works well if checkpoint intervals are long (every 500-1000 training steps at 5-10 seconds per step is roughly 40-170 minutes between checkpoints).

© 2026 EngineersOfAI. All rights reserved.