Memory Safety and Rust
The Production Scenario
In 2019, the Hugging Face tokenizers library processed a production request on an ML serving cluster that had been running for 47 days without a restart. Somewhere between request 8.3 million and 8.4 million, a tokenizer object was accessed after the thread that created it had terminated - a classic use-after-free scenario in the underlying C extension. The process did not crash immediately. Instead, it silently returned corrupted tokenizations for the next 1,400 requests before a downstream validation check caught the anomaly. Those 1,400 responses were served to users with tokens shuffled, truncated, or replaced with garbage.
This is the characteristic signature of memory safety bugs: they are silent, they corrupt data rather than crashing cleanly, and they are extraordinarily hard to reproduce. The C code that caused the bug had passed all unit tests, all integration tests, and 47 days of production traffic before the race condition manifested. The team spent three weeks debugging it. The fix was a one-line change to ensure the tokenizer object's lifetime was properly managed.
That incident was one of the motivating factors behind Hugging Face's decision to rewrite the tokenizers library in Rust. The rewrite took two months. Since its deployment in 2020, the Rust tokenizers library has processed hundreds of billions of tokenizations in production with zero memory safety incidents. Not fewer - zero.
The contrast is not a coincidence. Rust's ownership and borrowing system makes the class of bug that caused the incident impossible to write. The compiler rejects code with dangling pointers, use-after-free, and data races at compile time. Not with a warning - with an error that prevents compilation. The bug that took three weeks to debug would have been caught in seconds during development.
This lesson is about why memory safety bugs exist, how Rust eliminates them, and how ML engineers can use Rust today - not by rewriting their entire stack, but by writing specific high-performance components in Rust and calling them from Python via PyO3.
Why This Exists
C and C++ are the dominant languages for systems programming because they give programmers direct control over memory: when to allocate, when to free, and exactly how to lay out data structures. That control is why NumPy, PyTorch, TensorFlow, and virtually every ML library's performance-critical code is written in C or C++.
The problem is that manual memory management requires the programmer to be correct about every allocation and every free, in every execution path, under every race condition. Humans are not reliably correct about this. Microsoft has reported that approximately 70% of their security vulnerabilities over the last decade were memory safety bugs. Google reports similar numbers for Chrome. These are not junior developers writing careless code - these are expert teams with extensive review processes and test suites.
The traditional response to this has been tools: Valgrind to detect bugs at runtime, AddressSanitizer to catch bugs during testing, fuzzing to find edge cases, code review to prevent them. These tools help. They do not eliminate the problem, because they are all reactive - they find bugs that exist, rather than preventing bugs from being written.
Rust's response is different. It prevents the entire class of memory safety bugs at compile time, with no runtime overhead, while maintaining performance comparable to C. This is what makes Rust genuinely novel: it is not a safer language at the cost of performance, it is a language that achieves both simultaneously.
Historical Context
Memory safety has been a known problem since the first buffer overflow exploits in the 1970s. The Morris worm (1988) - the first self-propagating internet worm - used a buffer overflow in the Unix fingerd daemon as its primary attack vector. Since then, buffer overflows have been the dominant class of security vulnerability in systems software.
The research community explored many approaches. Java and C# solved memory safety by adding garbage collection - you cannot have dangling pointers if the runtime manages object lifetimes. But garbage collection adds unpredictable pause times and throughput overhead that made these languages unsuitable for the systems programming use cases where C dominated.
Cyclone (2001, AT&T Research) was the first serious attempt at a memory-safe C dialect without GC. It introduced the concept of "regions" - statically-tracked scopes that determine when memory is freed. Cyclone directly influenced Rust's design, particularly its lifetime system.
Rust was designed by Graydon Hoare at Mozilla Research, with the first public release in 2010 and version 1.0 in 2015. The core insight was that ownership - the idea that every value has exactly one owner at any time - combined with a compile-time borrow checker could enforce memory safety with zero runtime overhead. When the owner goes out of scope, the value is freed. No GC required. No dangling pointers possible.
Mozilla used Rust to write Servo, a new browser engine, starting in 2013. Servo demonstrated that Rust could match C++ performance in a complex, real-world codebase. Amazon, Microsoft, Google, and the Linux kernel (which accepted Rust as a second language in 2022) followed.
In the ML ecosystem, Rust appeared first in tooling: the ruff Python linter (10-100x faster than flake8), uv, the Python package manager, and Hugging Face's tokenizers library. candle (2023, Hugging Face) is a pure-Rust ML framework with CUDA bindings, designed for deployment scenarios where Python overhead is unacceptable.
Core Concepts
The Four Memory Safety Bugs
There are four fundamental classes of memory safety bugs. Understanding them concretely is prerequisite to understanding why Rust's solution is effective.
Buffer overflow: writing past the end of an allocated buffer. The classic example is strcpy(dest, src) where src is longer than dest. The bytes past the end of dest overwrite adjacent memory, which may contain a return address (a stack buffer overflow) or another object's data (a heap buffer overflow).
Use-after-free: accessing memory after it has been freed. In C:
char *buf = malloc(100);
free(buf);
// buf still contains the old address - it's now a "dangling pointer"
buf[0] = 'X'; // undefined behavior - may crash, corrupt data, or do nothing
Use-after-free is especially insidious because the freed memory is often immediately reallocated for a different purpose. Writing to it corrupts the new allocation. The bug manifests far from its cause.
Double-free: calling free() on the same pointer twice. The heap allocator's metadata structures are corrupted, often leading to arbitrary code execution.
Data race: two threads accessing the same memory location concurrently, with at least one write, without synchronization. The result is undefined behavior - the value read is unpredictable and may be a partially-written combination of the two writes.
All four of these bugs result in undefined behavior in C and C++. The C standard says literally anything can happen. In practice, they result in crashes (if you are lucky), silent data corruption, or security vulnerabilities.
```mermaid
flowchart TD
    A["Memory Safety Bug Classes"]:::purple
    B["Buffer Overflow<br/>write past end of buffer"]:::red
    C["Use-After-Free<br/>access freed memory"]:::red
    D["Double-Free<br/>free same pointer twice"]:::red
    E["Data Race<br/>concurrent unsync'd access"]:::red
    F["Crash<br/>(detectable, lucky)"]:::orange
    G["Silent Data Corruption<br/>(worst case)"]:::red
    H["Security Vulnerability<br/>(arbitrary code exec)"]:::red
    A --> B
    A --> C
    A --> D
    A --> E
    B --> F
    B --> H
    C --> G
    C --> H
    D --> H
    E --> G
    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
    classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
    classDef red fill:#fee2e2,color:#7f1d1d,stroke:#dc2626
```
Rust Ownership: The Core Model
Rust's safety guarantees rest on three rules enforced at compile time:
- Every value has exactly one owner at any time
- When the owner goes out of scope, the value is automatically freed (its destructor runs)
- You can have either one mutable reference OR any number of immutable references to a value, but not both simultaneously
These three rules, enforced by the borrow checker, make all four classes of memory safety bugs impossible:
- Buffer overflow: Rust's slice types carry their length. Indexing operations are bounds-checked. No way to write past the end of a slice.
- Use-after-free: you cannot hold a reference to a value that has been freed. The borrow checker verifies that all references have shorter lifetimes than the value they reference.
- Double-free: ownership is single - when the owner drops, the value is freed exactly once. You cannot call free() manually; the compiler generates the drop code.
- Data race: the mutable/immutable reference rule prevents two threads from having concurrent access with a mutable reference. Rust's Send and Sync traits encode thread-safety at the type level.
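As a minimal sketch of the first bullet (an assumed example, not from the original), slice indexing in Rust is bounds-checked, and `get` offers a non-panicking alternative:

```rust
fn main() {
    let v = vec![1, 2, 3];

    // Checked access: .get returns Option instead of reading out of bounds.
    assert_eq!(v.get(1), Some(&2));
    assert_eq!(v.get(10), None);

    // Indexing is bounds-checked too: v[10] panics at runtime with an
    // index-out-of-bounds message - it never silently touches adjacent memory.
    let result = std::panic::catch_unwind(|| v[10]);
    assert!(result.is_err());
}
```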
// This demonstrates Rust's ownership model
// This code COMPILES - showing correct ownership patterns
fn main() {
// OWNERSHIP: s1 owns the String
let s1 = String::from("hello");
// MOVE: ownership transfers to s2, s1 is no longer valid
let s2 = s1;
// println!("{}", s1); // COMPILE ERROR: value moved into s2
println!("{}", s2); // OK: s2 is the owner
// CLONE: explicit copy when you need two owners
let s3 = s2.clone();
println!("{} and {}", s2, s3); // Both valid
// BORROWING: &s2 borrows s2, s2 remains the owner
let len = calculate_length(&s2); // pass a reference
println!("'{}' has length {}", s2, len); // s2 is still valid
// MUTABLE BORROW: exclusive mutable access
let mut s4 = String::from("world");
change(&mut s4);
println!("{}", s4); // "world, changed"
}
// s2, s3, s4 all go out of scope here - memory automatically freed
fn calculate_length(s: &String) -> usize {
s.len()
// s goes out of scope, but does NOT own the String
// the String is NOT freed here
}
fn change(s: &mut String) {
s.push_str(", changed");
}
// USE-AFTER-FREE is impossible:
fn use_after_free_attempt() -> &str { // COMPILE ERROR: lifetime issue
let s = String::from("local string");
&s // ERROR: s is freed when function returns, reference would dangle
}
// DATA RACE is impossible in safe Rust:
use std::sync::{Arc, Mutex};
use std::thread;
fn safe_concurrent() {
let counter = Arc::new(Mutex::new(0));
let mut handles = vec![];
for _ in 0..10 {
let counter_clone = Arc::clone(&counter);
let handle = thread::spawn(move || {
let mut num = counter_clone.lock().unwrap();
*num += 1;
});
handles.push(handle);
}
for handle in handles {
handle.join().unwrap();
}
println!("Counter: {}", *counter.lock().unwrap()); // Always 10
}
// You cannot create a data race with safe Rust code.
// The borrow checker prevents it at compile time.
The Borrow Checker: Lifetime Analysis
The borrow checker's core task is verifying that all references are valid for their entire usage. Every reference has a "lifetime" - a region of the program during which it is guaranteed to be valid. The borrow checker ensures that no reference outlives the value it references.
Most lifetimes are inferred. You only need to annotate lifetimes explicitly when the compiler cannot determine them from context (typically when a function returns a reference and the compiler cannot tell which parameter the reference came from).
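For instance (a sketch, not from the original), a function with a single reference parameter needs no annotation - the compiler infers that the returned slice borrows from `s`:

```rust
// Lifetime elision: one input reference, so the compiler infers that the
// returned &str borrows from `s`. No explicit 'a annotation is needed.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

fn main() {
    assert_eq!(first_word("hello world"), "hello");
    assert_eq!(first_word(""), "");
}
```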
// Lifetime annotations: 'a means "some lifetime a"
// The 'a annotation says: the returned reference lives as long
// as the shortest-lived of the two input references
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
if x.len() > y.len() { x } else { y }
}
// Without annotation, this would be a compile error because
// the compiler cannot determine the lifetime of the return value
fn main() {
let string1 = String::from("long string");
let result;
{
let string2 = String::from("xyz");
result = longest(string1.as_str(), string2.as_str());
println!("Longest: {}", result); // OK: result used within string2's scope
}
// println!("{}", result); // COMPILE ERROR: string2 dropped, result would dangle
}
// Structs with references also need lifetime annotations
struct Important<'a> {
text: &'a str, // this reference must live as long as the struct
}
impl<'a> Important<'a> {
fn announce(&self, announcement: &str) -> &str {
println!("Attention: {}", announcement);
self.text // returns a reference with lifetime 'a
}
}
The borrow checker analyzes the entire control flow of your program. If there is ANY path through the code where a reference could outlive its value, the compiler rejects the code. Not at runtime - at compile time.
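The analysis is precise down to the last use of each reference (non-lexical lifetimes). A minimal sketch, assuming nothing beyond the standard library:

```rust
fn main() {
    let mut data = vec![1, 2, 3];

    let first = &data[0];        // immutable borrow of `data` begins
    println!("first = {first}"); // ...and ends at its last use, here

    data.push(4);                // mutable borrow is fine: no other borrow is live
    assert_eq!(data.len(), 4);

    // Swapping the two middle statements would be a compile error:
    // cannot borrow `data` as mutable while `first` is still in use.
}
```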
PyO3: Calling Rust from Python
PyO3 is the Rust library for writing Python extensions in Rust. It provides macros and types that handle the Python C API, memory management between Python and Rust objects, and the conversion between Python types and Rust types.
Maturin is the build tool that compiles Rust code with PyO3 into a Python wheel.
# Create a new PyO3 project
pip install maturin
maturin new my_rust_extension --bindings pyo3
cd my_rust_extension
The project structure:
my_rust_extension/
├── Cargo.toml       # Rust project config (dependencies, metadata)
├── src/
│   └── lib.rs       # Rust source code
└── pyproject.toml   # Python package config (maturin-based)
Cargo.toml:
[package]
name = "my_rust_extension"
version = "0.1.0"
edition = "2021"
[lib]
name = "my_rust_extension"
crate-type = ["cdylib"] # dynamic library loadable by Python
[dependencies]
pyo3 = { version = "0.20", features = ["extension-module"] }
rayon = "1.8" # parallel iterators
// src/lib.rs - Rust extension for Python
use pyo3::prelude::*;
use pyo3::exceptions::PyValueError;
/// Tokenizes text using a simple whitespace splitter.
/// Returns a list of (token, start_offset, end_offset) tuples.
/// Written in Rust for ~10x speedup over equivalent Python.
#[pyfunction]
fn tokenize_batch(texts: Vec<String>) -> PyResult<Vec<Vec<(String, usize, usize)>>> {
let results: Vec<Vec<(String, usize, usize)>> = texts
.iter()
.map(|text| {
let mut tokens = Vec::new();
let mut start = 0;
let mut in_token = false;
for (i, ch) in text.char_indices() {
if ch.is_whitespace() {
if in_token {
tokens.push((text[start..i].to_string(), start, i));
in_token = false;
}
} else {
if !in_token {
start = i;
in_token = true;
}
}
}
if in_token {
let end = text.len();
tokens.push((text[start..end].to_string(), start, end));
}
tokens
})
.collect();
Ok(results)
}
/// Count token frequencies in a large corpus.
/// Uses Rayon for parallel processing across texts.
#[pyfunction]
fn count_tokens_parallel(texts: Vec<String>) -> PyResult<std::collections::HashMap<String, usize>> {
use rayon::prelude::*;
use std::collections::HashMap;
let counts: HashMap<String, usize> = texts
.par_iter() // Rayon: parallel iterator
.flat_map_iter(|text| text.split_whitespace().map(|s| s.to_string())) // inner iterator is serial, so flat_map_iter (not flat_map)
.fold(
|| HashMap::new(),
|mut map, token| {
*map.entry(token).or_insert(0) += 1;
map
},
)
.reduce(
|| HashMap::new(),
|mut a, b| {
for (k, v) in b {
*a.entry(k).or_insert(0) += v;
}
a
},
);
Ok(counts)
}
/// A Rust struct exposed as a Python class
#[pyclass]
struct FastVectorIndex {
vectors: Vec<Vec<f32>>,
dimension: usize,
}
#[pymethods]
impl FastVectorIndex {
#[new]
fn new(dimension: usize) -> Self {
FastVectorIndex {
vectors: Vec::new(),
dimension,
}
}
fn add(&mut self, vector: Vec<f32>) -> PyResult<()> {
if vector.len() != self.dimension {
return Err(PyValueError::new_err(format!(
"Expected dimension {}, got {}",
self.dimension, vector.len()
)));
}
self.vectors.push(vector);
Ok(())
}
/// Brute-force nearest neighbor search
fn search(&self, query: Vec<f32>, k: usize) -> PyResult<Vec<(usize, f32)>> {
if query.len() != self.dimension {
return Err(PyValueError::new_err("Query dimension mismatch"));
}
let mut distances: Vec<(usize, f32)> = self.vectors
.iter()
.enumerate()
.map(|(i, v)| {
let dist: f32 = v.iter()
.zip(query.iter())
.map(|(a, b)| (a - b).powi(2))
.sum::<f32>()
.sqrt();
(i, dist)
})
.collect();
distances.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
distances.truncate(k);
Ok(distances)
}
fn __len__(&self) -> usize {
self.vectors.len()
}
}
/// Module definition - this is what Python imports
#[pymodule]
fn my_rust_extension(_py: Python, m: &PyModule) -> PyResult<()> {
m.add_function(wrap_pyfunction!(tokenize_batch, m)?)?;
m.add_function(wrap_pyfunction!(count_tokens_parallel, m)?)?;
m.add_class::<FastVectorIndex>()?;
Ok(())
}
Build and use:
# Development: build and install in current virtualenv
maturin develop
# Production: build a wheel
maturin build --release
# With specific Python version
maturin build --release --interpreter python3.11
# Using the Rust extension from Python
import my_rust_extension as mre
# Tokenize a batch of texts - Rust is ~8-12x faster than equivalent Python
texts = ["The quick brown fox", "jumps over the lazy dog"] * 10000
tokens = mre.tokenize_batch(texts)
print(f"Tokenized {len(tokens)} texts")
print(f"First text tokens: {tokens[0]}")
# Count token frequencies using parallel Rayon
import time
corpus = ["machine learning is powerful"] * 100000
t0 = time.perf_counter()
counts = mre.count_tokens_parallel(corpus)
print(f"Rust parallel count: {time.perf_counter() - t0:.3f}s")
# Compare to pure Python
from collections import Counter
t0 = time.perf_counter()
py_counts = Counter(
token for text in corpus for token in text.split()
)
print(f"Python Counter: {time.perf_counter() - t0:.3f}s")
# Vector index
index = mre.FastVectorIndex(dimension=128)
import numpy as np
for i in range(1000):
vec = np.random.randn(128).astype(np.float32).tolist()
index.add(vec)
query = np.random.randn(128).astype(np.float32).tolist()
neighbors = index.search(query, k=5)
print(f"5 nearest neighbors: {neighbors}")
HuggingFace Tokenizers in Rust
The tokenizers library from Hugging Face is the production-grade example of Rust in ML. It implements BPE, WordPiece, SentencePiece, and other tokenization algorithms in Rust, with Python bindings via PyO3.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
import time
# Load a pretrained HuggingFace Rust tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Benchmark: Rust tokenizer vs pure Python equivalent
texts = ["The transformer architecture has revolutionized natural language processing."] * 10000
# Rust-backed tokenizer (HuggingFace tokenizers library)
t0 = time.perf_counter()
encoded = tokenizer(texts, padding=True, truncation=True,
max_length=128, return_tensors="pt")
rust_time = time.perf_counter() - t0
print(f"Rust tokenizer: {rust_time:.3f}s for {len(texts)} texts")
print(f"Throughput: {len(texts)/rust_time:.0f} texts/sec")
# The tokenizer processes ~50,000-100,000 texts/second
# Equivalent pure Python implementation: ~3,000-5,000 texts/second
# The speedup comes from:
# 1. Rust string processing is faster than Python
# 2. Rayon parallel processing within the Rust code
# 3. Efficient conversion from Rust memory into PyTorch tensors with little copying
# Training a custom BPE tokenizer in Rust
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
vocab_size=30000,
special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
# Train on files - Rust reads and processes in parallel
files = ["path/to/corpus.txt"] # list of text files
# tokenizer.train(files, trainer) # trains in seconds, not hours
# Save the trained tokenizer
# tokenizer.save("my_tokenizer.json")
# Load and use
# loaded = Tokenizer.from_file("my_tokenizer.json")
# output = loaded.encode("Hello, world!")
# print(output.tokens) # ['Hello', ',', 'world', '!']
# print(output.ids) # [7592, 1010, 2088, 999]
Memory Safety Tools for C Extensions
When you write Python C extensions (or debug existing ones), you have several tools for catching memory safety bugs.
AddressSanitizer (ASAN) instruments the compiled binary to detect buffer overflows, use-after-free, and double-free. It adds roughly 2x runtime overhead but catches bugs with pinpoint accuracy.
# Build Python with ASAN enabled (for testing C extensions)
# Ubuntu/Debian:
sudo apt-get install python3-dev libasan6
# Compile your C extension with ASAN
gcc -o my_extension.so my_extension.c \
-fsanitize=address \
-fsanitize=undefined \
-fno-omit-frame-pointer \
-g \
-shared -fPIC \
$(python3-config --includes --ldflags)
# Run with ASAN enabled
ASAN_OPTIONS=detect_leaks=1:symbolize=1 python my_script.py
For Python packages with C extensions (NumPy, custom ops):
# Use cpython with ASAN build for comprehensive detection
# Option 1: install python-asan package
# Option 2: build CPython with ASAN
# For production-like testing with ASAN
LD_PRELOAD=$(gcc --print-file-name=libasan.so) \
PYTHONMALLOC=malloc \
ASAN_OPTIONS=detect_leaks=0 \
python -c "import my_c_extension; my_c_extension.run_tests()"
MemorySanitizer (MSAN) detects reads from uninitialized memory - a different class of bug that ASAN does not catch.
// Example C extension code with a bug that MSAN catches
// but ASAN does not
#include <Python.h>
#include <stdlib.h>
static PyObject *buggy_function(PyObject *self, PyObject *args) {
int n;
if (!PyArg_ParseTuple(args, "i", &n)) return NULL;
float *buf = malloc(n * sizeof(float));
// BUG: buf is not initialized - reading it is undefined behavior
// MSAN will catch this; ASAN will not
float sum = 0;
for (int i = 0; i < n; i++) {
sum += buf[i]; // MSAN: use-of-uninitialized-value
}
free(buf);
return PyFloat_FromDouble(sum);
}
// The correct version:
static PyObject *correct_function(PyObject *self, PyObject *args) {
int n;
if (!PyArg_ParseTuple(args, "i", &n)) return NULL;
float *buf = calloc(n, sizeof(float)); // calloc zero-initializes
if (buf == NULL) return PyErr_NoMemory();
float sum = 0;
for (int i = 0; i < n; i++) {
sum += buf[i];
}
free(buf);
return PyFloat_FromDouble(sum);
}
Rust vs Python: Performance Comparison
# Benchmark: Rust string processing vs Python
# Task: count unique words in a 10 MB text corpus
import time
import my_rust_extension as mre # our PyO3 extension
from collections import Counter
# Generate a synthetic corpus
words = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
"machine", "learning", "transformer", "attention", "embedding"]
import random
corpus = [" ".join(random.choices(words, k=100)) for _ in range(100000)]
# Python implementation
def python_word_count(texts):
counts = Counter()
for text in texts:
counts.update(text.split())
return counts
# Benchmark
t0 = time.perf_counter()
py_result = python_word_count(corpus)
py_time = time.perf_counter() - t0
t0 = time.perf_counter()
rs_result = mre.count_tokens_parallel(corpus)
rs_time = time.perf_counter() - t0
print(f"Python: {py_time:.3f}s")
print(f"Rust: {rs_time:.3f}s")
print(f"Speedup: {py_time/rs_time:.1f}x")
print(f"Results match: {dict(py_result) == rs_result}")
# Expected: Rust is 5-15x faster depending on CPU core count
# Rust uses Rayon for automatic work-stealing parallelism
# Python's GIL prevents true parallel CPU use in CPython
When to Rewrite Python in Rust
Not everything should be rewritten in Rust. The decision framework:
Write in Rust when:
- The function is called millions of times per request/second
- It processes text, bytes, or numeric data in a tight loop
- It needs true parallelism (not just async I/O)
- It is a long-running service where memory safety bugs would be catastrophic
- The Python version's performance is provably the bottleneck (measured, not guessed)
Keep in Python when:
- The function is called infrequently
- Its performance is dominated by external I/O (network, database)
- It orchestrates other calls rather than processing data
- The team has no Rust expertise and learning cost exceeds the benefit
The pattern that works in production: identify the 3-5 hottest Python functions via profiling, rewrite them in Rust with PyO3, keep everything else in Python. This gives you 80% of the performance benefit with 20% of the rewrite cost.
# Typical candidates for Rust rewrite in ML pipelines:
# 1. Tokenization (already done by HuggingFace tokenizers)
# 2. Feature hashing and encoding
# 3. Data preprocessing pipelines (per-sample transforms)
# 4. Custom samplers (stratified, weighted, curriculum)
# 5. Metrics computation on large tensors before GPU transfer
# Example: wrapping the HuggingFace Rust tokenizer
from tokenizers import Tokenizer
import torch
class FastTokenizingDataset:
"""Dataset that tokenizes on-the-fly using the Rust tokenizer."""
def __init__(self, texts, tokenizer_path: str, max_length: int = 512):
self.texts = texts
self.tokenizer = Tokenizer.from_file(tokenizer_path)
self.tokenizer.enable_padding(length=max_length)
self.tokenizer.enable_truncation(max_length=max_length)
def __len__(self):
return len(self.texts)
def __getitem__(self, idx):
encoding = self.tokenizer.encode(self.texts[idx])
return {
'input_ids': torch.tensor(encoding.ids, dtype=torch.long),
'attention_mask': torch.tensor(encoding.attention_mask, dtype=torch.long),
}
def encode_batch(self, texts):
"""Encode a batch of texts in parallel using Rust's Rayon."""
encodings = self.tokenizer.encode_batch(texts)
input_ids = torch.tensor([e.ids for e in encodings], dtype=torch.long)
attention_masks = torch.tensor(
[e.attention_mask for e in encodings], dtype=torch.long
)
return input_ids, attention_masks
The Rust Memory Safety Model
```mermaid
flowchart TD
    OWN["Ownership Rule<br/>Every value has exactly one owner"]:::green
    BORROW["Borrowing Rule<br/>One mutable OR many immutable refs"]:::blue
    LIFE["Lifetime Rule<br/>References cannot outlive values"]:::teal
    BOF["Buffer Overflow<br/>PREVENTED: bounds-checked slices"]:::red
    UAF["Use-After-Free<br/>PREVENTED: borrow checker"]:::red
    DF["Double-Free<br/>PREVENTED: single owner drops once"]:::red
    DR["Data Race<br/>PREVENTED: Send + Sync traits"]:::red
    OWN --> DF
    OWN --> DR
    BORROW --> DR
    BORROW --> UAF
    LIFE --> UAF
    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
    classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
    classDef red fill:#fee2e2,color:#7f1d1d,stroke:#dc2626
```
Production Engineering Notes
Building and Distributing Rust Python Extensions
Maturin produces platform-specific wheels. For distribution, you need to build for each target platform separately.
# Build for current platform
maturin build --release
# Build for multiple Python versions
maturin build --release --interpreter python3.9 python3.10 python3.11 python3.12
# Build manylinux wheels (compatible with most Linux distros)
# Requires Docker
docker run --rm -v $(pwd):/io ghcr.io/pyo3/maturin build --release
# Build macOS universal binary (Intel + Apple Silicon)
maturin build --release --target universal2-apple-darwin
# Upload to PyPI
maturin publish
For CI/CD, use GitHub Actions with the maturin action:
# .github/workflows/release.yml
name: Release
on:
push:
tags: ["v*"]
jobs:
linux:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: PyO3/maturin-action@v1
with:
command: build
args: --release --out dist
- uses: actions/upload-artifact@v4
with:
name: wheels-linux
path: dist
Rust in candle: A Pure Rust ML Framework
Hugging Face's candle framework is worth understanding even if you never use it directly, because it represents the direction of ML inference infrastructure.
// candle example: running a small transformer
// This is conceptual - actual API differs slightly
use candle_core::Tensor;
use candle_nn::{linear, Linear, Module};
struct SimpleTransformer {
embed: Linear,
output: Linear,
}
impl SimpleTransformer {
fn new(vocab_size: usize, d_model: usize, n_classes: usize,
vb: candle_nn::VarBuilder) -> candle_core::Result<Self> {
Ok(Self {
embed: linear(vocab_size, d_model, vb.pp("embed"))?,
output: linear(d_model, n_classes, vb.pp("output"))?,
})
}
fn forward(&self, input_ids: &Tensor) -> candle_core::Result<Tensor> {
let embeddings = self.embed.forward(input_ids)?;
let mean = embeddings.mean(1)?;
self.output.forward(&mean)
}
}
The key value proposition of candle: zero Python overhead in the hot path, WASM support (run models in the browser), and deployment without the PyTorch C++ runtime dependency.
Common Mistakes
Writing unsafe Rust without understanding why: Rust has an unsafe keyword that disables the borrow checker for a block of code. Unsafe is sometimes necessary (FFI, certain lock-free data structures, interfacing with hardware). But every unsafe block is a place where the Rust memory safety guarantees do not apply. In PyO3 extensions, avoid unsafe unless absolutely necessary - PyO3 provides safe abstractions for nearly everything. An unsafe block written carelessly is no safer than C.
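As a small illustration (an assumed example, not from the original), the safe and unsafe forms of the same read look like this - the unsafe one is sound only because the index is provably in range:

```rust
fn main() {
    let v = vec![10, 20, 30];

    // Safe, bounds-checked access:
    assert_eq!(v[1], 20);

    // The opt-out: get_unchecked skips the bounds check. Sound here only
    // because 1 < v.len() is known; with a wrong index this is exactly the
    // C-style undefined behavior that safe Rust otherwise prevents.
    let x = unsafe { *v.get_unchecked(1) };
    assert_eq!(x, 20);
}
```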
Forgetting to run ASAN on your C extensions: Most Python C extension bugs are never caught in testing because the bug is a use-after-free or buffer overwrite that happens to not crash on your development machine. The affected memory gets reallocated before anyone checks it. ASAN catches these bugs with certainty. Run your C extension test suite under ASAN before every release. A CI job that runs with ASAN is worth more than 1000 unit tests for memory safety.
Calling .clone() on large strings in Rust hot paths: Rust's clone() is explicit and therefore often overused. In a loop processing millions of strings, clone() on each string allocates a new heap buffer. Use string slices (&str) instead of owned Strings when the data does not need to outlive the function call. In PyO3 functions, prefer accepting &str where possible instead of String.
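A sketch of the borrowed-slice pattern (an assumed example): the tokens below are &str views into the original buffer, so the loop allocates nothing per token.

```rust
// Borrowed slices: each token is a view into `text` - no per-token allocation.
fn total_token_len(tokens: &[&str]) -> usize {
    tokens.iter().map(|t| t.len()).sum()
}

fn main() {
    let text = String::from("the quick brown fox");
    let tokens: Vec<&str> = text.split_whitespace().collect();
    assert_eq!(total_token_len(&tokens), 16); // 3 + 5 + 5 + 3

    // The clone-heavy version builds Vec<String> instead, paying one heap
    // allocation per token:
    let owned: Vec<String> = text.split_whitespace().map(|s| s.to_string()).collect();
    assert_eq!(owned.len(), 4);
}
```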
Mishandling the GIL in PyO3 extensions: PyO3 holds the Python GIL whenever your function interacts with Python objects. Pure-Rust computation over already-extracted Rust data (including Rayon parallelism) does not need the GIL - but while your function holds it, no other Python thread can run. For long computations, release it explicitly with py.allow_threads(|| { ... }) around the pure-Rust section, and never touch Python objects from Rayon worker threads, which do not hold the GIL.
Interview Q&A
Q: Explain the four classes of memory safety bugs and how Rust prevents each at compile time.
A: The four classes are buffer overflow, use-after-free, double-free, and data races.
Buffer overflow: writing past the end of an allocated buffer. In C, int arr[10]; arr[15] = 5; compiles and corrupts adjacent memory at runtime. Rust prevents this because array and slice indexing is bounds-checked: an out-of-range index panics with a clear error message rather than silently corrupting memory (and a constant out-of-range index into a fixed-size array is rejected at compile time). You can opt out of bounds checking with get_unchecked() in unsafe code, but safe code is protected by default.
Use-after-free: accessing memory after it has been freed. In C, this is trivially possible with a raw pointer. In Rust, the borrow checker enforces that no reference can outlive the value it references. The compiler tracks the lifetime of every reference and rejects any code where a reference might be accessed after the owner has dropped (and freed) the value. This is a compile-time guarantee - there is no runtime check.
Double-free: calling free() twice on the same pointer. In Rust, values have exactly one owner, and when the owner drops, the value's destructor runs exactly once. You cannot call drop() manually twice on the same value - the first call moves the value (invalidating the binding), and the second would be a compile error. The one-owner rule makes double-free impossible in safe code.
Data race: two threads simultaneously accessing shared mutable state without synchronization. Rust's Send and Sync marker traits encode thread safety at the type level. A type is Send if it can be transferred to another thread; Sync if it can be accessed from multiple threads simultaneously. Raw pointers are neither. The borrow checker ensures that mutable references cannot be shared across thread boundaries unless the type explicitly implements Sync (which requires that sharing is safe, usually via a Mutex or RwLock).
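A compile-time sketch of those marker traits (an assumed example): a generic bound turns thread-safety into a type check.

```rust
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// A function with Send + Sync bounds: passing a type that is not
// thread-safe is a compile error, not a runtime failure.
fn assert_thread_safe<T: Send + Sync>(_value: &T) {}

fn main() {
    let shared = Arc::new(Mutex::new(0));
    assert_thread_safe(&shared); // Arc<Mutex<i32>> is Send + Sync

    let local = Rc::new(0); // non-atomic refcount: neither Send nor Sync
    // assert_thread_safe(&local); // COMPILE ERROR if uncommented
    assert_eq!(*local, 0);
}
```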
Q: How does PyO3 handle memory management at the Python-Rust boundary?
A: The Python-Rust boundary has two different memory management models: Python uses reference counting with a cyclic GC, and Rust uses ownership with deterministic drop semantics.
PyO3 bridges these through the Py<T> smart pointer - an owned, reference-counted handle to a Python-managed object (PyObject is an alias for Py<PyAny>). When Rust code receives a Python object, PyO3 wraps it in a type that increments the Python reference count. When this wrapper drops, it decrements the reference count. This ensures Python objects are not freed while Rust is using them.
For Rust objects exposed to Python via #[pyclass], PyO3 wraps the Rust struct in a Python object. The Rust struct is heap-allocated and owned by Python's reference counting system. When the Python reference count reaches zero, PyO3 runs the Rust struct's Drop implementation. This is deterministic when plain reference counting frees the object, but Python's cyclic GC may delay the drop if the object is part of a reference cycle.
The GIL (Global Interpreter Lock) is relevant here: PyO3 acquires the GIL for any interaction with Python objects. Code that only works with Rust data can release the GIL via py.allow_threads(), enabling true parallelism with Rayon. For compute-heavy functions that accept Python data and return Python data, the pattern is: acquire GIL, extract Rust data from Python objects, release GIL, do parallel Rust computation, acquire GIL, convert Rust results back to Python objects.
Q: What is the difference between AddressSanitizer and MemorySanitizer, and when do you use each?
A: AddressSanitizer (ASAN) detects spatial and temporal memory safety violations: buffer overflows (out-of-bounds reads and writes), use-after-free (accessing freed memory), and double-free. It works by inserting "shadow memory" alongside every allocation that tracks the validity of each byte. Accessing a byte with invalid shadow memory triggers a report.
MemorySanitizer (MSAN) detects reads from uninitialized memory. It works by tracking, for every byte of memory, whether that byte has been initialized. Reading an uninitialized byte (even if it happens to have a value that produces a correct result) is flagged.
The bugs they catch are different. ASAN catches spatial safety bugs (you went out of bounds) and temporal safety bugs (you used memory after freeing it). MSAN catches a semantic bug: your code made a decision based on uninitialized data, which means the program's behavior is technically undefined even if the uninitialized byte happened to be zero.
In practice, use both. Run ASAN for routine testing - it catches the most common and most dangerous bugs. Run MSAN when debugging subtle data corruption issues or when reviewing new C extension code for the first time. MSAN requires a fully instrumented build (all dependencies compiled with MSAN), which makes it harder to use in practice with complex projects like Python itself.
For Python C extensions specifically: ASAN is the most practical starting point. Build your extension with -fsanitize=address -fsanitize=undefined, run your test suite, and fix whatever it reports. The undefined behavior sanitizer (UBSan, invoked with the same -fsanitize=undefined flag) additionally catches integer overflow, null pointer dereference, and other C undefined behaviors that neither ASAN nor MSAN cover.
Q: Why did Hugging Face rewrite tokenizers in Rust, and what were the specific benefits?
A: There were three independent motivations: performance, memory safety, and parallelism.
Performance: Python string processing is slow because every operation goes through the interpreter's dynamic dispatch and allocates intermediate objects. A BPE tokenizer inner loop performs millions of string comparisons and dictionary lookups per second. In C this would be fast; in Python it was a bottleneck for large batches. The Rust rewrite achieved throughput of 100,000+ sentences per second on a single core, roughly 10x faster than the Python equivalent.
Memory safety: the original Python tokenizers called into C code for performance-critical parts. That C code had the usual memory safety risks. The production incident described in this lesson's opening (use-after-free in a multithreaded context) was a direct consequence. Rust's borrow checker makes the entire class of use-after-free and data race bugs impossible. After the Rust rewrite, zero memory safety incidents in five years of production use.
Parallelism: Python's GIL prevents true CPU parallelism in Python code. Tokenizing a batch of 1000 sentences in Python runs serially (or requires multiprocessing with its overhead). The Rust tokenizers library uses Rayon for automatic work-stealing parallelism: a 1000-sentence batch is processed on all available CPU cores simultaneously. PyO3 releases the GIL during the Rayon parallel section, so Python gets true parallelism without multiprocessing overhead.
The practical result: a model's data loading and tokenization step went from a bottleneck to essentially free. Engineers who previously needed 8 DataLoader workers to keep a single GPU busy needed 2-3 after the Rust tokenizer.
Q: When is it NOT worth rewriting Python code in Rust?
A: The decision to rewrite in Rust has both a technical dimension and an organizational one. Both must favor rewriting for it to be worth it.
Technically, it is not worth rewriting when: the code is I/O-bound (waiting for network, disk, database) - Rust is faster for CPU work, not for waiting. Async Python with asyncio or an async library can achieve the same concurrency for I/O-bound work without a rewrite. It is not worth rewriting if the function runs rarely (hundreds of calls per day, not millions per second). It is not worth rewriting if the bottleneck has not been measured - premature optimization in Rust is worse than premature optimization in Python because the cost of developing and maintaining the extension is higher.
Organizationally, it is not worth rewriting when the team has no Rust expertise and the cost of learning Rust exceeds the performance benefit. The learning curve for Rust is steep - the borrow checker rejects code that experienced programmers would write in C++ or Go without a second thought. A team that writes Python-shaped Rust (cloning everything to appease the borrow checker) gets neither the performance nor the safety benefits.
The rule of thumb for ML teams: profile first, identify the top 3-5 functions by total CPU time across production traffic, and evaluate each individually. Tokenization, data serialization, and custom numeric transforms are strong candidates. Model training and inference are almost always already running in C++/CUDA via PyTorch - no Python to rewrite there. Serving orchestration, feature joining, and logging are usually I/O-bound and should use async patterns instead.
Q: Explain Rust's ownership model to someone coming from Python. What is the most common conceptual difficulty?
A: In Python, variables are names bound to objects. Multiple variables can reference the same object simultaneously, and the reference count determines when the object is freed. You never think about who "owns" an object - everyone can have a reference, and the GC handles cleanup.
In Rust, every value has exactly one owner - one variable or data structure that is responsible for that value. When the owner goes out of scope, the value is freed. Assignment does not create a reference; it transfers ownership. After let s2 = s1;, s1 is gone - it transferred ownership to s2. This is called a "move".
For types that are cheap to copy (integers, booleans), Rust copies the value instead of moving it. For heap-allocated types (String, Vec, Box), the default is a move, not a copy. You can ask Rust to copy (clone) explicitly with .clone(), but it makes the copy cost visible.
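A compact illustration of copy, move, and explicit clone (the helper name clone_is_explicit is made up for this sketch):

```rust
// Returns true if an explicit clone produced an equal, independent String.
fn clone_is_explicit() -> bool {
    let s1 = String::from("tokenizer");
    let s2 = s1; // move: `s1` is no longer usable after this line
    // println!("{s1}"); // compile error: borrow of moved value `s1`
    let s3 = s2.clone(); // explicit, visible copy of the heap data
    s2 == s3 // both remain usable; the clone cost was opted into
}

fn main() {
    // Copy types: assignment duplicates the value, both stay usable.
    let a = 5;
    let b = a;
    assert_eq!(a + b, 10);

    assert!(clone_is_explicit());
}
```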
Borrowing is how you give a function access to a value without transferring ownership. &val is an immutable borrow (read-only reference). &mut val is a mutable borrow (read-write reference, but exclusive - no other borrows can exist simultaneously). Borrowing is temporary: the borrow ends when the reference goes out of scope.
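Both borrow kinds in one sketch (function names are illustrative; note the caller keeps ownership throughout):

```rust
// Immutable borrow: read-only access, no ownership transfer.
fn longest_token(tokens: &[String]) -> &str {
    tokens
        .iter()
        .map(|t| t.as_str())
        .max_by_key(|t| t.len())
        .unwrap_or("")
}

// Mutable borrow: exclusive read-write access for the duration of the call.
fn push_token(tokens: &mut Vec<String>, t: &str) {
    tokens.push(t.to_string());
}

fn main() {
    let mut tokens = vec![String::from("hello"), String::from("world")];
    push_token(&mut tokens, "tokenization"); // mutable borrow ends here
    assert_eq!(longest_token(&tokens), "tokenization"); // immutable borrow
    assert_eq!(tokens.len(), 3); // the caller still owns `tokens`
}
```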
The most common conceptual difficulty for Python programmers is the conflict between "I want multiple parts of my program to share access to this data" (a common Python pattern) and the Rust rule that mutable access must be exclusive. In Python, two objects can both hold a reference to the same list and both append to it - Python handles the concurrency issue by being single-threaded (GIL). In Rust, you must choose an explicit sharing mechanism: Rc<RefCell<T>> for single-threaded sharing with runtime checks, or Arc<Mutex<T>> for multi-threaded sharing with locking. The compiler forces you to be explicit about sharing, which feels restrictive at first but prevents entire categories of bugs.
