
C and C++ for ML Systems

The Night the Training Job Never Returned

It is 2 AM and your distributed training job is frozen. No Python traceback. No error message. The process is alive - htop shows 100% CPU on all cores - but nothing is progressing. You attach a debugger and discover the hang is deep inside the PyTorch C++ ATen library, inside a custom CUDA kernel you registered three weeks ago. The Python stack trace ends at a C extension boundary and below that it is all assembly and raw memory addresses.

Some version of this scenario plays out every week on serious ML teams. PyTorch's core is roughly 1.2 million lines of C++. TensorFlow's kernels are C++. ONNX Runtime is C++. TensorRT is C++. cuDNN is C. The Python you write every day is a thin veneer over a massive C++ engine, and the moment something goes wrong at scale - memory corruption, race conditions, undefined behavior - you need to read and reason about that C++ code.

The gap between ML engineers who can debug at this level and those who cannot is enormous in practice. The former can diagnose a segfault in a custom op, understand why torch.compile is generating suboptimal code, write a pybind11 extension that avoids unnecessary data copies, and profile memory allocation patterns at the C layer. The latter file a GitHub issue and wait.

This lesson builds the C++ literacy an ML systems engineer needs. Not to write production C++ from scratch - that is a separate career - but to read PyTorch internals, write a competent pybind11 extension, understand what RAII means and why it matters for GPU handles, and use sanitizers to find the bugs that only manifest at 3 AM on a 256-GPU cluster.

The path forward is intentional: start with the C fundamentals that map directly to things you already know (arrays, memory, functions), layer in the C++ features that ML frameworks actually use (templates, smart pointers, lambdas, move semantics), connect it all to real PyTorch patterns, and finish with the debugging and tooling workflow that separates engineers who can work at this level from those who cannot.

Why This Exists

Python exists because C exists. Every time you call tensor.matmul(other) in PyTorch, Python resolves the attribute, dispatches through the Python/C++ bridge (pybind11), crosses into C++ land, hits the ATen dispatcher which looks up the correct kernel for your device and dtype, and finally executes a BLAS or CUDA routine written in C or C++. That whole path - from your one-line Python to actual floating-point arithmetic on hardware - traverses several layers of C++ before a single multiply happens.

The reason ML frameworks are written in C++ rather than Python comes down to three hard constraints. First, performance: Python's interpreter overhead (frame creation, reference counting, global interpreter lock) is prohibitive for tight numerical loops. A Python loop over 1 million floats is roughly 100 times slower than the equivalent C loop. Second, hardware access: GPU kernels, BLAS libraries, and hardware intrinsics are written in C or C++, and calling them efficiently requires minimal overhead at the boundary. Third, portability: C++ compiles to native code on every platform (x86, ARM, RISC-V, mobile), while Python always requires a runtime.

Before C++ frameworks, researchers wrote numerical code in Fortran (BLAS, LAPACK) or in MATLAB/NumPy, which themselves call Fortran or C. The problem was that extending these systems - adding a new layer type, a new optimizer, a custom CUDA kernel - required writing Fortran or C with no ergonomics, no classes, no generics, and extremely painful debugging. C++ with modern features (RAII, templates, smart pointers) gave framework authors a language that is fast enough to talk to hardware directly and expressive enough to build a large, maintainable codebase.

The specific innovation that locked in this architecture was the Python binding layer - first SWIG and Cython, and now pybind11: a header-only library that makes it ergonomic to expose C++ classes and functions to Python, with NumPy array interop, automatic reference counting, and exception translation. pybind11 is what PyTorch uses for its Python bindings. Once that bridge existed, the architecture became: write the fast stuff in C++, expose it to Python, let researchers iterate in Python. That architecture has not changed.

Historical Context

C was created by Dennis Ritchie at Bell Labs between 1969 and 1973, primarily to rewrite the Unix operating system (which had been written in assembly) in a portable higher-level language. The language was deliberately minimal - it gave you direct control over memory and hardware with almost no runtime overhead. The price was safety: C has no bounds checking, no garbage collection, and no type safety beyond basic integers and pointers.

Bjarne Stroustrup at Bell Labs extended C starting in 1979, initially calling it "C with Classes." The goal was to add object-oriented programming (classes, inheritance, virtual functions) while preserving C's performance characteristics. The language became C++ in 1983 and got its first ISO standard in 1998 (C++98). The modern era of C++ started with C++11, which added move semantics, lambda functions, auto type deduction, std::thread, std::atomic, smart pointers, and range-based for loops. C++14, C++17, and C++20 added further improvements: structured bindings and if constexpr (C++17), and concepts and coroutines (C++20).

PyTorch's C++ core, ATen (A Tensor library), was designed from the ground up to be a clean C++ tensor library that the Python frontend wraps. The dispatch mechanism in ATen - where a Python call like tensor.relu() gets routed to the correct CPU/CUDA/quantized/sparse kernel - is built on C++ dispatch tables registered at startup. Understanding that dispatch mechanism is directly useful for anyone writing custom PyTorch operators.

The performance-critical path in ML has never moved away from C and C++. Even frameworks like JAX (Python-first by design) compile Python to XLA HLO, which itself is a C++ IR that compiles to hardware-specific code. The language at the bottom is always C or C++.

Core Concepts

Pointers and Memory: The Mental Model

A pointer is a variable that holds the address of another variable in memory. This is the single most important concept in C/C++ for an ML engineer, because it is what makes zero-copy data sharing possible - the reason that PyTorch tensors can share underlying storage, that NumPy arrays can be wrapped without copying, and that pybind11 can expose C++ objects to Python without duplication.

Stack                       Heap
+------------------+        +------------------------------------+
| ptr  = 0x4000    | -----> | float data[1024*1024*3]            |
| size = 3145728   |        | 0x4000: [0.1, 0.2, 0.3 ...]        |
+------------------+        +------------------------------------+

The pointer ptr lives on the stack and is 8 bytes (on a 64-bit system). The actual image data lives on the heap and is 12 MB. You can pass ptr by value to 100 functions without copying 12 MB. This is the essence of what PyTorch's Tensor does: the Python object holds a reference to a Storage which holds a pointer to the actual data buffer.

// C: raw pointer arithmetic
float* data = (float*)malloc(1024 * sizeof(float));
data[0] = 1.0f;        // equivalent to *(data + 0) = 1.0f
data[1023] = 2.0f;     // last valid index
// data[1024] = 3.0f;  // undefined behavior - out of bounds

// Pointer arithmetic: data + i is the address of element i
for (int i = 0; i < 1024; i++) {
    *(data + i) = (float)i; // same as data[i] = i
}

free(data); // must free what malloc allocated

The critical insight: malloc and free are manual. Forget to call free and you have a memory leak. Call free twice and you have undefined behavior (likely a crash). Call free and then use the pointer and you have a use-after-free bug (security vulnerability and crash). These three errors are responsible for a huge fraction of C program bugs.
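To see all three failure modes in one place (a deliberately broken sketch, not code to ship):

// WRONG on purpose - each line illustrates one classic malloc/free bug
float* leak = (float*)malloc(1024 * sizeof(float));
// ... never call free(leak): memory leak

float* p = (float*)malloc(16);
free(p);
free(p);     // double free: undefined behavior, often a heap-corruption crash
p[0] = 1.0f; // use-after-free: may silently corrupt allocator state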

C Structs and Arrays as Tensors

In C, which has no classes, structs are how you group related data. A simplified tensor descriptor looks like:

typedef struct {
    float* data;    // pointer to raw float data
    int* shape;     // array of dimension sizes
    int* strides;   // array of strides (in elements)
    int ndim;       // number of dimensions
    int numel;      // total element count
} SimpleTensor;

SimpleTensor make_tensor(int* shape, int ndim) {
    SimpleTensor t;
    t.ndim = ndim;
    t.shape = (int*)malloc(ndim * sizeof(int));
    t.strides = (int*)malloc(ndim * sizeof(int));

    t.numel = 1;
    for (int i = 0; i < ndim; i++) {
        t.shape[i] = shape[i];
        t.numel *= shape[i];
    }

    // Compute row-major (C-contiguous) strides
    t.strides[ndim - 1] = 1;
    for (int i = ndim - 2; i >= 0; i--) {
        t.strides[i] = t.strides[i + 1] * t.shape[i + 1];
    }

    t.data = (float*)malloc(t.numel * sizeof(float));
    return t;
}

// Access element at [i, j] in a 2D tensor
float get_2d(SimpleTensor* t, int i, int j) {
    return t->data[i * t->strides[0] + j * t->strides[1]];
}

This is almost exactly how PyTorch's at::Tensor works internally, with more type safety and many more features. The strides mechanism (which enables non-contiguous views, transpose without copy, and slicing) is a pure C array trick.
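To make the stride trick concrete, here is a small sketch that reuses SimpleTensor and get_2d from above: swapping the shape and stride entries transposes a 2D tensor with zero copies.

// Transpose a 2D SimpleTensor without copying: only metadata changes
void transpose_2d(SimpleTensor* t) {
    int tmp = t->shape[0];
    t->shape[0] = t->shape[1];
    t->shape[1] = tmp;

    tmp = t->strides[0];
    t->strides[0] = t->strides[1];
    t->strides[1] = tmp;
}
// After transpose_2d(&t), get_2d(&t, i, j) returns what get_2d(&t, j, i) did before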

C++ Classes: Encapsulation for Safety

C++ classes let you attach constructors and destructors to structs, which is the foundation of RAII (described below). They also let you enforce invariants:

#include <vector>
#include <stdexcept>

class Tensor {
public:
    // Constructor: allocates storage
    Tensor(std::vector<int> shape) : shape_(shape) {
        numel_ = 1;
        for (int s : shape_) numel_ *= s;
        data_.resize(numel_, 0.0f);
    }

    // Destructor: std::vector handles memory automatically
    ~Tensor() = default;

    float& at(int i) {
        if (i < 0 || i >= numel_)
            throw std::out_of_range("index out of bounds");
        return data_[i];
    }

    int numel() const { return numel_; }
    const std::vector<int>& shape() const { return shape_; }

private:
    std::vector<float> data_;
    std::vector<int> shape_;
    int numel_;
};

std::vector is the key difference from the C version: it owns its memory and frees it automatically when the Tensor object is destroyed. No malloc, no free, no memory leaks.

RAII: The Most Important C++ Pattern

RAII stands for Resource Acquisition Is Initialization. The idea: tie the lifetime of a resource (memory, file handle, GPU handle, mutex lock) to the lifetime of a C++ object. When the object is constructed, acquire the resource. When the object is destroyed (goes out of scope), release the resource. Because C++ guarantees destructors run when objects go out of scope - even if an exception is thrown - this makes resource management automatic and exception-safe.

This pattern is pervasive in ML systems code:

#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// RAII wrapper for a CUDA stream
class CudaStream {
public:
    CudaStream() {
        cudaError_t err = cudaStreamCreate(&stream_);
        if (err != cudaSuccess) {
            throw std::runtime_error(
                std::string("cudaStreamCreate failed: ") +
                cudaGetErrorString(err));
        }
    }

    // Always called - even if an exception is thrown later
    ~CudaStream() {
        if (stream_) cudaStreamDestroy(stream_); // guard: moved-from objects hold nullptr
    }

    // Prevent copying (would double-free the stream)
    CudaStream(const CudaStream&) = delete;
    CudaStream& operator=(const CudaStream&) = delete;

    // Allow moving
    CudaStream(CudaStream&& other) noexcept
        : stream_(other.stream_) {
        other.stream_ = nullptr;
    }

    cudaStream_t get() const { return stream_; }

private:
    cudaStream_t stream_ = nullptr;
};

// Usage:
void run_inference(float* input, float* output, int batch_size) {
    CudaStream stream; // stream created here
    // ... launch kernels on stream.get() ...
    cudaStreamSynchronize(stream.get());
    // stream destroyed here automatically - even if exception thrown
}

Without RAII, every error path would need an explicit cudaStreamDestroy call. With RAII, you write it once in the destructor and it is guaranteed to run.
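For contrast, a sketch of the manual version (launch_ok is a hypothetical stand-in for whatever launch-status check your code would actually make):

// Without RAII: every exit path must remember to destroy the stream
void run_inference_manual(float* input, float* output, int batch_size) {
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return;

    bool launch_ok = true; // hypothetical result of launching kernels
    if (!launch_ok) {
        cudaStreamDestroy(stream); // easy to forget on new error paths
        return;
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}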

Smart Pointers

C++ smart pointers are RAII wrappers around raw pointers. They eliminate most manual memory management:

#include <memory>
#include <vector>

// std::unique_ptr: sole ownership, zero overhead vs raw pointer
std::unique_ptr<float[]> alloc_buffer(int n) {
    return std::make_unique<float[]>(n);
    // No need to call delete[] - unique_ptr does it automatically
}

// std::shared_ptr: shared ownership via reference counting
// Used in PyTorch for Tensor storage (multiple tensors can share one storage)
struct TensorStorage {
    std::vector<float> data;
    explicit TensorStorage(int n) : data(n, 0.0f) {}
};

class SharedTensor {
public:
    SharedTensor(int numel)
        : storage_(std::make_shared<TensorStorage>(numel)),
          offset_(0), numel_(numel) {}

    // Create a view (slice) - shares the same storage
    SharedTensor slice(int start, int length) const {
        SharedTensor view = *this;
        view.offset_ = offset_ + start;
        view.numel_ = length;
        // storage_ ref-count incremented - no copy of data
        return view;
    }

    float& at(int i) { return storage_->data[offset_ + i]; }
    int numel() const { return numel_; }

private:
    std::shared_ptr<TensorStorage> storage_;
    int offset_;
    int numel_;
};

// std::weak_ptr: non-owning reference, breaks cycles
// Used to avoid circular references in computation graphs

The PyTorch at::Tensor uses essentially this pattern - a reference-counted storage pointer (in PyTorch's case its own intrusive_ptr rather than std::shared_ptr) plus an offset and metadata. Multiple Python tensor objects can refer to overlapping slices of the same contiguous buffer with zero copies.
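A quick usage sketch of the class above, showing that a view writes into the same buffer:

SharedTensor t(8);
SharedTensor v = t.slice(2, 4); // no data copy; storage ref-count goes 1 -> 2
v.at(0) = 42.0f;                // writes into t's storage
// t.at(2) now reads 42.0f - view and original share one buffer
// the buffer is freed only when both t and v are destroyed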

Templates: Generic Programming for Type Dispatch

Templates allow writing code that works for multiple types without runtime overhead. ML frameworks use them extensively to write one kernel implementation that handles float32, float16, bfloat16, int8, etc.:

#include <cmath>
#include <type_traits>

// Generic elementwise ReLU - works for any floating-point type
template <typename T>
void relu_inplace(T* data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = data[i] > T(0) ? data[i] : T(0);
    }
}

// Template specialization for float (can use SIMD here)
template <>
void relu_inplace<float>(float* data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = data[i] > 0.0f ? data[i] : 0.0f;
    }
}

// if constexpr (C++17): compile-time conditional - no runtime cost
template <typename T>
T safe_sqrt(T x) {
    if constexpr (std::is_floating_point_v<T>) {
        return x >= T(0) ? std::sqrt(x) : T(0);
    } else {
        // integer types: integer sqrt
        return static_cast<T>(std::sqrt(static_cast<double>(x)));
    }
}

// Usage - each call generates a separate optimized version
relu_inplace<float>(float_data, n);
relu_inplace<double>(double_data, n);

In PyTorch source, the macro AT_DISPATCH_FLOATING_TYPES expands to a template instantiation for each supported floating-point type based on a runtime dtype tag. The runtime tag is checked once, then the correct template instantiation runs at full speed with no further branching.
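A minimal sketch of the same idea (not PyTorch's actual macro): branch on the runtime tag once, then hand off to the right instantiation of relu_inplace from above.

enum class DType { Float32, Float64 };

// One runtime branch on the dtype tag, then full-speed templated code
void relu_dispatch(void* data, int n, DType dtype) {
    switch (dtype) {
        case DType::Float32:
            relu_inplace<float>(static_cast<float*>(data), n);
            break;
        case DType::Float64:
            relu_inplace<double>(static_cast<double*>(data), n);
            break;
    }
}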

Move Semantics: Avoiding Unnecessary Copies

Move semantics (C++11) let you transfer ownership of resources from one object to another instead of copying them. This is critical for performance when returning large objects from functions:

#include <vector>
#include <string>
#include <utility>

class ModelWeights {
public:
    std::vector<float> data;

    explicit ModelWeights(int n) : data(n, 0.0f) {}

    // Move constructor: steals data from other (no copy)
    ModelWeights(ModelWeights&& other) noexcept
        : data(std::move(other.data)) {}

    // Move assignment
    ModelWeights& operator=(ModelWeights&& other) noexcept {
        data = std::move(other.data);
        return *this;
    }
};

// Return by value - move semantics make this zero-copy
ModelWeights load_weights(const std::string& path) {
    ModelWeights w(1024 * 1024);
    // ... fill w.data from file ...
    return w; // moved (NRVO or explicit move), not copied
}

// Without move semantics (C++03), this would copy 4MB
// With move semantics, it is O(1): just pointer swap
ModelWeights weights = load_weights("model.bin");

Lambda Functions and std::function

Lambdas are anonymous functions that can capture variables from their enclosing scope. They are used heavily in PyTorch for callbacks, custom operations, and the dispatch mechanism:

#include <functional>
#include <vector>
#include <algorithm>
#include <iostream>
#include <cmath>   // for std::tanh

// Basic lambda
auto relu = [](float x) { return x > 0.0f ? x : 0.0f; };

// Lambda capturing by reference
float scale = 2.0f;
auto scale_fn = [&scale](float x) { return x * scale; };
scale = 3.0f;                // scale_fn sees updated value
std::cout << scale_fn(1.0f); // prints 3.0

// std::function: type-erased callable (useful for callbacks)
std::function<float(float)> activation = relu;
activation = [](float x) { return std::tanh(x); }; // can reassign

// Using lambda with STL algorithms
std::vector<float> activations = {-1.0f, 0.5f, -0.3f, 2.0f};
std::transform(activations.begin(), activations.end(),
               activations.begin(),
               [](float x) { return x > 0.0f ? x : 0.0f; });

In PyTorch, the operator dispatch table stores std::function-like objects (actually KernelFunction wrappers) keyed by dispatch key (CPU, CUDA, XLA, etc.). When you call tensor.relu(), the runtime looks up the function object for your device and calls it.
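A toy version of that table - type-erased kernels keyed by device - looks roughly like this (a sketch; the real KernelFunction and DispatchKey machinery is far more involved):

#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

using Kernel = std::function<void(float*, int)>;

// One table per operator, keyed by device name
std::map<std::string, Kernel>& relu_table() {
    static std::map<std::string, Kernel> table;
    return table;
}

void register_relu(const std::string& device, Kernel k) {
    relu_table()[device] = std::move(k);
}

void call_relu(const std::string& device, float* data, int n) {
    auto it = relu_table().find(device);
    if (it == relu_table().end())
        throw std::runtime_error("no relu kernel registered for " + device);
    it->second(data, n); // one lookup, then the stored callable runs
}

// register_relu("cpu", [](float* d, int n) { /* CPU loop */ });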

std::vector vs Raw Arrays

#include <vector>
#include <array>

// Raw C array: fixed size, stack if small enough
float weights[1024];                     // stack allocated
float* big_array = new float[1024*1024]; // heap, must delete[] manually
// ...
delete[] big_array;                      // delete[], never free()

// std::vector: dynamic size, heap allocated, RAII
std::vector<float> v(1024, 0.0f); // 1024 elements initialized to 0
v.push_back(1.0f);                // grows automatically
v.resize(2048);                   // resize (may reallocate)
float* raw = v.data();            // get raw pointer for C APIs

// std::array: fixed size, stack, no heap, with STL interface
std::array<float, 4> shape = {1, 3, 224, 224}; // compile-time size

// Performance note: v.data() pointer is stable after reserve()
v.reserve(4096); // pre-allocate (note: may reallocate, invalidating raw above)
// After reserve: push_back does not reallocate until 4096 elements

For ML extensions, std::vector<float> is the right choice for variable-length buffers. Raw arrays are still used when the size is known at compile time or when you need to pass a pointer to a C API (use .data()).

Code Examples

Calling C from Python with ctypes

The simplest way to call compiled C code from Python - no build system required for simple cases:

// fast_ops.c
// Compile: gcc -O3 -shared -fPIC -o fast_ops.so fast_ops.c -lm
#include <stdlib.h>
#include <math.h>

// Elementwise ReLU on a float array
void relu_forward(float* data, int n) {
    for (int i = 0; i < n; i++) {
        if (data[i] < 0.0f) data[i] = 0.0f;
    }
}

// Layer normalization (simplified, 1D)
void layer_norm(float* data, float* out, int n, float eps) {
    float mean = 0.0f;
    for (int i = 0; i < n; i++) mean += data[i];
    mean /= n;

    float var = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = data[i] - mean;
        var += diff * diff;
    }
    var = var / n + eps;
    float inv_std = 1.0f / sqrtf(var);

    for (int i = 0; i < n; i++) {
        out[i] = (data[i] - mean) * inv_std;
    }
}

# calling_c_from_python.py
import ctypes
import numpy as np

lib = ctypes.CDLL("./fast_ops.so")

# Declare function signatures - critical for correctness
lib.relu_forward.argtypes = [
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_int,
]
lib.relu_forward.restype = None

lib.layer_norm.argtypes = [
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
    ctypes.c_int,
    ctypes.c_float,
]
lib.layer_norm.restype = None

# Get ctypes pointer from NumPy - zero copy
x = np.array([-1.0, 2.0, -0.5, 3.0, -2.0], dtype=np.float32)
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
lib.relu_forward(ptr, len(x))
print("After ReLU:", x)  # [0. 2. 0. 3. 0.]

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
out = np.zeros_like(data)
data_ptr = data.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
out_ptr = out.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
lib.layer_norm(data_ptr, out_ptr, len(data), 1e-5)
print("LayerNorm:", out)  # approx [-1.41, -0.71, 0.0, 0.71, 1.41]

pybind11 Module with NumPy Interop

pybind11 is what PyTorch uses. It handles type conversion and reference counting automatically:

// ml_ops.cpp
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cmath>
#include <stdexcept>

namespace py = pybind11;

// ReLU - modifies array in place
void relu_inplace(py::array_t<float> arr) {
py::buffer_info buf = arr.request();
if (buf.ndim != 1)
throw std::runtime_error("Expected 1D array");

float* data = static_cast<float*>(buf.ptr);
int n = buf.shape[0];
for (int i = 0; i < n; i++) {
if (data[i] < 0.0f) data[i] = 0.0f;
}
}

// Softmax - returns a new array
py::array_t<float> softmax(py::array_t<float> logits) {
py::buffer_info buf = logits.request();
if (buf.ndim != 1)
throw std::runtime_error("Expected 1D array");

int n = buf.shape[0];
float* src = static_cast<float*>(buf.ptr);

auto result = py::array_t<float>(n);
py::buffer_info out_buf = result.request();
float* dst = static_cast<float*>(out_buf.ptr);

// Numerically stable: subtract max before exp
float max_val = src[0];
for (int i = 1; i < n; i++) max_val = std::max(max_val, src[i]);

float sum = 0.0f;
for (int i = 0; i < n; i++) {
dst[i] = std::exp(src[i] - max_val);
sum += dst[i];
}
for (int i = 0; i < n; i++) dst[i] /= sum;

return result;
}

PYBIND11_MODULE(ml_ops, m) {
m.doc() = "Fast ML operations via pybind11";
m.def("relu_inplace", &relu_inplace, "In-place ReLU",
py::arg("arr"));
m.def("softmax", &softmax, "Numerically stable softmax",
py::arg("logits"));
}

CMakeLists.txt for an ML Extension

# CMakeLists.txt
cmake_minimum_required(VERSION 3.18)
project(ml_ops LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(Python3 REQUIRED COMPONENTS Interpreter Development NumPy)
find_package(pybind11 REQUIRED)
find_package(CUDAToolkit QUIET)

pybind11_add_module(ml_ops ml_ops.cpp)

target_include_directories(ml_ops PRIVATE
    ${Python3_NumPy_INCLUDE_DIRS}
)

target_compile_options(ml_ops PRIVATE
    -O3 -march=native -ffast-math -Wall -Wextra
)

if(CUDAToolkit_FOUND)
    enable_language(CUDA)
    target_sources(ml_ops PRIVATE ml_ops_cuda.cu)
    target_compile_options(ml_ops PRIVATE
        "$<$<COMPILE_LANGUAGE:CUDA>:--use_fast_math;-O3;-arch=sm_80>"
    )
    target_link_libraries(ml_ops PRIVATE CUDA::cudart)
endif()

# Build:
#   mkdir build && cd build && cmake .. && make -j$(nproc)
#   cp ml_ops*.so ..
#   python3 -c "import ml_ops; print('OK')"

RAII Wrapper for a File Descriptor

// raii_fd.hpp
#pragma once
#include <unistd.h>
#include <fcntl.h>
#include <stdexcept>
#include <string>

class FileDescriptor {
public:
    explicit FileDescriptor(const std::string& path,
                            int flags = O_RDONLY)
        : fd_(open(path.c_str(), flags, 0644)) {
        if (fd_ < 0)
            throw std::runtime_error("open() failed: " + path);
    }

    ~FileDescriptor() {
        if (fd_ >= 0) close(fd_);
    }

    FileDescriptor(const FileDescriptor&) = delete;
    FileDescriptor& operator=(const FileDescriptor&) = delete;

    FileDescriptor(FileDescriptor&& other) noexcept
        : fd_(other.fd_) { other.fd_ = -1; }

    int get() const { return fd_; }
    bool valid() const { return fd_ >= 0; }

private:
    int fd_;
};

// Usage example:
// void save_checkpoint(const std::string& path,
//                      const void* data, size_t size) {
//     FileDescriptor fd(path, O_WRONLY | O_CREAT | O_TRUNC);
//     write(fd.get(), data, size);
//     // fd closed automatically here, even on early return or exception
// }

Sanitizers: Finding Bugs Before Production

# AddressSanitizer - finds heap/stack overflows, use-after-free
g++ -g -fsanitize=address,undefined \
    -fno-omit-frame-pointer -O1 \
    ml_ops.cpp -o ml_ops_asan

# ThreadSanitizer - finds data races
g++ -g -fsanitize=thread -O1 ml_ops.cpp -o ml_ops_tsan

# Run normally - errors appear at runtime
./ml_ops_asan

# Example ASAN report for out-of-bounds write:
# ==12345==ERROR: AddressSanitizer: heap-buffer-overflow
#   WRITE of size 4 at 0x... thread T0
#     #0 relu_inplace ml_ops.cpp:15
#     #1 main ml_ops.cpp:42
#   Heap block allocated at:
#     #0 operator new[] ml_ops.cpp:8

GDB Debugging Commands for C Extensions

# Start GDB with a Python program that imports your C extension
gdb --args python3 test_ml_ops.py

# Inside GDB:
(gdb) break ml_ops.cpp:relu_inplace # breakpoint in C++ code
(gdb) run # run until breakpoint
(gdb) info locals # show local variables
(gdb) print data[0] # print first element
(gdb) print *data@10 # print first 10 elements
(gdb) backtrace # show call stack
(gdb) thread apply all backtrace # all threads (find deadlocks)
(gdb) up # go up one stack frame
(gdb) next # step over
(gdb) step # step into
(gdb) watch data[0] # stop when data[0] changes
(gdb) quit

# Attach to a running process (frozen training job)
gdb python3 $(pgrep -f train.py)
(gdb) backtrace # see where it is hung
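# If the libpython GDB helpers that ship with CPython are loaded
# (commonly via a python3-dbg / debuginfo package), you can also
# recover the Python-level stack from the same session:
(gdb) py-bt # Python backtrace of the current thread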

PyTorch Operator Registration Pattern

This is the simplest pattern for exposing a custom op written against ATen to Python; full integration with PyTorch's dispatcher uses the TORCH_LIBRARY and TORCH_LIBRARY_IMPL macros (see Q8 below), but most extensions start here:

// custom_op.cpp
#include <torch/extension.h>
#include <ATen/ATen.h>
#include <cmath> // for std::tanh

at::Tensor fast_gelu_cpu(const at::Tensor& input) {
    TORCH_CHECK(input.is_contiguous(), "Input must be contiguous");
    TORCH_CHECK(input.dtype() == at::kFloat, "Only float32 supported");

    auto output = at::empty_like(input);
    const float* in_ptr = input.data_ptr<float>();
    float* out_ptr = output.data_ptr<float>();
    int64_t n = input.numel(); // int64_t: tensor sizes can exceed int range

    for (int64_t i = 0; i < n; i++) {
        float x = in_ptr[i];
        float x3 = x * x * x;
        // tanh approximation of GELU; 0.7978845608 = sqrt(2/pi)
        float inner = 0.7978845608f * (x + 0.044715f * x3);
        out_ptr[i] = 0.5f * x * (1.0f + std::tanh(inner));
    }

    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("fast_gelu", &fast_gelu_cpu, "Fast GELU (CPU)");
}

# setup.py
from torch.utils.cpp_extension import BuildExtension, CppExtension
from setuptools import setup

setup(
    name="custom_ops",
    ext_modules=[
        CppExtension(
            "custom_ops",
            ["custom_op.cpp"],
            extra_compile_args=["-O3", "-march=native"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)

# Build: python setup.py build_ext --inplace
# Use:   import custom_ops; out = custom_ops.fast_gelu(tensor)

Production Engineering Notes

Memory Layout Matters as Much as Algorithm

The cache behavior of your data access pattern often dominates runtime. A matrix multiply that walks memory sequentially (row-major order) can easily be 10x faster than a naive implementation that strides column-by-column, even with the same number of FLOPs. In C++, std::vector<float> stores its elements contiguously. std::vector<std::vector<float>> stores each row as a separate heap allocation - catastrophic for cache performance and completely wrong for passing to BLAS.

For ML: always use a single flat buffer with computed strides. The PyTorch storage model (one contiguous buffer + stride metadata) exists precisely because it lets you represent any view (transpose, slice, broadcast) without copying data.
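In code, the difference looks like this (a small sketch with illustrative sizes):

#include <vector>

int rows = 512, cols = 512;

// One flat allocation, row-major: cache-friendly and directly usable by BLAS
std::vector<float> m(rows * cols);
float x = m[3 * cols + 7]; // element [3][7] via computed index

// One heap allocation per row: scattered in memory, pointer-chasing on access
std::vector<std::vector<float>> bad(rows, std::vector<float>(cols));
float y = bad[3][7]; // two dereferences, poor locality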

Avoid std::function for Hot Paths

std::function has overhead from type erasure: it typically heap-allocates larger callables (small ones may fit in an internal buffer), and the indirect call cannot be inlined by the compiler. In tight loops (per-element operations), prefer templates or direct function calls, as in the sketch below. std::function is fine for callbacks invoked once per batch, but wrong for per-element dispatch.
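The usual hot-path alternative is to take the callable as a template parameter, which the compiler can inline (a sketch):

// F is deduced at the call site; the lambda body inlines into the loop
template <typename F>
void map_inplace(float* data, int n, F f) {
    for (int i = 0; i < n; i++) data[i] = f(data[i]);
}

// No type erasure, no heap allocation, fully inlinable:
// map_inplace(data, n, [](float x) { return x > 0.0f ? x : 0.0f; });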

Alignment and SIMD

SIMD (Single Instruction Multiple Data) instructions like AVX2 operate on 256-bit (8-float) chunks and perform best on 32-byte-aligned data (the aligned load/store forms require it). malloc guarantees only 16-byte alignment on most platforms. Use std::aligned_alloc when you plan to use SIMD, noting that its size argument must be a multiple of the alignment. PyTorch's allocator uses 64-byte alignment to support 512-bit AVX-512.

// Aligned allocation: std::aligned_alloc requires the size to be a
// multiple of the alignment, so round up to the next 64-byte boundary
size_t bytes = ((n * sizeof(float) + 63) / 64) * 64;
float* data = static_cast<float*>(std::aligned_alloc(64, bytes));
// ... use data ...
std::free(data); // use free, not delete[]

Static vs Dynamic Linking for Deployment

When building a pybind11 extension for deployment (Docker, serverless, etc.), static linking of dependencies reduces the "works on my machine" problem. Link libstdc++ statically (-static-libstdc++) unless you are certain the deployment environment has a compatible version. Always check with ldd ml_ops.so what shared libraries are required.

C++17 and C++20 Features Worth Knowing

if constexpr (C++17) evaluates a branch at compile time - the not-taken branch is not instantiated, so it does not need to be valid code for the current template parameters. Structured bindings (for (auto& [key, val] : map)) make iterating maps readable. Concepts (C++20) let you constrain template parameters (template <typename T> requires std::floating_point<T>) so compiler errors are readable rather than 100-line template errors. Ranges (C++20) make STL algorithms composable. PyTorch requires C++17 as of 2024, and C++20 is starting to appear in newer additions to the codebase.
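A compact sketch showing two of these features together (illustrative code, not from PyTorch):

#include <concepts>
#include <map>
#include <string>

// C++20 concept: constrain T and get a readable error if violated
template <std::floating_point T>
T half(T x) { return x / T(2); }

// C++17 structured bindings: unpack key/value pairs while iterating
void dump_dims(const std::map<std::string, int>& dims) {
    for (const auto& [name, size] : dims) {
        // name and size bind to each entry's key and value
        (void)name; (void)size;
    }
}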

Common Mistakes

:::danger Accessing Freed Memory (Use-After-Free)

This is the most common cause of mysterious crashes in C extensions. The symptom is a segfault or corrupted output that only appears on certain inputs or batch sizes.

// WRONG: returning a pointer to a local (stack) variable
float* compute_activations(int n) {
    float local_buffer[1024]; // lives on the stack
    for (int i = 0; i < n; i++) local_buffer[i] = i * 0.1f;
    return local_buffer; // WRONG: stack frame gone after return
}

// RIGHT: use unique_ptr - caller owns it
std::unique_ptr<float[]> compute_activations(int n) {
    auto buf = std::make_unique<float[]>(n);
    for (int i = 0; i < n; i++) buf[i] = i * 0.1f;
    return buf; // moved to caller
}

ASAN catches this immediately at runtime. Always build and test with -fsanitize=address.

:::

:::danger Integer Overflow in Size Calculations

This causes silent memory corruption - one of the hardest bugs to diagnose.

// WRONG: int overflow for large tensors
int n_rows = 100000;
int n_cols = 100000;
int total = n_rows * n_cols; // overflows: 10^10 > 2^31-1
float* data = new float[total]; // allocates wrong amount

// RIGHT: cast before multiplying
size_t total = static_cast<size_t>(n_rows) * n_cols;
float* data = new float[total];

In PyTorch C++ code you will see int64_t used consistently for tensor dimensions precisely to avoid this.

:::

:::warning Mixing new/delete with malloc/free

// WRONG: mixing allocators is undefined behavior
float* p = (float*)malloc(100 * sizeof(float));
delete[] p; // undefined behavior

float* q = new float[100];
free(q); // undefined behavior

// RIGHT: match your allocator
float* p = (float*)malloc(100 * sizeof(float));
free(p);

// BEST: use std::vector or unique_ptr and never call free/delete manually

:::

:::warning TSAN False Positives with PyTorch Internals

TSAN can report false positives when used with PyTorch's internal thread pool (ATen's parallel dispatch), because the pool uses lock-free data structures that TSAN does not fully understand. Suppress known false positives with a tsan_suppressions.txt file:

# tsan_suppressions.txt
race:at::internal::*
race:caffe2::ThreadPool*

# Run with the suppressions file:
TSAN_OPTIONS="suppressions=tsan_suppressions.txt" python3 test.py

Only suppress patterns you have confirmed are false positives.

:::

Interview Questions and Answers

Q1: Why is PyTorch's core written in C++ rather than Python, and what does pybind11 do?

PyTorch's core is C++ because Python's interpreter overhead is prohibitive for tight numerical loops, and hardware (CUDA, AVX, BLAS) is accessed via C APIs. pybind11 is a header-only library that exposes C++ classes and functions to Python: it handles type conversion (Python int to C++ int64_t, NumPy array to C++ pointer), reference counting (so Python's GC and C++ destructors cooperate), and exception translation (C++ exceptions become Python exceptions). The Python torch.Tensor object is a pybind11-wrapped C++ at::Tensor object.

Q2: What is RAII and why does it matter for ML systems?

RAII (Resource Acquisition Is Initialization) ties resource lifetime to object lifetime. The resource is acquired in the constructor and released in the destructor. C++ guarantees destructors run when objects go out of scope, even when exceptions are thrown. For ML systems this matters for GPU handles (CUDA streams, cuBLAS handles, cuDNN handles), file descriptors for dataset reading, network sockets for distributed training, mutex locks, and memory-mapped file handles. Without RAII, every error path requires explicit cleanup code, and it is easy to miss one during refactoring, causing resource leaks or hangs on distributed training jobs.

Q3: What is the difference between std::unique_ptr and std::shared_ptr, and when does PyTorch use each?

std::unique_ptr represents sole ownership - exactly one pointer to the object, which is destroyed when the unique_ptr goes out of scope. Zero overhead versus a raw pointer. std::shared_ptr represents shared ownership via reference counting - multiple shared_ptrs can point to the same object, which is destroyed when the last shared_ptr is destroyed. PyTorch uses shared-ownership reference counting for tensor storage (via its own intrusive_ptr to StorageImpl) because multiple Python tensor objects can share one storage buffer (after slicing, transposing, etc.). It uses unique_ptr internally wherever an object has exactly one owner.

Q4: What are C++ templates and how does PyTorch use them for dtype dispatch?

Templates are a compile-time code generation mechanism: you write one function parameterized by a type (T) and the compiler generates a separate, fully optimized version for each concrete type (float, double, int8_t, etc.) you use. PyTorch's AT_DISPATCH_FLOATING_TYPES(tensor.scalar_type(), "relu", [&]() { relu_kernel<scalar_t>(data, n); }) macro checks the runtime dtype once, then dispatches to the correct template instantiation. The resulting code is as efficient as if you had written a separate relu_kernel_float, relu_kernel_double, etc.

Q5: How would you use ASAN to debug a segfault in a PyTorch custom op?

Build the extension with -fsanitize=address -g -O1. Set ASAN_OPTIONS=abort_on_error=1 so the process stops immediately. Run the Python script normally. ASAN instruments every memory access and prints a detailed error report showing the exact line of the out-of-bounds access, the allocation site, and a full stack trace. Common findings: off-by-one in loop bounds (i <= n instead of i < n), incorrect stride computation, or accessing a buffer after it was moved into a unique_ptr.

Q6: What is move semantics and why does it matter for returning tensors from C++ functions?

Move semantics (C++11) let you transfer ownership of a resource instead of copying it. When a function returns a std::vector<float> by value, instead of copying all the data (O(n)), the compiler uses the move constructor which transfers the internal pointer (O(1)). For a 100MB weight tensor, this is the difference between a fast return and a catastrophic copy. PyTorch's at::Tensor is designed to be cheaply movable: the Python object holds a shared_ptr to the storage, and moving the tensor transfers that pointer without copying data. Always return at::Tensor by value - the move constructor handles it efficiently.

Q7: What does if constexpr do and how is it different from a regular if?

if constexpr (C++17) is evaluated at compile time. The branch that is not taken is not instantiated - it does not even need to be valid code for the current template parameters. A regular if is evaluated at runtime: both branches must compile, and the condition check has a small runtime cost. In templates, if constexpr is essential for handling different types differently without writing separate template specializations. For example, if constexpr (std::is_same_v<T, float>) can call a float-specific SIMD intrinsic in one branch and a generic fallback in the other, with no overhead for the non-taken path.

Q8: What is the ATen dispatch table in PyTorch and how are operators registered?

ATen (A Tensor library) maintains a dispatch table that maps (operator_name, dispatch_key) to a KernelFunction. The dispatch key encodes device (CPU, CUDA, XLA) and any transforms (Autograd, Batched, etc.). When you call tensor.relu(), PyTorch resolves the dispatch keys for the tensor's backend, looks up the appropriate KernelFunction in the table, and calls it. Custom ops are registered at load time using TORCH_LIBRARY and TORCH_LIBRARY_IMPL macros (or pybind11 wrappers). The dispatch happens before any kernel code runs, adding roughly 1-2 microseconds of overhead per Python-level op call - negligible for large tensors, but significant for many small ops in a loop.
