SIMD and Vectorization
The Compiler Flag That Made the Model 8x Faster
A startup was running a recommendation model in production. Inference was serving about 4,000 requests per second on a single c5.2xlarge instance (Intel Skylake). The infrastructure bill was growing fast. Before scaling horizontally, the ML lead decided to profile the inference path properly.
The model was mostly NumPy operations - dot products, softmax, a few custom post-processing loops written in pure Python and Cython. Nothing exotic. The Cython-compiled extension had been written carefully, types annotated, boundscheck disabled. But one thing was missing: the compiler flag -march=native had never been added to the setup.py build configuration. The default Cython compilation target was x86-64 baseline - which means no AVX, no AVX2, no FMA. The CPU was executing SSE2 instructions (128-bit, 2 doubles per operation) instead of AVX2 (256-bit, 4 doubles per operation) and missing FMA (Fused Multiply-Add, which combines a multiply and add into a single instruction with no extra latency).
Adding -march=native -O3 -ffast-math to the Cython build flags, recompiling, and redeploying took 45 minutes. Throughput jumped from 4,000 to 31,000 requests per second - a 7.75x increase. The hardware did not change. The algorithm did not change. The data did not change. The only thing that changed was whether the compiler was allowed to use the vector instructions the CPU had supported for 8 years.
This story is not unusual. It happens constantly across the ML infrastructure world because most engineers do not fully understand what SIMD is, which operations benefit from it, and how to ensure their code takes advantage of it. This lesson closes that gap.
Why This Exists - The Scalar Throughput Wall
By the mid-1990s, CPU clock speeds were scaling rapidly but single-instruction throughput was hitting a wall. A floating-point addition took 3-5 cycles. A multiplication took 4-6 cycles. Pipelining helped but the fundamental issue was that each instruction operated on one number at a time. To double throughput, you doubled clock speed - which meant doubling power consumption and heat.
The insight from the graphics community was that 3D graphics require exactly the same operation applied to thousands of vertices simultaneously. If the instruction ADD R1, R2 adds one float, why not have an instruction that adds 4 floats simultaneously, using the same decode/execute hardware with just a wider data path? The hardware cost of widening the data path from 32-bit to 128-bit is modest compared to adding entirely new execution units.
This is the essence of SIMD: Single Instruction, Multiple Data. One instruction, one decode, one control path - but operating on a vector of data elements simultaneously. The same energy and silicon area that previously processed one float now processes 4, 8, or 16 floats.
Historical Context - From MMX to AVX-512
Intel's first SIMD extension, MMX (1996, Pentium MMX), operated on 64-bit registers partitioned into 8-bit or 16-bit integers. It was designed for multimedia - the M in MMX. Awkward to use, no floating-point support, and it aliased the x87 FPU registers (requiring an EMMS instruction to switch between MMX and FPU code).
SSE (Streaming SIMD Extensions, 1999, Pentium III) added 8 new 128-bit XMM registers and floating-point SIMD operations. SSE2 (2001, Pentium 4) extended this to 64-bit doubles and integer operations - for the first time, ML and scientific computing code could efficiently vectorize double-precision arithmetic. SSE4 (2007) added integer operations useful for media codecs.
AVX (Advanced Vector Extensions, 2011, Sandy Bridge) doubled the register width to 256-bit YMM registers, enabling 8 single-precision or 4 double-precision floats per instruction. AVX2 (2013, Haswell) added integer AVX operations and, critically, FMA (Fused Multiply-Add) instructions. FMA computes a*b + c in a single instruction with a single rounding - this is the fundamental operation of matrix multiplication and neural network linear layers.
AVX-512 (2017, Skylake-X) doubled again to 512-bit ZMM registers: 16 floats or 8 doubles per instruction. AVX-512 also added masked operations (apply instruction only to selected elements), scatter/gather (load/store from non-contiguous addresses), and new reduction instructions.
ARM NEON (2005, ARMv7) is ARM's equivalent: 128-bit SIMD with 16 128-bit Q registers, supporting float32 and integer operations. Apple Silicon (M1, M2, M3) extends this with custom additions optimized for ML workloads. AWS Graviton3 CPUs support SVE (Scalable Vector Extension), whose architecture allows vector widths up to 2048 bits (Graviton3 implements 256-bit vectors).
SIMD Register Width and Throughput
The relationship between register width and throughput is direct. An AVX2 instruction operating on 256-bit registers processes:
- 8 x float32 (single precision, 4 bytes each)
- 4 x float64 (double precision, 8 bytes each)
- 32 x int8 (byte-sized integers)
In one clock cycle (on a fully pipelined FPU), this means:
For a modern Intel CPU with AVX2:
- 4 doubles per AVX2 FMA operation
- FMA = 2 FLOPS (one multiply + one add)
- 2 FMA execution ports
- Peak: 4 doubles x 2 FLOPS x 2 ports = 16 double-precision FLOPS per cycle per core
For AVX-512:
- 8 doubles per FMA x 2 FLOPS x 2 ports = 32 double-precision FLOPS per cycle per core
This is why advertised CPU FLOPS numbers (like "1.6 TFLOPS" for a modern Xeon) are 16x to 32x higher than what you would estimate from clock speed alone (one FLOP per cycle per core). They assume 100% AVX-512 FMA utilization, which requires perfectly vectorized, dependency-free code.
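To make the arithmetic concrete, here is a minimal sketch that estimates theoretical peak GFLOPS from clock speed, SIMD width, and FMA port count. The clock speed, core count, and port count below are illustrative assumptions - substitute the values for your own CPU.
def peak_gflops(clock_ghz, cores, simd_width_bits, bytes_per_element,
                fma_ports=2, flops_per_fma=2):
    """Estimate theoretical peak GFLOPS for a fully vectorized FMA workload."""
    elements_per_register = simd_width_bits // (bytes_per_element * 8)
    flops_per_cycle_per_core = elements_per_register * fma_ports * flops_per_fma
    return clock_ghz * cores * flops_per_cycle_per_core

# Illustrative numbers only: assume an 8-core, 3.0 GHz CPU with 2 FMA ports
print("AVX2   float32:", peak_gflops(3.0, 8, 256, 4), "GFLOPS")  # 8 elems * 2 ports * 2 FLOPS per cycle per core
print("AVX2   float64:", peak_gflops(3.0, 8, 256, 8), "GFLOPS")  # 4 elems * 2 ports * 2 FLOPS per cycle per core
print("AVX512 float32:", peak_gflops(3.0, 8, 512, 4), "GFLOPS")  # 16 elems * 2 ports * 2 FLOPS per cycle per core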
Auto-Vectorization - When the Compiler Does It For You
Modern compilers (GCC, Clang, MSVC with optimization flags) can automatically convert scalar loops into SIMD instructions. This is called auto-vectorization. It is the easiest way to get SIMD performance without writing intrinsics.
Conditions for auto-vectorization (all must hold):
- The loop trip count is computable before the loop begins (it need not be a compile-time constant)
- No data dependencies between iterations (iteration N cannot depend on result of iteration N-1)
- Memory access is contiguous or regular stride (random scatter/gather defeats most auto-vectorizers)
- No aliasing (the compiler must know that input and output arrays do not overlap)
- Simple loop body (function calls, complex conditionals, or indirect accesses often prevent vectorization)
// Auto-vectorizable: sequential access, no loop-carried dependency, no aliasing
void add_arrays(float* restrict out, const float* restrict a, const float* restrict b, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];  // maps to vmovups + vaddps + vmovups (AVX)
    }
}

// NOT auto-vectorizable: loop-carried dependency
void prefix_sum(float* arr, int n) {
    for (int i = 1; i < n; i++) {
        arr[i] += arr[i-1];  // arr[i] depends on arr[i-1] - serial dependency
    }
}

// NOT auto-vectorizable: indirect memory access (gather)
void gather_sum(float* out, const float* arr, const int* indices, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = arr[indices[i]];  // irregular access pattern
    }
}
The restrict keyword in C is critical: it tells the compiler that the pointer does not alias any other pointer in scope, enabling aggressive vectorization. Without restrict, the compiler must assume that out and a might overlap, which prevents many SIMD transforms.
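restrict is a C-level promise, but the same question - do these buffers overlap? - comes up whenever NumPy arrays are handed to a compiled kernel. As a small sketch, NumPy's own np.shares_memory (exact check) and np.may_share_memory (fast conservative check) can be used to verify the no-aliasing assumption before calling into an extension:
import numpy as np

a = np.zeros(16, dtype=np.float32)
b = a[4:12]                        # a view into the same buffer - these alias
c = np.zeros(8, dtype=np.float32)  # a separate allocation - no aliasing

# np.shares_memory performs an exact overlap check;
# np.may_share_memory is a faster, conservative bounds check.
print(np.shares_memory(a, b))  # True  - unsafe to pass as "restrict"-style arguments
print(np.shares_memory(a, c))  # False - safe to treat as non-aliasing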
NumPy and BLAS - Why NumPy Is Fast
When you call np.dot(A, B) or A @ B in NumPy for matrices larger than a certain threshold (typically 16x16), NumPy delegates to BLAS (Basic Linear Algebra Subprograms) - a library interface with highly optimized implementations for each CPU architecture.
The most common BLAS implementations:
- OpenBLAS: Open-source, supports AVX/AVX2/AVX-512 and ARM NEON. Default in many NumPy distributions.
- Intel MKL (Math Kernel Library): Intel's proprietary BLAS, highly optimized for Intel CPUs. Available in Intel Distribution of Python and Anaconda.
- BLIS: Another open-source BLAS that is modular and easily portable.
These libraries do not just call AVX intrinsics once. They use a carefully tuned blocked algorithm:
- Divide matrices into tiles that fit L1, L2, and L3 cache simultaneously
- Pack the tiles into special memory layouts that maximize cache line utilization
- Use handwritten AVX-512 FMA instructions in the innermost kernel (the "microkernel")
- Handle non-aligned edges and non-square matrices with scalar fallback code
The result is code that achieves 85-95% of theoretical peak FLOPS. This is why np.dot on a modern machine approaches advertised CPU throughput.
import numpy as np
import time

def benchmark_matmul():
    """
    Compare scalar vs BLAS matrix multiplication.
    BLAS uses AVX-512 FMA internally (if available).
    """
    N = 2048
    A = np.random.rand(N, N).astype(np.float64)
    B = np.random.rand(N, N).astype(np.float64)

    # Warmup
    for _ in range(3):
        _ = A @ B

    # BLAS matmul (np.dot uses DGEMM under the hood)
    start = time.perf_counter()
    runs = 10
    for _ in range(runs):
        C = A @ B
    blas_time = (time.perf_counter() - start) / runs

    # Theoretical FLOPS: 2 * N^3 (N^3 multiplies + N^3 adds)
    flops = 2 * N**3
    achieved_gflops = flops / blas_time / 1e9

    print(f"Matrix size: {N}x{N}")
    print(f"BLAS time: {blas_time*1000:.2f} ms")
    print(f"Achieved: {achieved_gflops:.1f} GFLOPS")
    print("(Compare to CPU's advertised peak GFLOPS)")

    # Check which BLAS is being used
    np.show_config()  # Shows BLAS/LAPACK info

benchmark_matmul()
How PyTorch Uses AVX2 and AVX-512
PyTorch's ATen library (the tensor computation engine) has backend code paths for multiple SIMD levels:
- at::cpu_kernel - baseline scalar path
- at::cpu_kernel with the VECTORIZE macro - auto-vectorized via Eigen / vec.h
- Handwritten AVX2 kernels in aten/src/ATen/native/cpu/
- AVX-512 kernels where available
PyTorch dispatches to the widest available SIMD level at startup (via CPU feature detection using CPUID). The dispatch happens through function pointer tables - the correct kernel for your CPU is selected once at load time and called on every subsequent operation.
For element-wise operations (relu, sigmoid, tanh, add, mul), PyTorch uses vectorized loops that process 8 float32 values per iteration on AVX2, or 16 on AVX-512. For matrix operations (linear layers, convolutions), it falls through to BLAS/LAPACK or uses oneDNN (Intel's optimized library, formerly MKL-DNN).
import torch
import time

def check_torch_simd():
    """
    Verify PyTorch is using vectorized operations.
    Compare element-wise ops vs manual Python loop.
    """
    print("CPU capabilities:", torch.__config__.show())

    N = 10_000_000
    x = torch.randn(N, dtype=torch.float32)

    # PyTorch ReLU - uses vectorized comparison + mask
    start = time.perf_counter()
    for _ in range(100):
        y = torch.relu(x)
    torch_time = (time.perf_counter() - start) / 100

    # NumPy equivalent
    import numpy as np
    xn = x.numpy()
    start = time.perf_counter()
    for _ in range(100):
        yn = np.maximum(xn, 0)
    numpy_time = (time.perf_counter() - start) / 100

    print(f"PyTorch relu: {torch_time*1000:.3f} ms")
    print(f"NumPy max: {numpy_time*1000:.3f} ms")
    # Both should be similar - both use SIMD under the hood

def benchmark_vectorized_ops():
    """
    Show that vectorized ops scale with AVX width.
    Float32 should be ~2x faster than float64 (2x more per vector register).
    """
    N = 100_000_000
    x32 = torch.randn(N, dtype=torch.float32)
    x64 = torch.randn(N, dtype=torch.float64)

    start = time.perf_counter()
    _ = torch.sum(x32)
    f32_time = time.perf_counter() - start

    start = time.perf_counter()
    _ = torch.sum(x64)
    f64_time = time.perf_counter() - start

    print(f"float32 sum: {f32_time*1000:.3f} ms")
    print(f"float64 sum: {f64_time*1000:.3f} ms")
    print(f"f32 speedup: {f64_time / f32_time:.2f}x")
    # Expected: ~2x speedup for float32 (fits 2x more per AVX register)

check_torch_simd()
benchmark_vectorized_ops()
Aligned vs Unaligned Memory
AVX instructions come in two flavors: aligned (vmovaps - vector move aligned packed single) and unaligned (vmovups - vector move unaligned packed single). On modern CPUs (Nehalem and later), the performance difference is small (1-2 cycles) for aligned vs unaligned loads. But on older CPUs and across page boundaries, misaligned loads can cause performance penalties or (in rare cases with some intrinsics) exceptions.
Memory alignment means the starting address is a multiple of the vector width:
- SSE (128-bit): align to 16 bytes
- AVX (256-bit): align to 32 bytes
- AVX-512 (512-bit): align to 64 bytes
Freshly allocated NumPy arrays are typically at least 16-byte aligned (and often 64-byte aligned in practice), though strict 64-byte alignment is not guaranteed; PyTorch's CPU allocator aligns tensor storage generously as well. Problems arise when:
- You create a view with a non-zero offset that breaks alignment
- You use Python's ctypes or struct to pass data without alignment guarantees
- You slice an array starting at an odd element
import numpy as np

def check_alignment():
    """
    Check memory alignment of NumPy arrays.
    Fresh allocations are usually well aligned (often to 64 bytes),
    but views with offsets can break that alignment.
    """
    # Fresh allocation - usually aligned in practice
    arr = np.zeros(100, dtype=np.float64)
    addr = arr.ctypes.data
    print(f"Fresh array address: {hex(addr)}")
    print(f"Aligned to 64 bytes: {addr % 64 == 0}")

    # View starting at element 1 - may break alignment
    arr_view = arr[1:]
    addr_view = arr_view.ctypes.data
    print(f"View[1:] address: {hex(addr_view)}")
    print(f"Aligned to 64 bytes: {addr_view % 64 == 0}")
    # arr_view starts 8 bytes after arr, breaking 64-byte alignment

    # Check if array is contiguous
    print(f"Is C-contiguous: {arr.flags['C_CONTIGUOUS']}")

    # A slice of a contiguous array is still contiguous, so force a fresh
    # copy to get a newly allocated (and typically realigned) buffer
    arr_aligned = arr_view.copy()
    addr_aligned = arr_aligned.ctypes.data
    print(f"Re-allocated aligned: {addr_aligned % 64 == 0}")

check_alignment()
Gather and Scatter Operations
Standard SIMD operates on contiguous memory (load N consecutive elements into a vector register). But many ML operations need to load elements at non-contiguous addresses - embedding lookups, sparse operations, gather operations for attention mechanisms.
AVX-512 (and AVX2 partially) provides gather instructions (vgatherdps, vgatherdpd) that load N elements from N arbitrary addresses in a single instruction. This is slower than a contiguous load (because the memory controller cannot coalesce these accesses) but faster than N separate scalar loads.
import numpy as np
import time

def compare_gather_vs_sequential():
    """
    Compare sequential (SIMD-optimal) vs gather (irregular) access.
    Shows why embedding lookups in recommendation models are slow.
    """
    N = 1_000_000
    embedding_dim = 64
    table_rows = 100_000

    # Embedding table
    embeddings = np.random.rand(table_rows, embedding_dim).astype(np.float32)

    # Sequential access (hypothetical - walking rows 0,1,2,... and wrapping)
    # This is what a matrix multiply looks like - cache-friendly
    indices_sequential = np.arange(N) % table_rows

    # Random access (realistic recommendation model)
    # User IDs are random - no locality
    indices_random = np.random.randint(0, table_rows, size=N)

    start = time.perf_counter()
    result_seq = embeddings[indices_sequential]
    seq_time = time.perf_counter() - start

    start = time.perf_counter()
    result_rand = embeddings[indices_random]
    rand_time = time.perf_counter() - start

    print(f"Sequential gather: {seq_time*1000:.2f} ms")
    print(f"Random gather: {rand_time*1000:.2f} ms")
    print(f"Random slowdown: {rand_time / seq_time:.2f}x")
    # Large random gather will be cache-miss-limited, not compute-limited

compare_gather_vs_sequential()
Loop Unrolling - Exposing More ILP to the SIMD Unit
Even with SIMD, a loop body with a single vectorized instruction may not fully utilize the execution pipeline. Modern Intel CPUs have 2 FMA execution ports - they can execute two AVX2 FMA instructions every cycle. A loop that issues only one vector instruction per iteration is limited by loop overhead (counter update, branch) and, for reductions, by the latency of the dependency chain through a single accumulator.
Loop unrolling processes multiple iterations per loop body, allowing the compiler to issue multiple independent SIMD instructions per cycle:
import numpy as np
import time

def benchmark_unrolling_effect():
    """
    Demonstrate why NumPy's vectorized kernels process multiple
    elements per iteration beyond just the SIMD width.
    """
    N = 100_000_000
    a = np.random.rand(N).astype(np.float32)
    b = np.random.rand(N).astype(np.float32)

    # Single NumPy operation - let NumPy use its optimized kernel
    start = time.perf_counter()
    c = a + b
    single_time = time.perf_counter() - start

    # NumPy internally unrolls: its vectorized kernel processes
    # multiple AVX2 vectors per iteration (typically 4-8 vectors unrolled)
    # to keep both FMA ports busy.

    # Compare float32 vs float64 to confirm SIMD width effect
    a64 = a.astype(np.float64)
    b64 = b.astype(np.float64)
    start = time.perf_counter()
    c64 = a64 + b64
    f64_time = time.perf_counter() - start

    print(f"float32 add {N:,} elements: {single_time*1000:.2f} ms")
    print(f"float64 add {N:,} elements: {f64_time*1000:.2f} ms")
    print(f"float32 speedup: {f64_time / single_time:.2f}x")
    # Expected: ~2x for float32 (2x per AVX2 register width)

benchmark_unrolling_effect()
Compiler Vectorization Flags
The difference between compiled code that uses SIMD and code that does not often comes down to a single compiler flag.
# GCC / Clang vectorization flags
# -O2: basic optimizations, some auto-vectorization
gcc -O2 my_code.c -o my_code
# -O3: aggressive optimization, full auto-vectorization enabled
gcc -O3 my_code.c -o my_code
# -march=native: use all SIMD extensions available on THIS machine
# (binary will not run on older CPUs - only for local use or specific deployment)
gcc -O3 -march=native my_code.c -o my_code
# -march=x86-64-v3: use AVX2 (widely supported, good target for 2019+ hardware)
gcc -O3 -march=x86-64-v3 my_code.c -o my_code
# -march=x86-64-v4: use AVX-512 (2020+ Intel, 2022+ AMD)
gcc -O3 -march=x86-64-v4 my_code.c -o my_code
# -ffast-math: allow reordering of floating-point operations
# (may change results slightly but enables FMA and other optimizations)
# CAUTION: can break IEEE 754 compliance, affects NaN handling
gcc -O3 -march=native -ffast-math my_code.c -o my_code
# -fopt-info-vec-optimized: show which loops were vectorized
gcc -O3 -march=native -fopt-info-vec-optimized my_code.c -o my_code 2>&1 | grep "vectorized"
# -fopt-info-vec-missed: show which loops could NOT be vectorized (and why)
gcc -O3 -march=native -fopt-info-vec-missed my_code.c -o my_code 2>&1 | grep "not vectorized"
# For Python extensions (Cython, ctypes, cffi):
# In setup.py
extra_compile_args = [
"-O3",
"-march=native", # or "-march=x86-64-v3" for distribution
"-ffast-math",
"-funroll-loops",
]
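The same choice can be automated at build time. Below is a hypothetical setup.py helper that inspects the build machine's CPU flags (Linux /proc/cpuinfo only) and picks a conservative -march target; the flag names and fallback order are assumptions to adapt to your own deployment fleet.
def pick_march_flag():
    """Pick a -march target from the build machine's CPU flags (Linux only)."""
    try:
        with open("/proc/cpuinfo") as f:
            flags = set()
            for line in f:
                if line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
                    break
    except FileNotFoundError:
        return "-march=x86-64"        # unknown platform: stay at baseline
    if "avx512f" in flags:
        return "-march=x86-64-v4"     # AVX-512 capable
    if "avx2" in flags and "fma" in flags:
        return "-march=x86-64-v3"     # AVX2 + FMA
    return "-march=x86-64-v2"         # SSE4.2-era baseline

extra_compile_args = ["-O3", pick_march_flag(), "-funroll-loops"]
print(extra_compile_args)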
ARM NEON vs Intel AVX - Architectural Comparison
For ML engineers deploying on AWS Graviton, Apple Silicon, or mobile edge devices, understanding ARM NEON is essential.
| Feature | Intel AVX2 | Intel AVX-512 | ARM NEON | ARM SVE |
|---|---|---|---|---|
| Register width | 256-bit | 512-bit | 128-bit | 128-2048-bit (scalable) |
| float32 per reg | 8 | 16 | 4 | 4-64 |
| float64 per reg | 4 | 8 | 2 | 2-32 |
| FMA support | Yes | Yes | Yes (FMLA) | Yes |
| Masked ops | Partial | Full (k-masks) | Partial | Full (predicate regs) |
| Gather | Partial | Full | No | Yes |
| Typical CPU | Intel/AMD server | Intel Xeon | AWS Graviton, Apple M | AWS Graviton3, Neoverse V1 |
ARM NEON has narrower registers (128-bit) but more of them (32 Q registers vs 16 YMM). SVE is ARM's answer to AVX-512: variable-width SIMD where the same binary runs on hardware with different vector widths (future-proof). Apple's AMX (Apple Matrix Extension) in M1/M2 chips provides matrix operation acceleration beyond NEON.
import platform
import subprocess

def detect_simd_capabilities():
    """
    Detect available SIMD extensions on the current CPU.
    """
    arch = platform.machine()
    print(f"Architecture: {arch}")

    if arch in ("x86_64", "AMD64"):
        # Read /proc/cpuinfo on Linux
        try:
            with open("/proc/cpuinfo") as f:
                cpuinfo = f.read()
            flags_line = [l for l in cpuinfo.split("\n") if l.startswith("flags")][0]
            flags = flags_line.split(":")[1].split()
            simd_flags = [f for f in flags if f in
                          ("sse", "sse2", "sse4_1", "sse4_2", "avx", "avx2", "avx512f",
                           "avx512bw", "avx512vl", "fma")]
            print(f"SIMD flags: {simd_flags}")
        except (FileNotFoundError, IndexError):
            # macOS: use sysctl
            result = subprocess.run(
                ["sysctl", "-n", "machdep.cpu.features"],
                capture_output=True, text=True
            )
            print(f"CPU features: {result.stdout.strip()}")
    elif arch in ("arm64", "aarch64"):
        print("ARM64: NEON is always available (mandatory in ARMv8)")
        try:
            result = subprocess.run(
                ["sysctl", "-n", "hw.optional.arm.FEAT_SVE"],
                capture_output=True, text=True
            )
            if result.returncode == 0:
                sve = result.stdout.strip()
                print(f"SVE available: {sve == '1'}")
        except FileNotFoundError:
            pass

detect_simd_capabilities()
Cython SIMD Example - Writing Vectorized Extensions
When auto-vectorization is not sufficient and you need explicit control, Cython with C SIMD intrinsics is the practical Python-ecosystem solution:
# simd_ops.pyx - Cython extension with SIMD hints
import numpy as np
cimport numpy as np

def vectorized_relu(np.ndarray[np.float32_t, ndim=1] x):
    """
    ReLU activation using Cython with auto-vectorization hints.
    The typed memoryview + no-GIL + simple loop enables
    the compiler to generate AVX2 code with -O3 -march=native.
    """
    cdef int n = len(x)
    cdef np.ndarray[np.float32_t, ndim=1] out = np.empty(n, dtype=np.float32)
    cdef float[:] x_view = x
    cdef float[:] out_view = out
    cdef int i

    # With -O3 -march=native, this loop auto-vectorizes to 8 elements/iter (AVX2)
    # or 16 elements/iter (AVX-512)
    with nogil:
        for i in range(n):
            out_view[i] = x_view[i] if x_view[i] > 0.0 else 0.0

    return out

# setup.py for compilation with vectorization flags
# (extra_compile_args must be attached to the Extension object):
#
# from setuptools import setup, Extension
# from Cython.Build import cythonize
# import numpy as np
#
# ext = Extension(
#     "simd_ops",
#     sources=["simd_ops.pyx"],
#     include_dirs=[np.get_include()],
#     extra_compile_args=["-O3", "-march=native", "-ffast-math", "-funroll-loops"],
# )
#
# setup(
#     ext_modules=cythonize(
#         [ext],
#         compiler_directives={
#             "boundscheck": False,
#             "wraparound": False,
#             "cdivision": True,
#         },
#     ),
# )
Checking Vectorization in Practice
import numpy as np
import time

def check_vectorization_width():
    """
    Empirically verify SIMD is active by checking float32 vs float64 throughput.
    If AVX2 is active: float32 should be ~2x faster (8 vs 4 per register).
    If AVX-512 is active: float32 should be ~2x faster (16 vs 8 per register).
    """
    N = 200_000_000
    runs = 5
    a32 = np.random.rand(N).astype(np.float32)
    b32 = np.random.rand(N).astype(np.float32)
    a64 = a32.astype(np.float64)
    b64 = b32.astype(np.float64)

    # Warmup
    _ = a32 + b32
    _ = a64 + b64

    start = time.perf_counter()
    for _ in range(runs):
        c32 = a32 + b32
    t32 = (time.perf_counter() - start) / runs

    start = time.perf_counter()
    for _ in range(runs):
        c64 = a64 + b64
    t64 = (time.perf_counter() - start) / runs

    ratio = t64 / t32
    bw32 = (N * 3 * 4) / t32 / 1e9  # 3 arrays * 4 bytes
    bw64 = (N * 3 * 8) / t64 / 1e9  # 3 arrays * 8 bytes

    print(f"float32 time: {t32*1000:.2f} ms, bandwidth: {bw32:.1f} GB/s")
    print(f"float64 time: {t64*1000:.2f} ms, bandwidth: {bw64:.1f} GB/s")
    print(f"Ratio f64/f32: {ratio:.2f}x (expected ~2.0x if SIMD active)")

    if ratio > 1.7:
        print("SIMD vectorization appears ACTIVE (f32 significantly faster than f64)")
    else:
        print("WARNING: SIMD may not be active, or bandwidth-limited")

check_vectorization_width()
Production Engineering Notes
Verifying SIMD in Your NumPy Installation
# Check NumPy build config (shows whether MKL or OpenBLAS is linked)
python3 -c "import numpy as np; np.show_config()"
# Check which CPU SIMD features NumPy detected at runtime (NumPy >= 1.24)
python3 -c "import numpy as np; np.show_runtime()"
# For Intel MKL builds, verify the MKL runtime library is present
python3 -c "
import ctypes
try:
    ctypes.CDLL('libmkl_rt.so')
    print('MKL available: True')
except OSError:
    print('MKL not found, using OpenBLAS or other BLAS')
"
PyTorch CPU Feature Detection
import torch

def show_torch_cpu_info():
    """Show what SIMD extensions PyTorch is using."""
    print("PyTorch build config:")
    print(torch.__config__.show())

    # Check for specific optimized backends
    print("\nCPU backends (torch.backends):")
    print(f"MKL available: {torch.backends.mkl.is_available()}")
    print(f"MKL-DNN (oneDNN) available: {torch.backends.mkldnn.is_available()}")

    # This affects linear layer performance significantly
    if torch.backends.mkldnn.is_available():
        print("oneDNN: PyTorch will use MKL-DNN for conv/linear on CPU")
        print("This uses AVX-512 or AVX2 internally")

show_torch_cpu_info()
Disabling AVX-512 When It Hurts Performance
AVX-512 comes with a surprising gotcha: enabling AVX-512 on Intel CPUs causes a frequency reduction (down-clock) of 100-400 MHz because the wider vector units draw more power. For code that is not sufficiently vectorized to benefit from AVX-512, this frequency reduction can make performance worse than AVX2.
# Check AVX-512 throttling behavior
# Intel CPUs have different "license levels" for frequency:
# License 0: No AVX - full frequency
# License 1: AVX2 - slight reduction
# License 2: AVX-512 - significant reduction
# For inference servers where latency matters more than throughput:
# Consider disabling AVX-512 to avoid frequency reduction
# Environment variable for MKL:
export MKL_ENABLE_INSTRUCTIONS=AVX2
# For PyTorch:
# No direct flag, but you can recompile or use AVX2-targeted build
-ffast-math allows the compiler to reorder floating-point operations. This can change the result of your computation because floating-point addition is not associative: (a + b) + c does not always equal a + (b + c) in floating-point arithmetic due to rounding. For ML training, these differences are typically negligible (below gradient noise). For financial calculations or scientific computing where exact reproducibility matters, do not use -ffast-math.
AVX-512 frequency throttling on Intel Skylake-X and Ice Lake CPUs can reduce clock speed by 300-400 MHz when AVX-512 code is active. For workloads that mix AVX-512 vectorized sections with scalar code, the processor takes time to transition back to full frequency. Benchmark carefully on your specific hardware. In some workloads, disabling AVX-512 and using AVX2 results in higher throughput due to eliminating frequency throttling penalties.
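If NumPy is linked against MKL, the MKL_ENABLE_INSTRUCTIONS variable shown above can also be set from Python, as long as it happens before the MKL runtime initializes. A minimal sketch, assuming an MKL-backed NumPy build (OpenBLAS builds ignore this variable):
import os
# Must be set before the first import of numpy / anything that loads MKL,
# because MKL reads it once at library load time.
os.environ["MKL_ENABLE_INSTRUCTIONS"] = "AVX2"

import numpy as np  # imported after the cap is in place
np.show_config()    # confirm which BLAS backend is actually linked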
Common Mistakes
Compiling Python extensions without vectorization flags
The default compilation of Cython or C extensions via python setup.py build_ext typically inherits the interpreter's generic flags (-O2 or similar) without -march=native, so the generated code targets baseline x86-64 and uses nothing wider than SSE2. For a production ML system compiled on an older CI server (which may not have AVX2) and deployed to newer hardware, you get no SIMD benefit beyond that baseline. Always specify explicit compiler flags in setup.py and match them to your deployment target CPU generation.
Expecting auto-vectorization with indirect accesses
Any loop of the form out[i] = table[indices[i]] (embedding lookup, hash table, sparse operations) cannot be auto-vectorized by most compilers because the memory access pattern is unpredictable at compile time. AVX-512 gather can help but requires explicit intrinsics or libraries. For embedding lookups, the bottleneck is almost always memory bandwidth (cache misses), not compute throughput - adding SIMD to embedding lookups typically has minimal effect unless the embeddings are in cache.
Assuming NumPy operations are always SIMD-accelerated
Not all NumPy operations use the SIMD-accelerated BLAS path. Element-wise Python operations on small arrays may go through Python object protocol overhead. Functions like np.sort, np.argsort, and complex aggregations may use scalar fallback paths. Always benchmark with production-scale array sizes. The BLAS path typically activates for matrix multiplications above ~16x16 elements.
Mixing float32 and float64 in hot paths
NumPy will silently upcast to float64 when operations mix float32 and float64 arrays: float32_array + float64_array produces a float64 result, losing the float32 SIMD benefit (2x fewer elements per register). Whether a float64 scalar also triggers the upcast depends on the NumPy version's promotion rules, so do not rely on it. PyTorch promotes mixed float32/float64 tensors the same way. Always be explicit about dtypes in performance-sensitive code, and prefer float32 unless you need the precision; a quick check is shown below.
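A short sketch of the array-array upcast and an explicit-cast fix, plus a cheap dtype assertion you can drop into a hot path:
import numpy as np

a32 = np.ones(8, dtype=np.float32)
b64 = np.ones(8, dtype=np.float64)

mixed = a32 + b64      # silently promotes to float64
print(mixed.dtype)     # float64 - half as many elements per SIMD register

fixed = a32 + b64.astype(np.float32)  # explicit cast keeps the float32 path
print(fixed.dtype)     # float32

# Cheap guard for performance-sensitive code
assert fixed.dtype == np.float32, "unexpected upcast in hot path"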
Interview Questions and Answers
Q1: What is SIMD and how does it increase throughput without increasing clock speed?
SIMD (Single Instruction, Multiple Data) is a CPU execution model where one instruction operates on multiple data elements simultaneously. Instead of adding two scalars, an AVX2 instruction adds two vectors of 8 float32 values. The hardware cost is primarily a wider data path and wider registers - the decode logic, control flow, and execution units remain similar. One instruction still takes one cycle to execute (pipelined), but it produces 8 results instead of 1. For a CPU running at 3 GHz with 8-wide SIMD and 2 FMA ports, the peak float32 throughput is 3e9 * 8 * 2 FLOPS * 2 ports = 96 GFLOPS per core - the same clock frequency producing 32x more work than a scalar core issuing one FLOP per cycle (3 GFLOPS).
Q2: What conditions must hold for a compiler to auto-vectorize a loop?
The compiler must prove:
- No loop-carried data dependencies (iteration N must not read a value written in iteration N-1)
- No pointer aliasing (input and output arrays do not overlap - use restrict in C)
- The loop has a known or estimable trip count (or the compiler can generate a prologue/epilogue to handle residual elements)
- Memory accesses are contiguous or have a known constant stride
- The loop body does not contain function calls that cannot be inlined or vectorized
- There are no side effects the compiler cannot reason about
If any of these fails, the compiler will emit a scalar loop. Use -fopt-info-vec-missed (GCC) or -Rpass-missed=loop-vectorize (Clang) to see exactly why a loop was not vectorized.
Q3: Explain why np.dot is so much faster than a Python loop for matrix multiplication.
Several compounding factors:
- BLAS dispatch: np.dot dispatches to an optimized BLAS implementation (OpenBLAS or MKL). These use handwritten assembly with AVX-512 FMA instructions in their inner kernels.
- Cache-blocked algorithm: BLAS uses a 3-level blocking algorithm where each level matches a cache level (L1 inner block, L2 outer block, L3 outer-outer block). This maximizes cache reuse.
- SIMD width: AVX2 processes 4 doubles per FMA instruction; AVX-512 processes 8. A Python loop processes 1.
- FMA instructions: A single FMA computes a*b + c in one instruction. A Python loop uses separate multiply and add operations with the overhead of the Python object protocol between them.
- No Python overhead: BLAS operates on raw C arrays of doubles. Python loops work through Python float objects (28 bytes each, heap-allocated, garbage-collected).
Q4: What is the difference between SSE, AVX2, and AVX-512? When would you target each?
- SSE/SSE2 (128-bit): 4 x float32 or 2 x float64. Available on all x86-64 CPUs made since 2001. Safe baseline for any x86 deployment.
- AVX2 (256-bit): 8 x float32 or 4 x float64. Available on Intel Haswell (2013) and AMD Zen (2017) and later. Good target for production code - covers nearly all servers in use today.
- AVX-512 (512-bit): 16 x float32 or 8 x float64. Intel Skylake-X (2017) server CPUs and Ice Lake (2021), AMD Zen 4 (2022). Best peak throughput but carries Intel frequency throttling risk.
For distributable software targeting generic x86 hardware, AVX2 (-march=x86-64-v3) is the sweet spot. For private ML inference clusters on known hardware, AVX-512 (-march=native) maximizes throughput. For ARM (AWS Graviton, Apple M-series), target NEON via -march=native on ARM64.
Q5: What is FMA (Fused Multiply-Add) and why does it matter for neural networks?
FMA computes a*b + c as a single instruction with a single rounding step. Without FMA, you need a separate multiply instruction followed by an add instruction, each with its own rounding. FMA is important for neural networks for three reasons:
- Throughput: One FMA instruction performs 2 FLOPS (a multiply and an add) in one instruction slot. Without FMA, you need 2 instruction slots for the same work, halving throughput.
- Accuracy: A single rounding step in FMA preserves more precision than two separate roundings. This matters for training stability in low-precision arithmetic.
- BLAS kernel efficiency: The innermost loop of matrix multiplication is precisely a dot product: repeated acc = acc + a[i] * b[i] operations. BLAS kernels are built around FMA chains that fill both FMA execution ports every cycle.
FMA is available in AVX2 (Intel Haswell+), AVX-512, and ARM NEON (FMLA instruction). Enabling FMA via -march=native or -mfma is one of the single highest-impact compiler flags for ML code.
Q6: How does PyTorch use SIMD under the hood for CPU inference?
PyTorch's ATen library contains multiple backends:
- Default scalar kernel: Plain C++ loops for correctness reference
- Vectorized kernel (vec.h / Vectorized class): Template-based SIMD abstraction that generates AVX2 or AVX-512 code depending on compile-time detected CPU flags
- oneDNN/MKL-DNN kernels: For convolution and matrix operations, PyTorch calls Intel's oneDNN library, which uses hand-optimized AVX-512 code paths
- FBGEMM: Facebook's optimized low-precision matrix multiplication library for inference, with int8 quantization and AVX-512 VNNI (Vector Neural Network Instructions) support
At model load time, PyTorch runs CPUID to detect available SIMD extensions and selects the appropriate kernel backend for each operation. Operators like torch.relu, torch.sigmoid, torch.add all have vectorized backends that process 8 or 16 elements per iteration on modern CPUs. The torch.compile() interface (PyTorch 2.0+) can additionally generate optimized SIMD code via the TorchInductor backend.
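As a minimal illustration of the torch.compile path mentioned above (available in PyTorch 2.0+), the sketch below compiles a small module for CPU inference; the toy model is hypothetical and stands in for any CPU-bound module.
import torch
import torch.nn as nn

# Hypothetical toy model - any CPU inference module would do
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# torch.compile traces the model so the TorchInductor backend can emit
# fused, vectorized CPU kernels for element-wise and pointwise chains
compiled = torch.compile(model)

x = torch.randn(32, 256, dtype=torch.float32)
with torch.no_grad():
    out = compiled(x)  # first call triggers compilation; later calls reuse the kernels
print(out.shape)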
