
GPU vs CPU Architecture

Reading time: ~35 min · Interview relevance: High · Target roles: ML Engineer, Systems Engineer, AI Infrastructure

A CPU has 16 fast cores. A GPU has 16,000 slow cores. Matrix multiplication does not need fast cores - it needs thousands of cores doing simple operations simultaneously.


You are three weeks into training a 7B parameter language model on a cluster of A100s. Loss is decreasing. Throughput is 42,000 tokens per second. Then your infrastructure lead mentions that the same cluster, configured differently, was hitting 110,000 tokens per second last quarter on a similar model. You have not changed the code. You have not changed the hardware. The difference is entirely in how well the workload maps to the GPU's execution model.

This is not a rare situation. Most ML engineers use GPUs without understanding them. They know GPUs are fast for matrix math. They know CUDA exists. But they do not know why a GPU is fast, which means they cannot reason about when it is not fast, and they cannot explain the 2.6x throughput gap between their run and the previous one.

The gap almost always comes down to utilization - and utilization is a function of how well your computation pattern fits the GPU's fundamental design philosophy. That philosophy is radically different from the CPU's. To understand one, you have to understand both, and to understand both, you have to start at the silicon level.

This lesson builds the mental model that makes everything else in GPU programming, kernel optimization, and distributed training make sense. It is the foundation. Do not skip it.


Why This Exists

Before GPUs, training neural networks on CPUs was the only option. A ResNet-50 forward pass on a modern CPU takes roughly 40-60ms. On an A100 GPU, the same forward pass takes under 1ms. That is a 50-60x speedup from a single architectural shift. The reason for that speedup is not magic - it is the consequence of a deliberately different design philosophy baked into the silicon itself.

CPUs were designed in an era when programs were sequential: do this, then do that, then do this other thing. The primary performance challenge was making each step as fast as possible. GPU design evolved from a completely different problem - rendering millions of pixels simultaneously - where the challenge was not making one computation fast, but making millions of identical computations happen at the same time.

Neural network training is, structurally, much closer to pixel rendering than to sequential computation. A matrix multiply over a 4096-dim embedding does the same multiply-accumulate operation 16 million times, each independently. The GPU was built for exactly this.


Historical Context

The story starts not with deep learning but with videogames. In the late 1990s, 3D games needed to transform millions of vertices and shade millions of pixels every frame at 60Hz. CPUs could not keep up. Graphics hardware vendors - primarily NVIDIA, ATI (now AMD), and 3dfx - built dedicated rendering chips that could apply the same shader program to every pixel in parallel.

NVIDIA's GeForce 256 in 1999 was the first chip they called a "GPU". It had fixed-function hardware for vertex transformation and pixel shading. But the key insight came in 2006 with the GeForce 8800 and the G80 architecture. This was the first GPU with fully programmable shader processors - the same hardware unit could run arbitrary code on vertices, pixels, or general data.

NVIDIA called this Compute Unified Device Architecture, or CUDA, and released it in 2007. For the first time, researchers could write general programs that ran on GPU hardware. The first ML breakthrough using CUDA was Raina et al.'s "Large-scale Deep Unsupervised Learning using Graphics Processors" in 2009, followed by Krizhevsky and Sutskever's GPU-accelerated neural network work that eventually became AlexNet in 2012.

AlexNet's 2012 ImageNet win was the moment the world noticed. The model trained in 5-6 days on two GTX 580 GPUs. The equivalent CPU training would have taken months. From that point forward, GPU-accelerated deep learning was not an academic curiosity - it was the only viable path to scale.

The architectural choices NVIDIA made in 2006 to serve videogame rendering turned out to be exactly the right choices for matrix algebra. That was not an accident - linear algebra underlies both graphics transformations and neural network computation.


Core Concepts

The CPU Design Philosophy: Latency Optimization

A modern CPU - say, an Intel Core i9 or AMD EPYC - is an engineering marvel optimized for one goal: make any single computation finish as fast as possible. Every design decision flows from that goal.

Out-of-order execution. The CPU looks ahead in the instruction stream and reorders operations to avoid stalls. If instruction 5 can execute before instruction 4's result is ready, it does. A high-end CPU can track hundreds of in-flight instructions simultaneously.

Branch prediction. Conditional branches (if/else, loops) are everywhere in general-purpose code. The CPU predicts which branch will be taken and speculatively executes it. Modern predictors are right over 95% of the time. When they are wrong, the CPU flushes the pipeline and restarts - expensive, but rare enough to be worthwhile.

Large, multi-level caches. L1 cache is 32-64KB, extremely fast (4-cycle access). L2 is 256KB-1MB (12 cycles). L3 can be 32-96MB (40-60 cycles). These caches exist to hide the ~200-cycle latency of going to main memory. A CPU's area budget allocates roughly 50-60% of die space to cache.

Deep pipelines. Modern CPUs have 14-20 stage pipelines that enable high clock frequencies (3-5 GHz). Each stage does a small piece of work, and many instructions can be in the pipeline simultaneously.

The result: a CPU core can execute a complex, branching, data-dependent computation with minimal latency. Give it a tight loop that processes 1 million elements serially, and each element flies through in nanoseconds.

But here is the constraint: a high-end CPU has 8-64 cores. The transistor budget that went into cache, branch predictor, and out-of-order machinery cannot also go into more compute units.

CPU Die Area Budget (approximate, modern server CPU):
- Cores (execution units): ~20-25%
- L3 Cache: ~40-50%
- Memory controllers / IO: ~10-15%
- Branch predictors, OOO: ~10-15%
- Misc (power mgmt, etc): ~5-10%

A core-heavy CPU trades cache and control logic for raw compute throughput. But even then, you get dozens of cores, not thousands.

The GPU Design Philosophy: Throughput Optimization

A GPU starts from a different question: what if we do not care how long any single computation takes, as long as the total throughput across all computations is maximized?

This leads to completely different silicon decisions.

Thousands of simple cores. An H100 SXM5 has 16,896 CUDA cores (FP32) organized into 132 Streaming Multiprocessors (SMs). Each CUDA core is simple - it can do a multiply-add, but it has no branch predictor, no out-of-order logic, minimal cache. The transistors that would have gone into control hardware instead went into more execution units.

No branch prediction. When a branch diverges across threads in a warp (we will cover warps in depth in the next lesson), the GPU serializes both paths. This is catastrophic for branch-heavy code. It is irrelevant for matrix math, which has almost no branches.

Small caches, enormous bandwidth. An H100's L2 cache is 50MB - respectable in absolute terms but tiny relative to the number of cores it serves. What the GPU has instead of cache is bandwidth: HBM3 memory at 3.35 TB/s. Where a CPU hides latency with cache, a GPU hides it with parallelism - by running thousands of threads, it keeps the execution units busy while some threads wait for memory.

Warp execution. The GPU executes threads in groups of 32 called warps. All 32 threads in a warp execute the same instruction simultaneously on different data. This is SIMT - Single Instruction, Multiple Threads. It is similar to SIMD (Single Instruction, Multiple Data) on CPUs, but applied at a much larger scale and with thread-level granularity.

GPU Die Area Budget (approximate, H100 SXM5):
- CUDA Cores + Tensor Cores: ~40-50%
- HBM memory interfaces and controllers: ~30-35%
- L2 cache + shared memory: ~8-12%
- NVLink / PCIe interfaces: ~5-8%
- Control logic + misc: ~5-8%

The GPU sacrifices single-thread performance for massive throughput across thousands of threads.
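
To get a feel for what "enormous bandwidth" means in practice, you can estimate achieved memory bandwidth by timing a large element-wise operation, which is memory-bound. Below is a minimal sketch in PyTorch; the tensor size is an arbitrary choice, and the number you get depends entirely on your GPU.

import torch
import time

# Estimate achieved HBM bandwidth with a memory-bound element-wise op.
# A scalar multiply reads each element once and writes each element once.
device = torch.device("cuda")
x = torch.randn(128 * 1024 * 1024, device=device)  # 128M fp32 elements = 512 MB

# Warmup, then synchronize so the timer measures GPU work, not launch time
for _ in range(3):
    y = x * 2.0
torch.cuda.synchronize()

n_iter = 50
start = time.perf_counter()
for _ in range(n_iter):
    y = x * 2.0
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / n_iter

bytes_moved = x.numel() * 4 * 2  # one read + one write per element, fp32
print(f"Effective bandwidth: {bytes_moved / elapsed / 1e12:.2f} TB/s")
# Expect a large fraction of the spec-sheet number (3.35 TB/s on H100 SXM);
# for small tensors, launch overhead dominates and the figure collapses.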

SIMT: Single Instruction, Multiple Threads

SIMT is the execution model that makes GPUs work. Understanding it precisely prevents a lot of confusion.

In a CPU SIMD unit (like AVX-512), one instruction operates on a vector of data (e.g., 16 floats at once). All 16 operations are truly simultaneous, and the programmer explicitly manages vector registers.

In GPU SIMT, threads are the unit of work. Each thread has its own register file and its own program counter, giving the illusion of independent execution. But threads are grouped into warps of 32, and within a warp, all 32 threads execute the same instruction on the same clock cycle - just with different data (different registers).

From the programmer's perspective: write code for one thread, launch millions of copies, the hardware runs them in groups of 32.

SIMT Execution (one warp, 32 threads):

Clock 1: [t0: A[0]*B[0]] [t1: A[1]*B[1]] ... [t31: A[31]*B[31]]
Clock 2: [t0: C[0]+=res] [t1: C[1]+=res] ... [t31: C[31]+=res]
Clock 3: [t0: load A[32]] [t1: load A[33]] ... [t31: load A[63]]

All 32 threads execute the same instruction each cycle,
but on different data (different values of i in A[i]).
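
In code, "write code for one thread, launch millions of copies" looks like the sketch below. It uses Numba's CUDA JIT purely for illustration (an assumption - the rest of this lesson uses PyTorch/NumPy); the kernel body is the program for a single thread, and the launch configuration decides how many copies run, grouped into warps of 32 by the hardware.

import numpy as np
from numba import cuda

@cuda.jit
def elementwise_multiply(a, b, c):
    # Code for ONE thread: find this thread's global index, handle one element.
    i = cuda.grid(1)
    if i < c.shape[0]:  # guard: the grid may be slightly larger than the data
        c[i] = a[i] * b[i]

n = 1 << 20
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros(n, dtype=np.float32)

threads_per_block = 256                                  # 8 warps per block
blocks = (n + threads_per_block - 1) // threads_per_block
elementwise_multiply[blocks, threads_per_block](a, b, c)  # launch ~1M threads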

This works perfectly for matrix multiply because every element follows the same code path: load A, load B, multiply, accumulate, store. No divergence.

It breaks down for code like:

# Bad for GPU - branch per thread
if i % 2 == 0:
    result[i] = heavy_computation_A(data[i])
else:
    result[i] = heavy_computation_B(data[i])

Within a warp, threads 0, 2, 4... take the even branch; threads 1, 3, 5... take the odd branch. The GPU must execute both branches for the whole warp, masking off the threads not taking each branch. Throughput halves. This is warp divergence, covered in detail in lesson 5.
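
At the tensor level, the usual workaround is to express the branch as a masked select rather than per-element control flow. The sketch below (with hypothetical heavy_computation_A/B placeholders) makes explicit what the hardware does implicitly under divergence: both branches are evaluated, and a mask picks the result per element.

import torch

def heavy_computation_A(x):   # hypothetical stand-in for one branch
    return torch.sin(x) * x

def heavy_computation_B(x):   # hypothetical stand-in for the other branch
    return torch.cos(x) + x

data = torch.randn(1_000_000, device="cuda")
idx = torch.arange(data.numel(), device="cuda")
mask = (idx % 2 == 0)

# Both branches run over the full tensor; the mask selects per element.
# This is the predicated execution a warp falls back to when its threads diverge.
result = torch.where(mask, heavy_computation_A(data), heavy_computation_B(data))

The win is not that less work is done - it is that the work stays in large, branch-free kernels the GPU can keep fully occupied.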

Why Matrix Multiply Maps Perfectly to GPU

Matrix multiplication $C = A \times B$ for matrices of shape $(M, K)$ and $(K, N)$:

$$C[i,j] = \sum_{k=0}^{K-1} A[i,k] \cdot B[k,j]$$

Every output element $C[i,j]$ is independent. The computation for $C[0,0]$ does not affect $C[1,1]$. There are $M \times N$ independent dot products, each requiring $K$ multiply-accumulate operations.

For a typical LLM weight matrix - say, a 4096x4096 linear layer - that is 16.7 million independent output elements. Each can run on a separate thread. With 16,896 CUDA cores plus Tensor Cores that multiply small matrix tiles in a single instruction, an H100 can process entire output tiles simultaneously.

The computation is also regular: the same number of FLOPs per output element ($2K$ FLOPs), the same memory access pattern, no data dependencies between outputs. This regularity is exactly what SIMT needs.

Let us look at what happens in Python:

import numpy as np
import time

# Matrix sizes representative of a transformer FFN
M, K, N = 4096, 4096, 4096

A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

# CPU matmul (uses highly optimized BLAS, but still CPU-bound)
start = time.perf_counter()
for _ in range(10):
    C_cpu = np.dot(A, B)
cpu_time = (time.perf_counter() - start) / 10

print(f"CPU matmul: {cpu_time*1000:.1f} ms")
print(f"CPU TFLOPS: {2*M*K*N / cpu_time / 1e12:.2f}")

# On a modern 16-core CPU: roughly 60-200 ms, ~0.7-2 TFLOPS (depends on BLAS and SIMD width)
# On an A100 GPU (benchmarked below with PyTorch): a few ms and tens of TFLOPS in fp32
import torch
import time

device = torch.device("cuda")
A = torch.randn(4096, 4096, dtype=torch.float32, device=device)
B = torch.randn(4096, 4096, dtype=torch.float32, device=device)

# Warmup
for _ in range(3):
    C = torch.mm(A, B)
torch.cuda.synchronize()

# Benchmark
start = time.perf_counter()
for _ in range(50):
    C = torch.mm(A, B)
torch.cuda.synchronize() # Wait for GPU to finish
gpu_time = (time.perf_counter() - start) / 50

flops = 2 * 4096 * 4096 * 4096
tflops = flops / gpu_time / 1e12
print(f"GPU matmul: {gpu_time*1000:.2f} ms")
print(f"GPU TFLOPS: {tflops:.2f}")
# A100: roughly 7-8 ms, ~17-19 TFLOPS in plain fp32 (about 1-1.5 ms with TF32 Tensor Cores enabled)
# H100: roughly 2-2.5 ms, ~55-65 TFLOPS in plain fp32

The GPU is roughly 15-100x faster for this operation, depending on the CPU and on whether TF32 Tensor Cores are enabled. But notice torch.cuda.synchronize() - GPU execution is asynchronous. Without this call, your timer measures kernel launch time, not execution time. This is a common measurement error.

When CPUs Outperform GPUs

GPUs are not universally faster. The overhead of using one is real:

  • PCIe transfer latency: Moving data from CPU RAM to GPU VRAM over PCIe Gen4 x16 carries a fixed latency of roughly 10-20 microseconds per transfer, plus a bandwidth ceiling of ~32 GB/s theoretical (~25 GB/s in practice). A small tensor (say, a 1024-element vector) costs more in transfer than the GPU saves in compute time.

  • Kernel launch overhead: Each CUDA kernel call has ~5-10 microsecond overhead. A workload that launches 10,000 tiny kernels spends more time on overhead than compute.

  • Sequential dependencies: If step 2 requires step 1's output before it can start, and each step is short, the GPU spends most of its time idle waiting. CPUs handle sequential pipelines with low latency.

Concrete scenarios where CPU wins:

  1. Batch size 1 inference with a small model (the compute is too little to amortize overhead)
  2. Tree-based models (XGBoost, random forests) - highly branchy, not parallelizable the same way
  3. Preprocessing pipelines with complex Python logic between GPU calls
  4. Any operation whose total GPU compute time is well under a millisecond, so launch and transfer overhead dominates

import torch
import time

# Benchmark small vs large batch on GPU vs CPU

def benchmark(device, batch_size, n_iter=1000):
    model = torch.nn.Linear(256, 256).to(device)
    x = torch.randn(batch_size, 256, device=device)

    if device.type == 'cuda':
        # Warmup
        for _ in range(10):
            _ = model(x)
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_iter):
        _ = model(x)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / n_iter
    return elapsed * 1e6  # microseconds

cpu = torch.device('cpu')
gpu = torch.device('cuda')

for bs in [1, 8, 64, 512, 4096]:
    cpu_us = benchmark(cpu, bs)
    gpu_us = benchmark(gpu, bs)
    winner = "GPU" if gpu_us < cpu_us else "CPU"
    print(f"batch={bs:5d}: CPU={cpu_us:7.1f}us GPU={gpu_us:7.1f}us winner={winner}")

# Typical results:
# batch= 1: CPU= 8.2us GPU= 45.1us winner=CPU
# batch= 8: CPU= 14.3us GPU= 47.2us winner=CPU
# batch= 64: CPU= 45.1us GPU= 48.9us winner=CPU
# batch= 512: CPU= 312.4us GPU= 51.3us winner=GPU
# batch= 4096: CPU= 2341.8us GPU= 89.7us winner=GPU

The GPU wins only when the batch is large enough to amortize its overhead. This is why batch size is not just a training hyperparameter - it is a hardware utilization decision.
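
A back-of-envelope model explains where the crossover lands. The sketch below uses assumed round numbers for per-call overhead and sustained throughput (illustrative, not measured); the takeaway is that GPU compute time is negligible at every batch size, and the only question is whether CPU compute time grows past the GPU's fixed overhead.

# Rough break-even estimate for a Linear(256, 256) forward pass.
# Assumed numbers (illustrative only):
launch_overhead_us = 45.0   # kernel launch + framework overhead per GPU call
gpu_tflops = 10.0           # sustained GPU throughput for such a small GEMM
cpu_gflops = 200.0          # sustained CPU throughput for the same GEMM

in_features = out_features = 256
for batch in [1, 8, 64, 512, 4096]:
    flops = 2 * batch * in_features * out_features
    gpu_us = launch_overhead_us + flops / (gpu_tflops * 1e12) * 1e6
    cpu_us = flops / (cpu_gflops * 1e9) * 1e6
    print(f"batch={batch:5d}: est. CPU={cpu_us:8.1f}us  GPU={gpu_us:8.1f}us")

With these assumptions the estimates land close to the typical results above, with the crossover somewhere in the hundreds of samples per batch.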


Architecture Comparison: Visual


Arithmetic Intensity and the Roofline Model (Preview)

Understanding when to use GPU vs CPU also comes down to arithmetic intensity - the ratio of compute operations to memory bytes accessed:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes accessed}}$$

For matrix multiply with dimensions $(M, K) \times (K, N)$:

  • FLOPs: $2MKN$
  • Bytes (fp32, naive): $4(MK + KN + MN)$
  • Arithmetic intensity: $\frac{2MKN}{4(MK + KN + MN)}$

For large square matrices where $M = K = N = 4096$:

$$\text{AI} = \frac{2 \times 4096^3}{4 \times 3 \times 4096^2} = \frac{2 \times 4096}{12} \approx 683 \text{ FLOPs/byte}$$

This is very high. The H100's ridge point is roughly 20 FLOPs/byte against its FP32 peak (and on the order of 300 FLOPs/byte against the BF16 Tensor Core peak). Large matmuls are firmly compute-bound, not memory-bound - the GPU can fully utilize its FP32 units without stalling on memory.

Operations with low arithmetic intensity (element-wise ops, layer norm, simple attention masks) are memory-bound. They do not benefit from tensor cores and their GPU speedup over CPU is much smaller. We cover the roofline model in depth in Lesson 6.
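
As a quick preview of how to apply this, here is a small helper that classifies an operation as memory- or compute-bound from its FLOP count and bytes moved. The peak numbers are the H100 SXM figures quoted in this lesson (FP32 peak and HBM3 bandwidth); substitute your own GPU's specs.

def classify(flops, bytes_moved, peak_tflops=67.0, peak_bw_tbs=3.35):
    """Roofline classification against an assumed compute peak and memory bandwidth."""
    intensity = flops / bytes_moved                       # FLOPs per byte
    ridge = (peak_tflops * 1e12) / (peak_bw_tbs * 1e12)   # ~20 FLOPs/byte here
    bound = "compute-bound" if intensity > ridge else "memory-bound"
    return intensity, ridge, bound

# Large square matmul, fp32, M = K = N = 4096
M = K = N = 4096
print(classify(2 * M * K * N, 4 * (M * K + K * N + M * N)))
# -> (~683, 20.0, 'compute-bound')

# Element-wise add of two 4096x4096 fp32 tensors (2 reads + 1 write per element)
n = 4096 * 4096
print(classify(n, 3 * 4 * n))
# -> (~0.08, 20.0, 'memory-bound')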


Production Engineering Notes

Always synchronize before timing GPU operations. torch.cuda.synchronize() or cudaDeviceSynchronize() blocks until all previously queued work on the device has finished. Without it, you measure kernel scheduling time, not actual execution.

Profile before optimizing. Use torch.utils.benchmark or NVIDIA Nsight to measure actual GPU utilization. Many ML codebases have significant CPU bottlenecks in the data pipeline that starve the GPU. Note that nvidia-smi's "utilization" only reports whether any kernel was running during its sampling window, so a GPU can report 80% utilization while actually sitting idle for much of each step, waiting on CPU preprocessing.

import torch.utils.benchmark as benchmark

# Proper GPU benchmarking
t = benchmark.Timer(
    stmt='torch.mm(A, B)',
    globals={'A': A, 'B': B},
    num_threads=1,
    label='Matrix multiply 4096x4096',
    description='A100 fp32',
)
result = t.blocked_autorange(min_run_time=1.0)
print(result)
# Handles warmup, synchronization, and statistics automatically

Understand the PCIe bottleneck. Moving a 1GB tensor from CPU to GPU over PCIe Gen4 x16 (~32 GB/s theoretical, ~25 GB/s real) takes roughly 40ms. If your training loop copies data every step, you can easily bottleneck here. Use pinned memory (pin_memory=True in the DataLoader, or Tensor.pin_memory()) and async transfers (non_blocking=True) to overlap data transfer with compute.

# Slow: blocking transfer, train loop waits
batch = batch.cuda()
output = model(batch)

# Better: async transfer with pinned memory
# (in DataLoader: pin_memory=True)
batch = batch.cuda(non_blocking=True)
# GPU can start on previous batch while this transfer happens
output = model(batch)

Know your GPU's actual peak throughput. Spec sheet numbers are theoretical maxima. Real throughput depends on:

  • Memory access patterns (coalesced vs scattered)
  • Arithmetic intensity of your kernel
  • Warp occupancy (how many warps are active per SM)
  • Whether you are using Tensor Cores (requires specific shapes/dtypes)

An H100 has 989 TFLOPS of dense BF16 Tensor Core throughput (FP8 roughly doubles that). A typical transformer training run achieves 30-50% of the BF16 peak - roughly 300-500 TFLOPS. Understanding the gap between spec and reality is the entire job of GPU optimization.
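
One way to see where your own run sits relative to that 30-50% band is to estimate Model FLOPs Utilization (MFU) from observed throughput, using the standard approximation of ~6 FLOPs per parameter per token for dense transformer training. The parameter count and throughput below are illustrative assumptions, not measurements.

# Rough MFU (Model FLOPs Utilization) estimate for one GPU.
params = 7e9                     # model parameters (illustrative)
tokens_per_sec_per_gpu = 9_500   # observed training throughput (illustrative)
peak_tflops = 989.0              # H100 SXM dense BF16 Tensor Core peak

# ~6 FLOPs per parameter per token approximates forward + backward
# for a dense transformer, ignoring the attention term.
achieved_tflops = 6 * params * tokens_per_sec_per_gpu / 1e12
mfu = achieved_tflops / peak_tflops
print(f"Achieved ~{achieved_tflops:.0f} TFLOPS -> MFU ~{mfu:.0%}")
# -> Achieved ~399 TFLOPS -> MFU ~40%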


Common Mistakes

:::warning GPUs Are Not Always Faster
Small models, small batch sizes, and highly sequential operations often run faster on CPU. The overhead of GPU kernel launches, PCIe data transfer, and GPU driver startup can exceed the computation time for small workloads. Measure before assuming GPU is faster - especially for batch size 1 inference.
:::

:::danger Forgetting cuda.synchronize() in Benchmarks
GPU operations execute asynchronously. Code like this measures kernel launch time, not execution time:

start = time.time()
result = torch.mm(A, B)
elapsed = time.time() - start # WRONG - this is ~10 microseconds regardless of matrix size

Always add torch.cuda.synchronize() before stopping the timer, or use torch.utils.benchmark.Timer which handles this correctly.
:::

:::warning Not Accounting for Transfer Overhead in End-to-End Benchmarks
Benchmarking only the GPU kernel while ignoring CPU-to-GPU data transfer gives misleading results. In production, your data starts on CPU (or disk). The full pipeline - load, preprocess, transfer, compute, transfer back - is what determines real throughput.
:::


Interview Questions

Q1: Why is matrix multiplication better suited to GPUs than sequential code like a Python for-loop?

Matrix multiplication has two properties that map perfectly to GPU execution. First, every output element $C[i,j]$ is independent of every other output element - they can all be computed in parallel. With an H100's 16,896 CUDA cores, you can compute nearly 17,000 elements simultaneously. A Python for-loop is inherently sequential: element $i+1$ runs after element $i$ completes.

Second, matrix multiply has high arithmetic intensity. For large matrices, each byte loaded from memory is used in many multiply-accumulate operations. This keeps the GPU execution units busy and prevents memory bandwidth from becoming the bottleneck. Sequential for-loops typically do one operation per loaded element - an arithmetic intensity of roughly 0.25 FLOPs/byte or less for simple element-wise work, far too low to sustain GPU throughput.

Q2: What is SIMT execution and how does it differ from SIMD?

SIMD (Single Instruction, Multiple Data) is a CPU feature where one instruction operates on a vector of values stored in wide registers (e.g., AVX-512 operates on 16 float32 values at once). The programmer explicitly manages vector registers and the hardware processes them as a unit.

SIMT (Single Instruction, Multiple Threads) is the GPU model. The programmer writes code for a single thread, launches many threads, and the hardware groups them into warps of 32. Within a warp, all 32 threads execute the same instruction on the same clock cycle, but each thread has its own register file with its own data. From the programmer's view, threads are independent; from the hardware's view, they move in lockstep. This abstraction makes GPU programming easier (write a scalar program, launch billions of instances) while still achieving data parallelism. The key difference: SIMD is explicit vector programming; SIMT is implicit parallelism via independent thread semantics.

Q3: Name three scenarios where a CPU would outperform a GPU for an ML workload.

  1. Batch size 1 real-time inference with a small model: The GPU kernel launch overhead (~5-10us) plus PCIe transfer time exceeds the actual compute time. A small linear network doing inference on one input at a time runs faster on CPU where there is no transfer overhead and no kernel launch cost.

  2. Tree-based model inference (XGBoost, Random Forest): Tree traversal is a sequential, branch-heavy operation. Each sample follows a unique path through the tree depending on its feature values. Warp divergence makes this catastrophic on GPU - half the threads in a warp will be masked off at each branch. CPUs with branch prediction handle this natively.

  3. Complex preprocessing logic with many small operations: If your pipeline does tokenization, padding, masking, and various string operations before model inference, each of these may involve Python loops and irregular data access. The overhead of scheduling each as a GPU kernel, including data transfer, dominates. It is faster to do all preprocessing on CPU and transfer only the final prepared tensor.

Q4: An H100 is listed at 989 TFLOPS (dense BF16 Tensor Core). A realistic training job achieves 400 TFLOPS. What are the main reasons for the gap?

The gap between spec and reality comes from several factors:

  • Memory bandwidth bottlenecks: Many operations are memory-bound (not compute-bound). Layer norm, dropout, attention masking, and embedding lookups are all bandwidth-limited. The GPU FP8 units sit idle waiting for data.
  • Low arithmetic intensity operations: Not all operations in a training loop have the high arithmetic intensity needed to fully utilize tensor cores. Optimizer updates, activation checkpointing, and gradient accumulation have low intensity.
  • Warp occupancy below 100%: If your batch size or tensor shapes don't fill SMs completely, some warps are idle.
  • Kernel launch overhead and CUDA stream serialization: Sequential kernel launches, synchronization points, and un-fused operations introduce latency that reduces effective utilization.
  • Precision and operation mix: The 989 TFLOPS figure applies to BF16/FP16 Tensor Core math; FP8 roughly doubles it, while plain fp32 work that never touches the Tensor Cores peaks at only ~67 TFLOPS. Any part of the step that runs at lower-throughput precision or off the Tensor Cores pulls the average down.

Q5: What is the roofline model and how does it help you decide whether to optimize memory access vs compute for a given GPU kernel?

The roofline model plots achievable performance (GFLOPS) as a function of arithmetic intensity (FLOPs/byte). It has two regions:

  • Memory-bound: arithmetic intensity is below the ridge point. Performance scales linearly with arithmetic intensity. Doubling compute speed helps nothing; improving memory bandwidth or access patterns helps a lot.
  • Compute-bound: arithmetic intensity is above the ridge point. Performance is capped by peak compute. Improving memory access helps nothing; reducing FLOPs or using faster math (tensor cores, fp8) helps.

The ridge point for an H100 is roughly FP32 peak (67 TFLOPS) / HBM3 bandwidth (3.35 TB/s) = ~20 FLOPs/byte.

A layer norm kernel has an arithmetic intensity of ~1-2 FLOPs/byte - deeply memory-bound. Fusing it with adjacent operations (the same idea FlashAttention applies to attention) to cut redundant trips to memory is the right optimization. A large matmul has an arithmetic intensity of ~200-700 FLOPs/byte - firmly compute-bound. Switching from fp32 to BF16 Tensor Cores raises the compute ceiling (far more FLOPs per cycle), which is what helps in the compute-bound regime.


Summary

CPUs are optimized for latency - making single operations fast via deep caches, out-of-order execution, and branch prediction. GPUs are optimized for throughput - running thousands of simple threads simultaneously via SIMT execution, sacrificing single-thread performance for massive parallelism.

Matrix multiplication, the core operation of deep learning, has the properties GPUs need: millions of independent operations, regular memory access, and high arithmetic intensity. This is why GPU-accelerated deep learning became the dominant paradigm after 2012.

The key mental model: when you write a GPU kernel, you write code for one thread. The hardware runs millions of copies in groups of 32 (warps), all executing the same instruction on different data. Any divergence within a warp serializes execution. Any operation too small to fill the GPU with work loses to CPU overhead.

In the next lesson, we go inside a single GPU core - the Streaming Multiprocessor - to understand exactly how threads are scheduled, how occupancy works, and what limits the performance of a kernel.

© 2026 EngineersOfAI. All rights reserved.