Module 01: Transformer Architecture
Every large language model you use today - GPT-4, Claude, Gemini, LLaMA - runs on the same foundation: the transformer. Understanding it deeply is not optional for ML engineers; it is table stakes.
This module tears the transformer apart, component by component. Not just "what it does" but why it was designed that way, what came before and failed, and how each piece fits into the whole. By the end, you will be able to implement a transformer from scratch and explain every design decision in an interview.
What You Will Learn
After this module, you will be able to:
- Explain why RNNs failed at scale and how the transformer replaced them
- Implement self-attention and multi-head attention from scratch in NumPy
- Derive the scaled dot-product attention formula from first principles
- Compare sinusoidal, learned, and RoPE positional encodings
- Decide whether a task calls for an encoder-only, decoder-only, or encoder-decoder architecture
- Build BPE tokenization from scratch
- Read and interpret scaling law papers (Kaplan 2020, Chinchilla 2022)
- Answer every common transformer interview question with depth
Lesson Map
| # | Lesson | What You Learn |
|---|---|---|
| 01 | Attention Is All You Need | The 2017 paper - background, motivation, full architecture overview |
| 02 | Self-Attention Mechanism | Q/K/V, scaled dot-product, NumPy from scratch |
| 03 | Multi-Head Attention | Parallel heads, concatenation, parameter counting |
| 04 | Positional Encoding | Sinusoidal, learned, RoPE, ALiBi - when to use what |
| 05 | Feed-Forward Layers | FFN role, GELU vs SwiGLU, FFN as memory |
| 06 | Layer Norm and Residuals | Skip connections, pre-norm vs post-norm, RMSNorm |
| 07 | Encoder vs Decoder vs Encoder-Decoder | BERT, GPT, T5 - architecture choice guide |
| 08 | Tokenization Deep Dive | BPE from scratch, byte-level BPE, tokenizer artifacts |
| 09 | Embedding Spaces | Lookup tables, weight tying, anisotropy, t-SNE |
| 10 | Scaling Laws | Kaplan (2020), Chinchilla (2022), emergent abilities |
Prerequisites
- Basic understanding of neural networks (forward pass, backpropagation)
- Python proficiency - you should be comfortable reading NumPy and PyTorch code
- Linear algebra fundamentals (matrix multiplication, dot products)
- Some exposure to NLP (tokens, vocabulary, language models)
Key Concepts at a Glance
| Concept | Quick Definition |
|---|---|
| Attention | A mechanism to compute a weighted sum of values, where weights depend on query-key similarity |
| Self-attention | Attention where Q, K, V all come from the same sequence |
| d_model | The embedding dimension - typically 512, 768, 1024, or 4096 |
| d_k | Key/query dimension per head - usually d_model / num_heads |
| Positional encoding | Position information injected into the model - classically a vector added to the embeddings; RoPE and ALiBi act inside the attention computation instead |
| Layer normalization | Normalizing activations across features (not the batch dimension) |
| Residual connection | Adding the input of a sublayer to its output: x + Sublayer(x) |
| Causal mask | A mask that prevents tokens from attending to future tokens |
| BPE | Byte-Pair Encoding - a subword tokenization algorithm that repeatedly merges the most frequent adjacent symbol pair; widely used in LLM tokenizers |
| Scaling laws | Power-law relationships between model loss and parameter count (N), dataset size in tokens (D), and training compute (C) |
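
Several of the definitions above fit in a few lines of NumPy. The block below is a minimal sketch, not the implementation the lessons build: it shows single-head scaled dot-product self-attention with a causal mask, a layer norm over the feature dimension, and a residual connection. The function names, the toy shapes, the random weights, and the output projection `w_o` are assumptions made for illustration. The layer norm omits the learnable gain and bias, and the residual uses the pre-norm arrangement, which is only one of the variants Lesson 06 compares.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a causal mask.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k).
    Q, K, V all come from the same sequence x, hence "self"-attention.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len) query-key similarity
    seq_len = x.shape[0]
    # True strictly above the diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights                  # weighted sum of values

def layer_norm(x, eps=1e-5):
    # Normalize across features (last axis), not across the batch or sequence.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_k = d_model // num_heads                       # per-head key/query dimension

x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
w_o = rng.normal(size=(d_k, d_model))            # hypothetical projection back to d_model

attn_out, weights = causal_self_attention(layer_norm(x), w_q, w_k, w_v)
y = x + attn_out @ w_o                           # residual connection around the sublayer
```

Printing `weights` shows zeros above the diagonal: the first token can attend only to itself, while the last token attends to the whole sequence.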
:::tip Start here if you're new to transformers
Lesson 01 gives you the full picture before you dive into the details. Even if you've "heard of transformers before", read it - the historical context will make every other lesson click faster.
:::
