Module 01: Transformer Architecture

Every large language model you use today - GPT-4, Claude, Gemini, LLaMA - runs on the same foundation: the transformer. Understanding it deeply is not optional for ML engineers. It is table stakes.

This module tears the transformer apart, component by component. Not just "what it does" but why it was designed that way, what came before and failed, and how each piece fits into the whole. By the end, you will be able to implement a transformer from scratch and explain every design decision in an interview.

What You Will Learn

After this module, you will be able to:

  - Explain why RNNs failed at scale and what the transformer replaced them with
  - Implement self-attention and multi-head attention from scratch in NumPy
  - Derive the scaled dot-product attention formula from first principles
  - Compare sinusoidal, learned, and RoPE positional encodings
  - Decide whether to use encoder-only, decoder-only, or encoder-decoder
  - Build BPE tokenization from scratch
  - Read and interpret scaling law papers (Kaplan 2020, Chinchilla 2022)
  - Answer every common transformer interview question with depth

Architecture Map

Lesson Map

#	Lesson	What You Learn
01	Attention Is All You Need	The 2017 paper - background, motivation, full architecture overview
02	Self-Attention Mechanism	Q/K/V, scaled dot-product, NumPy from scratch
03	Multi-Head Attention	Parallel heads, concatenation, parameter counting
04	Positional Encoding	Sinusoidal, learned, RoPE, ALiBi - when to use what
05	Feed-Forward Layers	FFN role, GELU vs SwiGLU, FFN as memory
06	Layer Norm and Residuals	Skip connections, pre-norm vs post-norm, RMSNorm
07	Encoder vs Decoder vs Encoder-Decoder	BERT, GPT, T5 - architecture choice guide
08	Tokenization Deep Dive	BPE from scratch, byte-level BPE, tokenizer artifacts
09	Embedding Spaces	Lookup tables, weight tying, anisotropy, t-SNE
10	Scaling Laws	Kaplan (2020), Chinchilla (2022), emergent abilities

Prerequisites

  - Basic understanding of neural networks (forward pass, backpropagation)
  - Python proficiency - you should be comfortable reading NumPy and PyTorch code
  - Linear algebra fundamentals (matrix multiplication, dot products)
  - Some exposure to NLP (tokens, vocabulary, language models)

Key Concepts at a Glance

Concept	Quick Definition
Attention	A mechanism to compute a weighted sum of values, where weights depend on query-key similarity
Self-attention	Attention where Q, K, V all come from the same sequence
d_model	The embedding dimension - typically 512, 768, 1024, or 4096
d_k	Key/query dimension per head - usually d_model / num_heads
Positional encoding	Position information injected into the model, classically as a vector added to the token embeddings
Layer normalization	Normalizing activations across features (not the batch dimension)
Residual connection	Adding the input of a sublayer to its output: x + Sublayer(x)
Causal mask	A mask that prevents tokens from attending to future tokens
BPE	Byte-Pair Encoding - a subword tokenization algorithm used by many modern LLMs
Scaling laws	Power-law relationships between model loss and parameter count (N), dataset size in tokens (D), and training compute (C)
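
Several of the definitions above fit together in a few lines of NumPy. The sketch below is only an illustration, not the implementation used in the lessons: it assumes a single attention head, random weights, a layer norm without learnable scale and shift, and a pre-norm arrangement. The function names and the output projection `w_o` are hypothetical and chosen here for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature axis, not over the batch.
    # Learnable scale and shift are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, w_q, w_k, w_v):
    # Self-attention: Q, K and V are all projections of the same sequence x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Scaled dot-product scores: query-key similarity divided by sqrt(d_k).
    scores = (q @ k.T) / np.sqrt(d_k)
    # Causal mask: position i may not attend to any position j > i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    # The output is a weighted sum of the values.
    return weights @ v, weights

# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_k = d_model // num_heads  # key/query dimension per head

x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
w_o = rng.normal(size=(d_k, d_model))  # hypothetical projection back to d_model

# Pre-norm residual connection: x + Sublayer(LayerNorm(x)).
attn_out, weights = causal_self_attention(layer_norm(x), w_q, w_k, w_v)
y = x + attn_out @ w_o
```

Here `weights` is lower triangular and each row sums to 1, so the first token can attend only to itself, and `y` keeps the shape `(seq_len, d_model)` of the input, which is what lets residual sublayers be stacked.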

:::tip Start here if you're new to transformers
Lesson 01 gives you the full picture before you dive into the details. Even if you've "heard of transformers before", read it - the historical context will make every other lesson click faster.
:::
