
Module 01: Transformer Architecture

Every large language model you use today - GPT-4, Claude, Gemini, LLaMA - runs on the same foundation: the transformer. Understanding it deeply is not optional for ML engineers. It is table stakes.

This module tears the transformer apart, component by component. Not just "what it does" but why it was designed that way, what came before and failed, and how each piece fits into the whole. By the end, you will be able to implement a transformer from scratch and explain every design decision in an interview.


What You Will Learn

After this module, you will be able to:

- Explain why RNNs failed at scale and how the transformer replaced them
- Implement self-attention and multi-head attention from scratch in NumPy
- Derive the scaled dot-product attention formula from first principles
- Compare sinusoidal, learned, and RoPE positional encodings
- Decide whether to use encoder-only, decoder-only, or encoder-decoder
- Build BPE tokenization from scratch
- Read and interpret scaling law papers (Kaplan 2020, Chinchilla 2022)
- Answer every common transformer interview question with depth

Architecture Map


Lesson Map

| # | Lesson | What You Learn |
|---|--------|----------------|
| 01 | Attention Is All You Need | The 2017 paper - background, motivation, full architecture overview |
| 02 | Self-Attention Mechanism | Q/K/V, scaled dot-product, NumPy from scratch |
| 03 | Multi-Head Attention | Parallel heads, concatenation, parameter counting |
| 04 | Positional Encoding | Sinusoidal, learned, RoPE, ALiBi - when to use what |
| 05 | Feed-Forward Layers | FFN role, GeLU vs SwiGLU, FFN as memory |
| 06 | Layer Norm and Residuals | Skip connections, pre-norm vs post-norm, RMSNorm |
| 07 | Encoder vs Decoder vs Encoder-Decoder | BERT, GPT, T5 - architecture choice guide |
| 08 | Tokenization Deep Dive | BPE from scratch, byte-level BPE, tokenizer artifacts |
| 09 | Embedding Spaces | Lookup tables, weight tying, anisotropy, t-SNE |
| 10 | Scaling Laws | Kaplan (2020), Chinchilla (2022), emergent abilities |

Prerequisites

- Basic understanding of neural networks (forward pass, backpropagation)
- Python proficiency - you should be comfortable reading NumPy and PyTorch code
- Linear algebra fundamentals (matrix multiplication, dot products)
- Some exposure to NLP (tokens, vocabulary, language models)

Key Concepts at a Glance

| Concept | Quick Definition |
|---------|------------------|
| Attention | A mechanism to compute a weighted sum of values, where weights depend on query-key similarity |
| Self-attention | Attention where Q, K, V all come from the same sequence |
| d_model | The embedding dimension - typically 512, 768, 1024, or 4096 |
| d_k | Key/query dimension per head - usually d_model / num_heads |
| Positional encoding | Position information injected into the model, classically as a vector added to the embeddings (RoPE instead rotates queries and keys) |
| Layer normalization | Normalizing activations across features (not the batch dimension) |
| Residual connection | Adding the input of a sublayer to its output: x + Sublayer(x) |
| Causal mask | A mask that prevents tokens from attending to future tokens |
| BPE | Byte-Pair Encoding - the dominant subword tokenization algorithm |
| Scaling laws | Power-law relationships between model performance and N (parameter count), D (dataset size in tokens), and C (training compute) |
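
Several of the definitions above fit together in a few lines of NumPy. The block below is a minimal sketch, not the implementation used in the lessons: it computes single-head causal self-attention as softmax(QK^T / sqrt(d_k))V, then applies a residual connection and a layer norm. The function names, the toy sizes (`seq_len = 4`, `d_model = 8`), the random weights, and the choice of post-norm without learnable gain and bias are all assumptions made for illustration. With a single head, d_k equals d_model.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: Q, K, V all come from the same sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # each has shape (seq_len, d_k)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # scaled query-key similarity
    seq_len = x.shape[0]
    # Causal mask: True above the diagonal marks "future" positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights              # weighted sum of values


def layer_norm(x, eps=1e-5):
    # Normalize across the feature axis of each token, not across the batch.
    # Learnable gain and bias are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                      # toy sizes, assumed for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

attn_out, weights = causal_self_attention(x, w_q, w_k, w_v)
y = layer_norm(x + attn_out)                 # residual connection, then layer norm
```

Printing `weights` shows a lower-triangular matrix: the first token can only attend to itself, and every later token spreads its attention over itself and the tokens before it.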

:::tip Start here if you're new to transformers
Lesson 01 gives you the full picture before you dive into the details. Even if you've "heard of transformers before", read it - the historical context will make every other lesson click faster.
:::

© 2026 EngineersOfAI. All rights reserved.