Attention Is All You Need
The 2017 Vaswani et al. paper that replaced recurrent networks with pure attention - and why it changed everything.
How token embeddings form a dense vector space that captures semantic meaning - geometry, anisotropy, weight tying, and visualization.
Comparing encoder-only, decoder-only, and encoder-decoder transformer architectures - when to use each and why decoder-only won.
The role of position-wise feed-forward networks in transformers - from the basic FFN to SwiGLU and Mixture of Experts.
How layer normalization and residual connections solve gradient flow in deep transformers and enable training of 100+ layer networks.
A complete guide to the transformer architecture - the foundation of every modern large language model.
How multi-head attention enables transformers to jointly attend to information from multiple representation subspaces simultaneously.
How positional encodings inject sequence order information into transformers - from sinusoidal to RoPE.
Empirical power-law relationships between LLM performance and compute, data, and parameters - from Kaplan (2020) to Chinchilla (2022) and beyond.
How self-attention computes query, key, and value interactions to capture long-range dependencies between tokens.
How tokenizers convert raw text to token IDs - BPE from scratch, WordPiece, byte-level BPE, and the surprising ways tokenization breaks models.