11 docs tagged with "transformer-architecture"

Attention Is All You Need

The 2017 Vaswani et al. paper that replaced recurrent networks with pure attention - and why it changed everything.

How token embeddings form a dense vector space that captures semantic meaning - geometry, anisotropy, weight tying, and visualization.

Comparing encoder-only, decoder-only, and encoder-decoder transformer architectures - when to use each and why decoder-only won.

The role of position-wise feed-forward networks in transformers - from the basic FFN to SwiGLU and Mixture of Experts.

How layer normalization and residual connections stabilize gradient flow in deep transformers and enable training of networks with 100+ layers.

A complete guide to the transformer architecture - the foundation of nearly every modern large language model.

How multi-head attention lets transformers jointly attend to information from multiple representation subspaces.

How positional encodings inject sequence order information into transformers - from sinusoidal to RoPE.

Empirical power-law relationships between LLM performance and compute, data, and parameters - from Kaplan (2020) to Chinchilla (2022) and beyond.

How self-attention computes query, key, and value interactions to capture long-range dependencies between tokens.

How tokenizers convert raw text to token IDs - BPE from scratch, WordPiece, byte-level BPE, and the surprising ways tokenization breaks models.