11 docs tagged with "transformer-architecture"

Attention Is All You Need

The 2017 Vaswani et al. paper that replaced recurrence with an attention-based architecture - and why it reshaped sequence modeling.

Embedding Spaces

How token embeddings form a dense vector space that captures semantic meaning - geometry, anisotropy, weight tying, and visualization.

Feed-Forward Layers

The role of position-wise feed-forward networks in transformers - from the basic FFN to SwiGLU and Mixture of Experts.

Multi-Head Attention

How multi-head attention lets transformers attend jointly to information from multiple representation subspaces.

Positional Encoding

How positional encodings inject sequence order information into transformers - from sinusoidal to RoPE.

Scaling Laws

Empirical power-law relationships between LLM performance and compute, data, and parameters - from Kaplan et al. (2020) to Chinchilla (2022) and beyond.

Self-Attention Mechanism

How self-attention computes query, key, and value interactions to capture long-range dependencies between tokens.

Tokenization Deep Dive

How tokenizers convert raw text to token IDs - BPE from scratch, WordPiece, byte-level BPE, and the surprising ways tokenization breaks models.