What is tokenization?

How tokenizers convert raw text to token IDs - BPE from scratch, WordPiece, byte-level BPE, and the surprising ways tokenization breaks models.

How does byte pair encoding BPE work in practice?

Tokenization Deep Dive covers tokenization, byte pair encoding BPE, WordPiece tokenizer from first principles with code examples. Free lesson at https://engineersofai.com/docs/llms/transformer-architecture/tokenization-deep-dive

What is the difference between tokenization and WordPiece tokenizer?

See the full breakdown at https://engineersofai.com/docs/llms/transformer-architecture/tokenization-deep-dive

Tokenization Deep Dive

Reading time: ~40 min · Interview relevance: High · Target roles: ML Engineer, AI Engineer, Research Engineer

The Bug in Production at 3AM

The on-call engineer got paged at 3:17 AM. The LLM-powered customer support system had started producing responses that were partially garbled - mixing English words with strange, seemingly random substrings. The model itself hadn't changed. The only thing that had changed was a patch to the text preprocessing pipeline.

The patch was supposed to strip HTML from customer messages before sending to the API. It did - but it also stripped something else: it was converting certain Unicode characters (specifically, em-dashes "-" and curly quotes "' '") to their ASCII equivalents.

The tokenizer had been trained on raw internet text that included em-dashes and curly quotes extensively. When the model was fine-tuned for customer support, those characters appeared in the training data. When production preprocessing converted them to ASCII hyphens and straight quotes, the resulting token IDs were completely different. The model had learned associations between specific token sequences and appropriate responses - and now those sequences were broken.

The model was getting the right words, but the wrong tokens. In the model's "mind", an em-dash and a hyphen are completely different inputs, as different as "cat" and "dxt".

This is the dark side of tokenization: it's not just a preprocessing step. It is a fundamental architectural choice that bakes specific assumptions into the model. Change the tokenizer, and you change the model's behavior, even if the "text" is the same.

What Is a Token?

A token is the atomic unit of input to a language model. It is not a word. It is not a character. It is not a sentence. It is a chunk of text defined by the tokenizer.

Depending on the tokenizer:

"cat" might be 1 token
"unbelievable" might be 3 tokens: "un", "believ", "able"
"的" (a common Chinese character) might be 1 token in a multilingual model or 3 bytes in a byte-level model
" " (5 spaces) might be 1 token (GPT tokenizers merge whitespace)
"SolidGoldMagikarp" might be 1 token (it appeared as a single unit in training data)

The model processes tokens as atomic units. It has no built-in notion of character-level structure - "cat" and "bat" are not more similar to the model than "cat" and "xyz", unless the model learned token co-occurrence patterns during training.

This has profound implications:

Arithmetic is hard: "99" + "1" = "100" - the model must reason about character-level structure that doesn't exist at the token level
Spelling is hard: spelling out "cat" as "c", "a", "t" requires 3 tokens - the model must reverse engineer character decomposition from token-level patterns
Code is token-aware: Python indentation is whitespace, which tokenizers handle inconsistently. 4 spaces and a tab produce different tokens but mean the same thing syntactically.

The Problem Tokenization Solves

Three approaches to segmenting text, each with fatal flaws:

Character-level tokenization: Each character is a token. Clean, no OOV (out of vocabulary) issues, small vocabulary (~256 for ASCII). Fatal flaw: sequences are very long (average English word is 4-5 characters), making attention quadratically expensive. A 512-word document is ~2,500 characters - already at the limit for most models.

Word-level tokenization: Each word is a token. Short sequences, aligns with human intuition. Fatal flaw: vocabulary explosion. English has 100,000+ words; with inflection forms, technical terms, and code, you need 500K+ tokens. Rare words are "out of vocabulary" (OOV) - they must be treated as [UNK] (unknown). The model cannot handle new words or proper nouns it hasn't seen.

Subword tokenization (BPE, WordPiece, SentencePiece): The middle ground. Split text into subword units - common words stay whole, rare words are split into frequent components. "unbelievable" → ["un", "believ", "able"]. Common parts like "un" and "able" appear in many words and are learned efficiently. Vocabulary size is manageable (30K-100K). No OOV - everything can be split into characters or bytes at worst.

Subword tokenization won. Every modern LLM uses some variant.

Byte Pair Encoding (BPE)

BPE was originally a data compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for NLP. GPT-2, GPT-3, and GPT-4 use a byte-level variant.

The Algorithm

Step 1: Start with a vocabulary of individual characters (or bytes for byte-level BPE).

Step 2: Represent training corpus as a sequence of characters, with word boundaries marked.

Step 3: Count the frequency of all adjacent character pairs (bigrams).

Step 4: Merge the most frequent bigram into a new symbol. Add this symbol to vocabulary.

Step 5: Repeat steps 3-4 until vocabulary reaches target size (e.g., 50,000 merges).

The merge rules are the "learned vocabulary" - ordered by frequency in training data. To tokenize new text, apply the merge rules in the same order.

BPE Implementation from Scratch

from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def get_vocab(corpus: List[str]) -> Dict[str, int]:
    """
    Convert corpus to word frequency dict.
    Each word is represented as space-separated characters + </w> end marker.
    'cat' -> 'c a t </w>'
    """
    vocab = Counter()
    for text in corpus:
        for word in text.split():
            # Add end-of-word marker, represent as char sequence
            vocab[' '.join(list(word) + ['</w>'])] += 1
    return dict(vocab)


def get_stats(vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    """Count frequency of all symbol pairs in vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return dict(pairs)


def merge_vocab(pair: Tuple[str, str], vocab: Dict[str, int]) -> Dict[str, int]:
    """Merge the best pair in the vocabulary."""
    new_vocab = {}
    # Create a pattern to match the pair (with space boundaries)
    bigram = ' '.join(pair)
    replacement = ''.join(pair)

    for word, freq in vocab.items():
        # Replace the pair in the word representation
        new_word = word.replace(bigram, replacement)
        new_vocab[new_word] = freq

    return new_vocab


def learn_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """
    Learn BPE merge operations from corpus.
    Returns the ordered list of merge operations.
    """
    vocab = get_vocab(corpus)
    merges = []

    print(f"Initial vocab size: {len(set(' '.join(vocab.keys()).split()))}")

    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break

        # Find the most frequent pair
        best_pair = max(pairs, key=pairs.get)
        best_count = pairs[best_pair]

        if best_count < 2:
            break  # No pair appears more than once - stop

        # Merge the best pair
        vocab = merge_vocab(best_pair, vocab)
        merges.append(best_pair)

        if i < 10 or i % 10 == 0:
            print(f"Merge {i+1}: {best_pair} (count={best_count})")

    return merges, vocab


def bpe_encode(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """
    Encode a single word using learned BPE merges.
    Returns list of subword tokens.
    """
    # Start with character-level representation + end marker
    tokens = list(word) + ['</w>']

    for merge_pair in merges:
        i = 0
        new_tokens = []
        while i < len(tokens):
            if (i < len(tokens) - 1 and
                tokens[i] == merge_pair[0] and
                tokens[i + 1] == merge_pair[1]):
                # Merge this pair
                new_tokens.append(merge_pair[0] + merge_pair[1])
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    return tokens


# Example: learn BPE on a small corpus
corpus = [
    "low lower lowest",
    "new newer newest",
    "wide wider widest",
    "large larger largest",
    "old older oldest",
    "low low low low",
    "new new new",
    "wide wide",
    "slow slower slowest",
    "fast faster fastest",
]

# Learn 15 merge operations
print("=== Learning BPE ===")
merges, final_vocab = learn_bpe(corpus, num_merges=15)

print(f"\nLearned {len(merges)} merge operations:")
for i, merge in enumerate(merges[:10]):
    print(f"  {i+1}. {merge[0]} + {merge[1]} -> {merge[0]+merge[1]}")

# Encode some words
print("\n=== Encoding words ===")
test_words = ["low", "lower", "lowest", "newest", "unknown"]
for word in test_words:
    tokens = bpe_encode(word, merges)
    print(f"  '{word}' -> {tokens}")

WordPiece (BERT's Tokenizer)

WordPiece (Schuster & Nakamura, 2012; used in BERT) is similar to BPE but uses a different merge criterion.

Instead of merging the most frequent pair, WordPiece merges the pair that maximizes the language model likelihood of the training data. It selects the merge that most improves the language model's probability estimate.

Key difference from BPE: WordPiece uses the ## prefix to indicate that a subword is a continuation of the previous one:

"playing" might tokenize to ["play", "##ing"]
"unbelievable" might tokenize to ["un", "##believe", "##able"]

from transformers import BertTokenizer

# BERT-base-uncased tokenizer (WordPiece, ~30K vocab)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

examples = [
    "The quick brown fox jumps over the lazy dog.",
    "unbelievable",
    "tokenization",
    "SolidGoldMagikarp",  # Interesting edge case
    "1234567890",
    "def calculate_loss(logits, labels):",
]

for text in examples:
    tokens = tokenizer.tokenize(text)
    ids = tokenizer.encode(text)
    print(f"\nText: '{text}'")
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}, IDs: {ids[:10]}{'...' if len(ids)>10 else ''}")

Byte-Level BPE (GPT-2, GPT-3, GPT-4)

Standard BPE faces a problem: any character not in the base vocabulary becomes [UNK]. Non-ASCII characters (Chinese, Arabic, emoji) that are rare in training data might be unknown.

GPT-2's solution (Radford et al., 2019): byte-level BPE. Instead of starting with characters, start with the 256 possible byte values. Every possible input (any language, any emoji, any byte sequence) can be represented - there are no unknown tokens.

# GPT-2 tokenizer: byte-level BPE with 50,257 tokens
# (50,000 BPE merges + 256 byte tokens + 1 special token)
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

examples = [
    "Hello, world!",
    "SolidGoldMagikarp",        # Famous tokenizer artifact
    "日本語",                    # Japanese - not in most word tokenizers
    "🚀 rocket emoji",
    "    4 spaces of indentation",  # GPT tokenizers are whitespace-aware
    "python\ncode\nhere",        # newlines are tokens
    "1+1=2",
]

for text in examples:
    tokens = tokenizer.encode(text)
    decoded = [tokenizer.decode([t]) for t in tokens]
    print(f"\nText: '{text}'")
    print(f"Token IDs: {tokens}")
    print(f"Decoded tokens: {decoded}")
    print(f"Token count: {len(tokens)}")

# Show that byte-level means no UNK
random_bytes = bytes(range(256))
byte_text = random_bytes.decode('latin-1', errors='replace')
try:
    tokens = tokenizer.encode(byte_text)
    print(f"\nAll 256 bytes encoded successfully: {len(tokens)} tokens (no UNK)")
except Exception as e:
    print(f"Error: {e}")

Vocabulary Size Trade-offs

Vocab Size	Sequence Length	Coverage	Training Cost
~256 (bytes)	Very long	Complete	Low
~5,000	Long	Good	Low
~32,000 (BERT)	Medium	Good for English	Medium
~50,257 (GPT-2)	Medium	Good, byte-level	Medium
~100,000+	Short	Excellent	High
~128,256 (LLaMA-3)	Short	Multilingual	High

Small vocabulary (less than 10K):

Longer sequences (more tokens per word → more attention computation)
More token/character decomposition for rare words
Good for multi-lingual: common byte subwords are shared across languages

Large vocabulary (more than 100K):

Shorter sequences (common words are single tokens)
Rare tokens appear infrequently - model has less training signal for them
Larger embedding table (parameter cost)

LLaMA-3's 128K vocabulary: Meta expanded from LLaMA-2's 32K to 128K for LLaMA-3. This dramatically reduces sequence length for non-English languages (Chinese, Arabic, etc.) which were split into many tokens in the 32K tokenizer. Multilingual tasks benefit significantly.

The "SolidGoldMagikarp" Problem and Tokenizer Artifacts

In 2023, researchers discovered that certain tokens in GPT-2/GPT-3's tokenizer would cause the model to behave erratically - producing garbled output, looping, or refusing to respond.

"SolidGoldMagikarp" was one. When used as a prompt by itself, it would cause GPT models to produce unrelated, strange outputs. Other examples: "petertodd", "cloneembedreportprint", "StreamerBot".

Why does this happen? These strings were tokenized as single tokens in the BPE vocabulary (they appeared as intact strings in training data - subreddit names, usernames, code patterns). However, the string itself almost never appeared in the model's actual training text - it was a token that existed in the vocabulary but had almost no training signal.

When the model encounters a token it has "seen" (it's in the vocabulary) but never seen in meaningful context, its behavior is undefined. The embedding for that token was updated so rarely during training that it may point to an arbitrary location in embedding space.

This is a practical tokenizer hygiene issue:

Tokens with near-zero training frequency are "ghost tokens" - they exist but the model doesn't know what to do with them
High-frequency tokens for non-English scripts may be underrepresented (a token that appears in Chinese but the model was trained mostly on English)
The model's vocabulary must be curated during training data cleaning

Multilingual Tokenization Challenges

For a multilingual model (supporting English, Chinese, Arabic, Hindi, etc.):

Challenge 1: Sequence length disparity. English tokenizes efficiently. Chinese characters, even with BPE, often remain single tokens - but Chinese has many thousands of characters. A Chinese sentence might use 50% more tokens than the equivalent English text. This means:

Chinese users get less context window per "semantic unit"
Chinese inference costs more (more tokens = more FLOPs)

Challenge 2: Script coverage. If 95% of training data is English, rare characters from other scripts may appear infrequently in the vocabulary and have poor embeddings.

Challenge 3: Morphologically rich languages. Finnish, Turkish, and other agglutinative languages combine many morphemes into single words. "Talossanikin" (Finnish, roughly "even in my house") is one word that might tokenize into 5-6 subwords, each with specific morphological meaning. Capturing morphological structure requires a large vocabulary specific to that language.

Solutions: Large vocabularies (100K+), language-balanced training data, character-aware tokenization, or language-specific tokenizers.

SentencePiece (T5, LLaMA)

Google's SentencePiece (Kudo & Richardson, 2018) is a tokenization library that:

Treats the input as a raw character sequence (no pre-tokenization by whitespace)
Can use both BPE and Unigram Language Model algorithms
Supports all languages and scripts uniformly

Key difference from BPE/WordPiece: no whitespace pre-tokenization. Standard BPE splits on spaces first, then applies BPE within words. SentencePiece treats spaces as just another character (represented as ▁). This makes it more uniform across languages - Chinese, which has no spaces between words, is handled identically to English.

# SentencePiece in LLaMA tokenizer
from transformers import LlamaTokenizer  # or AutoTokenizer

# LLaMA uses SentencePiece with 32K vocabulary (LLaMA-1/2)
# LLaMA-3 uses tiktoken with 128K vocabulary

tokenizer = LlamaTokenizer.from_pretrained('huggyllama/llama-7b')

test = ["Hello, world!", "こんにちは", "def add(a, b):\n    return a + b"]
for text in test:
    tokens = tokenizer.tokenize(text)
    print(f"'{text}' -> {tokens}")

Production Engineering Notes

Tokenizer Consistency Is Critical

The tokenizer used for inference must exactly match the tokenizer used for training. Even a single character difference in tokenization logic can change token IDs for entire vocabulary chunks.

Checklist when deploying:

Pinned tokenizer version (use specific commit hash or artifact version)
Validate that tokenizer output for 10+ canonical test strings matches expected token IDs
Handle preprocessing consistently - strip HTML, normalize Unicode before tokenization, using the same library version as training

Token Budget Calculation

LLMs charge by tokens (API cost) and have context limits. How to estimate:

# Rule of thumb: 1 token ≈ 4 characters ≈ 0.75 words for English
# Always measure on your actual data

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Fast estimate - use actual tokenizer for precise counts."""
    return int(len(text) / chars_per_token)

# For production: always tokenize to count exactly
def count_tokens_exact(text: str, tokenizer) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))

For non-English text, chars_per_token is higher (more bytes per character → more tokens per character).

Special Token Handling

Every tokenizer defines special tokens:

[CLS], [SEP], [PAD], [MASK] - BERT
<s>, </s>, <unk>, <pad> - LLaMA/SentencePiece
<|endoftext|> - GPT-2/GPT-4

Special tokens serve structural roles: [CLS] marks the start of a BERT input for classification (the classifier head reads this token's representation). </s> in generation models indicates the end of output.

Forgetting to add required special tokens (e.g., [CLS] for BERT classification) silently degrades performance - the model's pooling logic is looking at the wrong position.

Common Mistakes

:::danger Reusing a tokenizer from a different model family If you train on tokenized data from Model A's tokenizer and run inference with Model B's tokenizer, the token IDs are completely different. Token ID 5432 in BERT's vocabulary is a different string than token ID 5432 in GPT-2's vocabulary. This is a silent, catastrophic error - the model produces garbage and the error is invisible. :::

:::danger Ignoring the max_length parameter Tokenizers have a maximum sequence length (typically 512 for BERT, 2048 for GPT-2, 4096 for LLaMA-2). Text longer than this must be truncated or chunked. Silently truncating without logging means you may be dropping important information. Always set truncation=True explicitly and log when truncation occurs. :::

:::warning Different tokenizers count tokens differently for the same text GPT-4's tokenizer (tiktoken, cl100k_base vocabulary) may split a word differently than LLaMA's SentencePiece tokenizer. If you're estimating costs (API is billed per token) using the wrong tokenizer, your estimates can be off by 20-50% for non-English text. :::

:::tip Test your tokenizer on pathological inputs Before deploying, test your tokenizer on: empty string, string of only spaces, very long string (1M chars), emoji only, code with special characters, and non-ASCII scripts from your target languages. Edge cases that cause tokenizer errors crash the whole inference pipeline. :::

Interview Q&A

Q1: Why do LLMs use subword tokenization instead of word-level or character-level?

Answer: Each approach has a fatal flaw that subword tokenization avoids:

Word-level: vocabulary must include every word, including all inflections, compound words, and domain-specific terms. English alone has 100K+ forms; technical domains add many more. Rare words become [UNK] - the model cannot handle them at all. Vocabulary of 500K+ words is impractical.

Character-level: every character is a token. No OOV problem, tiny vocabulary. Fatal flaw: sequences are 5× longer than word-level (average English word is 5 characters). For 512-word documents, that's 2,500+ tokens - at the limit of many models' context windows. Attention is $O(n^2)$ , so this is computationally expensive.

Subword (BPE, WordPiece): common words stay as single tokens. Rare words are split into known components. "unbelievable" → ["un", "believ", "able"] - each component appears in many words and is well-trained. Vocabulary of 30K-100K covers 99.9%+ of text. Sequences are shorter than character-level. No OOV problem (at worst, bytes cover everything).

Q2: Explain the BPE algorithm step by step.

Answer: BPE (Byte Pair Encoding) is a bottom-up merging algorithm:

Initialization: Start with the base vocabulary - either individual characters (standard BPE) or the 256 possible bytes (byte-level BPE). Every word in the corpus is represented as a sequence of these base units.
Count pairs: Count the frequency of all adjacent symbol pairs across the entire corpus. For example, "low", "lower", "lowest" might have high frequency for the pair ("l", "o") and ("l", "o", "w").
Merge the best pair: Find the most frequent pair. Create a new symbol by concatenating them. Add to vocabulary. Rewrite the corpus using the new symbol.
Repeat: Steps 2-3 are repeated for $n$ iterations (the number of merge operations, typically 30K-50K).
Apply to new text: Given new text, apply the learned merge operations in order. A word never seen in training can still be tokenized by decomposing it through the merge hierarchy.

The resulting vocabulary contains common words as single tokens and all the subword components needed to construct rare words.

Q3: What is the "tokenization tax" for non-English languages, and why does it matter?

Answer: Models trained predominantly on English text develop tokenizers that are efficient for English but inefficient for other languages.

A GPT-2 tokenizer with 50K vocabulary handles English at ~4 characters per token. The same tokenizer handles Chinese at ~2 characters per token (fewer merges learned for Chinese subwords, more characters left as individual bytes). Japanese, Arabic, and Hindi are similarly inefficient.

Practical consequences:

A Chinese user gets fewer "words" of context per token than an English user
API costs (billed per token) are higher for non-English users
Model performance is lower for languages with high tokenization overhead, because the model has fewer training examples per effective token

Example: "The quick brown fox" is 4 tokens in GPT-4's tokenizer. The equivalent in Chinese ("快速的棕色狐狸") might be 6+ tokens despite being semantically equivalent.

This is why LLaMA-3 expanded to 128K vocabulary - dramatically improving tokenization efficiency for non-English languages by learning better merge sequences for Chinese, Japanese, Korean, and other scripts.

Q4: What happens when you encounter an out-of-vocabulary token, and how do different tokenizers handle it?

Answer: True OOV (unknown tokens) behavior depends on the tokenizer type:

Word-level tokenizers: OOV tokens are replaced with [UNK]. The model gets a single "unknown" embedding, losing all information about the actual word. Completely unrecoverable.

Standard BPE (like BERT): OOV is impossible at the character level - if a word contains unknown characters, they can't be decomposed further. In practice, BERT uses [UNK] for bytes it can't represent with its character set.

Byte-level BPE (GPT-2, GPT-4, LLaMA-2): OOV is structurally impossible. Any text can be represented as a sequence of bytes (0-255), and all 256 bytes are in the base vocabulary. A rare character like "€" might tokenize to multiple byte tokens, but it's never unknown.

Practical implication: For multilingual applications, byte-level BPE is strongly preferred. BERT's WordPiece struggles with languages that contain characters not in its vocabulary.

Q5: Why does tokenization make arithmetic hard for LLMs, and how do they overcome it?

Answer: Tokenization creates a fundamental mismatch between how numbers are represented and how humans think about them.

Consider "1234":

In GPT-2's tokenizer, "1234" is a single token
"12345" might be one or two tokens
"1234" + "1" = "1235" requires understanding digit-level structure that doesn't exist at the token level

The model must learn to "see through" the tokenization to understand number structure. This is hard because:

"1234" as a single token has no inherent relationship to "1235" at the token level
Adding large numbers requires carrying across arbitrary token boundaries
The model has no built-in arithmetic operator - it must learn it from patterns

How LLMs handle it:

Training data: Many examples of arithmetic in training. The model learns patterns like "X + Y = Z" from examples.
Chain-of-thought prompting: Breaking arithmetic into steps (calculate digit by digit) that the model can follow
Tool use: Route arithmetic to a calculator. The model identifies arithmetic tasks and calls a Python interpreter.

At large scale (GPT-4), LLMs can handle multi-digit arithmetic surprisingly well through learned patterns, but they're still unreliable for very large numbers or unusual arithmetic formats - precisely because of tokenization mismatch.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the BPE Tokenisation demo on the EngineersOfAI Playground - no code required.

:::

The Bug in Production at 3AM​

What Is a Token?​

The Problem Tokenization Solves​

Byte Pair Encoding (BPE)​

The Algorithm​

BPE Implementation from Scratch​

WordPiece (BERT's Tokenizer)​

Byte-Level BPE (GPT-2, GPT-3, GPT-4)​

Vocabulary Size Trade-offs​

The "SolidGoldMagikarp" Problem and Tokenizer Artifacts​

Multilingual Tokenization Challenges​

SentencePiece (T5, LLaMA)​

Production Engineering Notes​

Tokenizer Consistency Is Critical​

Token Budget Calculation​

Special Token Handling​

Common Mistakes​

Interview Q&A​

Q1: Why do LLMs use subword tokenization instead of word-level or character-level?​

Q2: Explain the BPE algorithm step by step.​

Q3: What is the "tokenization tax" for non-English languages, and why does it matter?​

Q4: What happens when you encounter an out-of-vocabulary token, and how do different tokenizers handle it?​

Q5: Why does tokenization make arithmetic hard for LLMs, and how do they overcome it?​