BERT - Bidirectional Pre-Training That Redefined NLP

Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Data Scientist

The Real Interview Moment

You are in a Google NLP team interview. The senior researcher asks: "BERT uses masked language modeling. Why not just use a standard language model like GPT? What does bidirectionality buy you, and what does it cost?" You start to answer, and she immediately follows up: "Why does BERT mask 15% of tokens? Why not 10% or 50%? And why the 80/10/10 split for how masked tokens are handled?"

These are not trivia questions. They test whether you understand the fundamental design tradeoffs in pre-training: bidirectional context is strictly more informative than unidirectional context for understanding tasks, but the masking approach creates a train-test mismatch that must be carefully managed. Every number in the BERT paper reflects a deliberate engineering decision, and interviewers expect you to explain the reasoning.

What You Will Master

Explain why bidirectional context matters and what it costs
Describe the masked language modeling (MLM) objective in detail
Explain the 80/10/10 masking strategy and its rationale
Describe the next sentence prediction (NSP) task and why it was later removed
Compare BERT's architecture to the original Transformer and GPT
Explain the pre-training + fine-tuning paradigm
Discuss WordPiece tokenization and special tokens
Cite BERT's results on GLUE, SQuAD, and other benchmarks
Compare BERT to its successors (RoBERTa, ALBERT, DeBERTa)

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Explain	4 - Can Derive	5 - Can Teach	Your Score
Explain why bidirectional context helps						___
Describe masked language modeling						___
Explain the 80/10/10 strategy						___
Explain next sentence prediction						___
Draw BERT's architecture						___
Describe WordPiece tokenization						___
Explain pre-training + fine-tuning						___
Compare BERT vs GPT						___
Cite BERT's benchmark results						___
Discuss BERT's limitations and successors						___

Target: All 4s and 5s before your interview.

Part 1 - The Problem: Why Bidirectional?

The Limitation of Left-to-Right Language Models

Before BERT, the dominant approach for pre-training language representations was left-to-right language modeling (as used in GPT-1):

$P(w_t | w_1, w_2, \ldots, w_{t-1})$

This is powerful for generation - you can sample the next word given previous words. But for understanding tasks (classification, question answering, named entity recognition), it is fundamentally limited: the representation of each token can only see what came before it.

Consider the sentence: "The bank of the river was steep."

A left-to-right model processing the word "bank" can only see "The" - it cannot see "river" which disambiguates the meaning. The representation of "bank" must be committed before the model knows whether it means a financial institution or a riverbank.
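
A rough way to see the difference is to list which tokens a position may draw on under a causal (left-to-right) mask versus a full bidirectional mask. The helper `visible_context` below is a hypothetical illustration written for this example, not part of any library.

```python
def visible_context(tokens, index, bidirectional):
    """Return the tokens that position `index` can draw on (excluding itself)."""
    if bidirectional:
        # Encoder-style attention: every other position is visible.
        return [t for i, t in enumerate(tokens) if i != index]
    # Causal (left-to-right) attention: only earlier positions are visible.
    return tokens[:index]


sentence = ["The", "bank", "of", "the", "river", "was", "steep"]
bank = sentence.index("bank")

left_to_right = visible_context(sentence, bank, bidirectional=False)
both_sides = visible_context(sentence, bank, bidirectional=True)

print("left-to-right:", left_to_right)  # ['The']
print("bidirectional:", both_sides)     # includes 'river'
```

Only the bidirectional view contains "river", the word that settles which sense of "bank" is meant.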

Figure: BERT Bidirectional vs Unidirectional Context

Why Not Just Use a Bidirectional RNN?

ELMo (Peters et al., 2018) tried a different approach: train a forward LSTM and a backward LSTM independently, then concatenate their representations. But this is "shallowly bidirectional" - the forward and backward representations are computed independently and only concatenated at the end. The two directions never interact inside the network, so no layer can combine left and right context when building a token's representation.

BERT achieves "deeply bidirectional" representations by using a Transformer encoder where every layer's self-attention can attend to all positions in both directions simultaneously.

60-Second Answer

"BERT's key innovation is deeply bidirectional pre-training using the Transformer encoder. Previous language model approaches - including GPT-1 - could only condition on left context because predicting the next word requires not seeing it. BERT solves this with masked language modeling: randomly select 15% of tokens, hide most of them, and predict the originals, so every position can attend to context on both sides without trivially seeing its own answer. This produces richer representations for understanding tasks like classification and question answering, where context from both sides of a word matters."

Part 2 - Masked Language Modeling (MLM)

The Core Idea

Since a standard language model cannot be bidirectional (seeing the word you are trying to predict would be cheating), BERT introduces a different pre-training objective: randomly mask some input tokens and predict them.

$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{masked}} \log P(w_i | w_{\backslash \text{masked}})$

This is essentially a fill-in-the-blank task, analogous to the Cloze task in linguistics.
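
As a minimal sketch of the objective above, the loss can be computed by summing negative log-probabilities over the selected positions only. The function name `mlm_loss`, the `IGNORE` sentinel, and the toy probabilities are assumptions made up for this illustration.

```python
import math

IGNORE = -100  # label value for positions that do not contribute to the loss


def mlm_loss(log_probs, labels):
    """Sum of negative log-likelihoods over the selected positions only.

    log_probs: one dict per position mapping token -> log P(token | corrupted input)
    labels:    original token at selected positions, IGNORE elsewhere
    """
    total = 0.0
    for dist, label in zip(log_probs, labels):
        if label == IGNORE:
            continue  # unselected positions are skipped, as in the sum over "masked"
        total -= dist[label]
    return total


# Toy model output for the input "my dog is [MASK]" (made-up probabilities).
log_probs = [
    {"my": math.log(0.9)},
    {"dog": math.log(0.8)},
    {"is": math.log(0.9)},
    {"hairy": math.log(0.25), "big": math.log(0.5)},
]
labels = [IGNORE, IGNORE, IGNORE, "hairy"]

loss = mlm_loss(log_probs, labels)
print(round(loss, 4))  # equals -log(0.25)
```

Only the last position carries a label, so the confident predictions at the other three positions do not affect the loss.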

The 15% Masking Rate

BERT randomly selects 15% of tokens for prediction. Why 15%?

Too low (e.g., 5%): Each training example provides very little signal. Training would need many more steps to converge because you only get gradients from 5% of positions.

Too high (e.g., 50%): Too little context remains for the model to make meaningful predictions. The task becomes more like denoising than language understanding. Also, the train-test distribution mismatch worsens (50% of tokens are masked in training, 0% at test time).

15% is a tradeoff: Enough signal per example for efficient training, enough context for meaningful predictions, and a manageable train-test mismatch.
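
The tradeoff can be made concrete with simple arithmetic. The sketch below assumes a 512-token sequence and compares the expected number of prediction targets with the number of positions left unselected at each rate; `masking_budget` is a hypothetical helper.

```python
def masking_budget(seq_len, mask_rate):
    """Expected selected positions and unselected context for one sequence."""
    predicted = seq_len * mask_rate
    context = seq_len - predicted
    return predicted, context


for rate in (0.05, 0.15, 0.50):
    predicted, context = masking_budget(512, rate)
    print(f"rate={rate:.2f}: ~{predicted:.0f} predictions, ~{context:.0f} context tokens")
```

Raising the rate buys more targets per sequence at the direct expense of the context available to predict each one.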

The 80/10/10 Strategy

Of the 15% of tokens selected for prediction, BERT does not always replace them with [MASK]:

Action	Percentage	Example ("my dog is hairy")	Rationale
Replace with [MASK]	80%	"my dog is [MASK]"	Standard masking - forces prediction from context
Replace with random token	10%	"my dog is apple"	Forces the model to maintain a representation of every token, since any token might be wrong
Keep original token	10%	"my dog is hairy"	Reduces the train-test mismatch because the model sees real tokens during fine-tuning, not [MASK] tokens

import random

def bert_masking(tokens, vocab, mask_prob=0.15):
    """
    Apply BERT's masking strategy.

    Returns:
        masked_tokens: tokens with masking applied
        labels: original token at masked positions, -100 elsewhere
    """
    masked_tokens = tokens.copy()
    labels = [-100] * len(tokens)  # -100 = ignore in loss

    for i in range(len(tokens)):
        if random.random() < mask_prob:
            labels[i] = tokens[i]  # Store original for loss computation

            rand = random.random()
            if rand < 0.8:
                # 80%: replace with [MASK]
                masked_tokens[i] = "[MASK]"
            elif rand < 0.9:
                # 10%: replace with random token
                masked_tokens[i] = random.choice(vocab)
            else:
                # 10%: keep original
                pass  # masked_tokens[i] stays the same

    return masked_tokens, labels


# Example
vocab = ["the", "a", "dog", "cat", "is", "was", "big", "small", "hairy", "my"]
tokens = ["my", "dog", "is", "hairy"]

# Run multiple times to see different masking patterns
for trial in range(5):
    masked, labels = bert_masking(tokens, vocab)
    print(f"Trial {trial+1}: {masked} | Labels: {labels}")

Common Trap

Do not say "BERT replaces 15% of tokens with [MASK]." This is incomplete. BERT selects 15% of tokens for prediction, but only 80% of those are replaced with [MASK]. The remaining 20% are either replaced with a random token (10%) or kept as-is (10%). This nuance matters because it addresses the train-test mismatch: [MASK] never appears during fine-tuning.
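
Multiplying the selection rate by the 80/10/10 split gives the share of all input tokens that ends up in each state. This is a small worked calculation, with variable names chosen for the example.

```python
SELECT_RATE = 0.15  # fraction of tokens chosen for prediction
SPLIT = {"[MASK]": 0.8, "random token": 0.1, "unchanged": 0.1}

# Probability that any given input token ends up in each state.
per_token = {action: SELECT_RATE * share for action, share in SPLIT.items()}
per_token["not selected"] = 1.0 - SELECT_RATE

for action, p in per_token.items():
    print(f"{action:>13}: {p:.1%}")
```

So roughly 12% of input tokens actually become [MASK], while about 1.5% are swapped for a random token and about 1.5% are left unchanged but still predicted.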

Why This Matters: The Train-Test Mismatch

The [MASK] token is used during pre-training but never appears during fine-tuning or inference. If 100% of selected tokens were replaced with [MASK], the model would learn to extract information only when it sees [MASK], which is a distribution it never encounters downstream.

The 10% random replacement and 10% keep-original strategies mitigate this by teaching the model that it needs to maintain useful representations of all tokens, not just [MASK] positions.

Instant Rejection

If asked "What is the downside of masked language modeling compared to autoregressive LM?" and you cannot answer, it is a red flag. The main downside is that BERT only predicts 15% of tokens per forward pass, while an autoregressive LM predicts all tokens. This means BERT needs roughly $100/15 \approx 6.7\times$ more data to see the same number of prediction tasks, making it less sample-efficient for pre-training.

Part 3 - Next Sentence Prediction (NSP)

The Task

BERT's second pre-training objective is next sentence prediction (NSP). Given two sentences A and B:

50% of the time, B is the actual next sentence after A (label: IsNext)
50% of the time, B is a random sentence from the corpus (label: NotNext)

The model predicts whether B follows A using the [CLS] token representation.
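
The sampling procedure above might be sketched as follows. This is an illustration under assumptions: `make_nsp_example` is a hypothetical helper, the corpus is a toy list of pre-split sentences, and negatives are drawn from a different document.

```python
import random


def make_nsp_example(documents, rng):
    """Build one (sentence_a, sentence_b, label) triple for NSP.

    documents: list of documents, each a list of at least two sentences.
    """
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)
    sentence_a = doc[i]

    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"

    # Negative example: draw B from a different document (an assumption here).
    other_idx = rng.choice([j for j in range(len(documents)) if j != doc_idx])
    sentence_b = rng.choice(documents[other_idx])
    return sentence_a, sentence_b, "NotNext"


docs = [
    ["I went to the store.", "I bought milk.", "Then I walked home."],
    ["BERT is an encoder.", "It is pre-trained with MLM."],
]
rng = random.Random(0)
for _ in range(4):
    print(make_nsp_example(docs, rng))
```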

Why NSP Was Included

The authors hypothesized that many downstream tasks (question answering, natural language inference) require understanding the relationship between two sentences, which a word-level objective like MLM does not directly capture.

Why NSP Was Later Removed

RoBERTa (Liu et al., 2019) showed that NSP does not actually help - and may hurt performance. The reasons:

The task is too easy. Random sentences from different documents are so obviously unrelated that the model does not learn meaningful sentence relationships. It mainly learns topic matching.
It conflates topic and coherence. Because negative examples are drawn from other documents, the model can often separate IsNext from NotNext by topic alone, without learning whether B coherently follows A.
MLM alone is sufficient. The bidirectional context in MLM already captures sentence-level relationships implicitly.

Model	NSP	MNLI	QNLI	SST-2
BERT	Yes	84.6	90.5	93.5
RoBERTa	No	87.6	92.8	94.8

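RoBERTa removed NSP and scored higher on all three tasks, but this head-to-head comparison is confounded: RoBERTa also trained on about 10x more data, for longer, with larger batches and dynamic masking. The cleaner evidence is RoBERTa's controlled ablation, in which dropping NSP matched or slightly improved results once the input format was held fixed.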

Part 4 - Architecture Details

BERT's Relation to the Transformer

BERT uses only the encoder portion of the original Transformer architecture. There is no decoder - BERT is not designed for generation.

Figure: BERT vs Original Transformer Architecture

Model Sizes

Parameter	BERT-Base	BERT-Large
Layers ( $L$ )	12	24
Hidden size ( $H$ )	768	1024
Attention heads ( $A$ )	12	16
Feed-forward size	3072	4096
Total parameters	110M	340M
Training data	16GB (BooksCorpus + English Wikipedia)	Same
Training time	4 days on 4 Cloud TPUs	4 days on 16 Cloud TPUs
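
The parameter totals can be roughly reproduced from the table's dimensions. The sketch below assumes the standard layout (token, position, and segment embeddings with a LayerNorm; per-layer attention and feed-forward blocks with biases and two LayerNorms; a pooler on [CLS]) and leaves out the pre-training heads. `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(layers, hidden, ff, vocab=30522, max_pos=512, segments=2):
    """Approximate parameter count for a BERT encoder (weights + biases)."""
    layer_norm = 2 * hidden                      # scale and bias
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)   # Q, K, V, and output projections
    feed_forward = (hidden * ff + ff) + (ff * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden            # dense layer applied to [CLS]
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768, ff=3072)
large = bert_param_count(layers=24, hidden=1024, ff=4096)
print(f"BERT-Base:  {base / 1e6:.1f}M")
print(f"BERT-Large: {large / 1e6:.1f}M")
```

The number of attention heads does not appear in the count because the heads split the hidden dimension rather than add to it. The results land close to the rounded 110M and 340M figures in the table.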

Input Representation

BERT's input is the sum of three embeddings:

$\text{Input} = \text{Token Embedding} + \text{Segment Embedding} + \text{Position Embedding}$

Figure: BERT Input Construction: Token + Segment + Position Embeddings

Special tokens:

[CLS] - Classification token. Its final hidden state is used for sentence-level tasks. Placed at the start.
[SEP] - Separator token. Marks the boundary between sentences.
[MASK] - Used during MLM pre-training to indicate masked positions.
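
To illustrate how the special tokens and the three index sequences line up, here is a small sketch that assembles the input for a sentence pair. `build_bert_input` is a hypothetical helper, and the tokens are assumed to be already split into WordPieces.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Assemble tokens, segment ids, and position ids for one or two segments."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segments, positions = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
for tok, seg, pos in zip(tokens, segments, positions):
    print(f"{pos:>2}  seg={seg}  {tok}")
```

Each of the three sequences indexes its own embedding table, and the looked-up vectors are summed position by position.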

WordPiece Tokenization

BERT uses WordPiece tokenization, which breaks rare words into subword units:

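# WordPiece tokenization examples
# (illustrative splits: the real output depends on the vocabulary, and frequent
# words such as "playing" may be kept as a single token)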
examples = {
    "playing":   ["play", "##ing"],
    "unhappiness": ["un", "##hap", "##pi", "##ness"],
    "transformer": ["transform", "##er"],
    "the":        ["the"],       # Common words stay as-is
    "embeddings":  ["em", "##bed", "##ding", "##s"],
}

# The ## prefix indicates a continuation (not a word start)
# Vocabulary size: 30,522 tokens

# Why WordPiece?
# 1. Handles out-of-vocabulary words (any word can be decomposed)
# 2. Balances vocabulary size vs. sequence length
# 3. Captures morphological structure (un-, -ing, -ness, etc.)
# 4. Shared subwords across languages (multilingual BERT)
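
At inference time, WordPiece splits a word by greedy longest-match-first lookup against the vocabulary. The sketch below shows that procedure on a toy vocabulary chosen only to make the example work; `wordpiece_tokenize` is a hypothetical helper, not the library implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, chosen only to make the example work.
toy_vocab = {"the", "play", "##ing", "##ed", "em", "##bed", "##ding", "##s"}

print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```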

Part 5 - The Pre-Training + Fine-Tuning Paradigm

Why This Was Revolutionary

Before BERT, NLP practitioners had two options:

Train from scratch on task-specific data (expensive, requires lots of labeled data)
Use pre-trained word embeddings (Word2Vec, GloVe) as features (limited - only captures word-level semantics)

BERT introduced a third option that was dramatically better: pre-train a deep bidirectional model on unlabeled text, then fine-tune the entire model on task-specific labeled data.

Figure: BERT Pre-Training and Fine-Tuning Pipeline

Fine-Tuning for Different Tasks

The same pre-trained BERT model can be fine-tuned for drastically different tasks by adding a minimal task-specific layer:

Task	Input	Output Layer	Example
Sentence classification	[CLS] sentence [SEP]	Linear on [CLS] → softmax	Sentiment analysis
Sentence pair classification	[CLS] sent1 [SEP] sent2 [SEP]	Linear on [CLS] → softmax	Natural language inference
Token classification	[CLS] tokens [SEP]	Linear on each token → softmax	Named entity recognition
Question answering	[CLS] question [SEP] passage [SEP]	Linear on passage tokens → start/end	SQuAD
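
For the question answering row, the start/end outputs are turned into an answer by choosing the highest-scoring span whose end does not precede its start. A minimal sketch, with made-up logits and an assumed `max_answer_len` cap:

```python
def best_answer_span(start_logits, end_logits, max_answer_len=30):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best_score, best_span = float("-inf"), (0, 0)
    for s, s_logit in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best_span = score, (s, e)
    return best_span


# Made-up logits over six passage tokens.
passage = ["BERT", "was", "released", "by", "Google", "."]
start_logits = [0.1, 0.0, 0.2, 0.3, 2.5, -1.0]
end_logits = [0.0, 0.1, 0.3, 0.2, 2.8, 0.5]

s, e = best_answer_span(start_logits, end_logits)
print(passage[s : e + 1])  # ['Google']
```

Taking the argmax of start and end independently can yield an end before the start, which is why the search is restricted to valid spans.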

# Sketch of BERT fine-tuning on sentiment analysis (PyTorch + Hugging Face transformers)

import torch
from torch import nn
from transformers import BertModel


class BertForSentimentClassification(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Only new parameters: a linear layer on top of [CLS]
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        # Run through pre-trained BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        # Take the [CLS] token representation (position 0)
        cls_representation = outputs.last_hidden_state[:, 0, :]  # (batch, 768)

        # Classify
        return self.classifier(cls_representation)  # (batch, num_classes)


def fine_tune(model, train_loader, epochs=3, lr=2e-5):
    # Key: fine-tune ALL parameters, not just the classifier
    # Use a small learning rate to avoid catastrophic forgetting
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()

    for epoch in range(epochs):
        # Assumes each batch is a dict with input_ids, attention_mask, labels
        for batch in train_loader:
            optimizer.zero_grad()  # clear gradients from the previous step
            logits = model(batch["input_ids"], batch["attention_mask"])
            loss = nn.functional.cross_entropy(logits, batch["labels"])
            loss.backward()
            optimizer.step()

Key Fine-Tuning Details

Learning rate: 2e-5 to 5e-5 (much smaller than pre-training to avoid catastrophic forgetting)
Epochs: 2-4 (more risks overfitting on small datasets)
Batch size: 16-32
Max sequence length: 128 or 512 (task-dependent)
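
Because these ranges are small, a common approach is to try every combination and keep the best one on a development set. The sketch below only enumerates the candidates; the specific values are assumptions drawn from the ranges above, with 3e-5 added as an intermediate learning rate.

```python
from itertools import product

learning_rates = [2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
epoch_counts = [2, 3, 4]

# One dict per configuration to train and evaluate on the dev set.
grid = [
    {"lr": lr, "batch_size": bs, "epochs": ep}
    for lr, bs, ep in product(learning_rates, batch_sizes, epoch_counts)
]
print(len(grid), "configurations")
print(grid[0])
```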

Company Variation

At companies that use BERT-like models in production (Google Search, Amazon product search, financial NLP), interviewers may ask practical fine-tuning questions: "How would you fine-tune BERT on a dataset with only 500 labeled examples?" Answer: use a very low learning rate (1e-5), early stopping, data augmentation, and consider few-shot approaches or intermediate fine-tuning on a related task first.

Part 6 - Results and Ablations

GLUE Benchmark Results

BERT-Large achieved new state-of-the-art results on all 8 GLUE tasks:

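Task	Previous SOTA (pre-GPT)	BERT-Base	BERT-Large	Improvement (BERT-Large)
MNLI	80.6	84.6	86.7	+6.1
QQP	66.1	71.2	72.1	+6.0
QNLI	82.3	90.5	92.7	+10.4
SST-2	93.2	93.5	94.9	+1.7
CoLA	35.0	52.1	60.5	+25.5
STS-B	81.0	85.8	86.5	+5.5
MRPC	86.0	88.9	89.3	+3.3
RTE	61.7	66.4	70.1	+8.4

Scores are GLUE test-set results; QQP and MRPC report F1 and STS-B reports Spearman correlation. The baseline column is the best result before OpenAI GPT, which BERT also outperformed on every task.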

SQuAD Results

On the SQuAD question answering benchmark:

SQuAD 1.1: F1 = 93.2 (ensemble, with TriviaQA augmentation), surpassing human performance (91.2); the best single model reached F1 = 91.8
SQuAD 2.0: F1 = 83.1 (single model)

Ablation Studies

The BERT paper's ablation study answers several important questions:

Does bidirectionality help?

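Model	MNLI-m	MRPC	SST-2
BERT-Base (bidirectional, MLM + NSP)	84.4	86.7	92.7
Left-to-right, no NSP	82.1	77.5	92.1
Left-to-right, no NSP, + BiLSTM on top	82.1	75.7	91.6

Yes - the deeply bidirectional model beats the left-to-right model on every task, and adding a randomly initialized BiLSTM on top of the left-to-right model during fine-tuning does not close the gap. (Numbers are dev-set accuracy from the paper's ablation.) The paper also argues, without running the experiment, that concatenating separately trained left-to-right and right-to-left models (the ELMo approach) would be twice as expensive and strictly less powerful than a deeply bidirectional model.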

Does model size help?

Model	Params	MNLI-m (dev)	MRPC (dev)
BERT-Base	110M	84.4	86.7
BERT-Large	340M	86.6	87.8

Yes - more parameters help, even for relatively small downstream datasets.

Does NSP help?

The paper's ablation shows that removing NSP lowers performance on QNLI, MNLI, and SQuAD. RoBERTa later revisited this with the input format controlled: when each input is packed with contiguous full sentences rather than sentence pairs, dropping the NSP loss matches or slightly improves downstream results. The RoBERTa authors suggest the original ablation may have removed only the loss term while keeping the sentence-pair input format.

Part 7 - BERT vs GPT: The Fundamental Tradeoff

This is one of the most commonly asked comparison questions in ML interviews.

Aspect	BERT	GPT
Architecture	Transformer encoder	Transformer decoder
Directionality	Bidirectional	Unidirectional (left-to-right)
Pre-training objective	Masked language modeling	Next token prediction
Tokens predicted per batch	~15% of input tokens	100% of input tokens
Training efficiency	Less efficient (15% signal)	More efficient (100% signal)
Best for	Understanding (classification, NER, QA)	Generation (text completion, dialogue)
Can generate text?	Not naturally (can with tricks)	Yes, natively
Can classify text?	Yes, via [CLS] + fine-tuning	Yes, but bidirectional is better
Context	Full context for every token	Only left context for each token

Figure: BERT vs GPT: Understanding vs Generation Tradeoff

The Key Insight for Interviews

"BERT and GPT made opposite architectural choices on the same tradeoff: bidirectionality gives BERT richer representations for understanding tasks, while autoregressive modeling gives GPT the ability to generate text naturally. Neither is strictly better - the right choice depends on the downstream application. Modern approaches like T5 use encoder-decoder architectures that can do both, and GPT-3 showed that sufficiently large autoregressive models can also perform understanding tasks through in-context learning."

Common Trap

Do not say "BERT is better than GPT" or vice versa. They solve different problems. BERT is better for understanding tasks with fine-tuning. GPT is better for generation and, at sufficient scale, can handle understanding tasks through prompting. Interviewers want to see that you understand the tradeoff, not that you have a favorite.

Part 8 - BERT's Successors

RoBERTa (2019)

Key changes:

Removed NSP objective
Trained on 10x more data (160GB vs 16GB)
Used dynamic masking (different masks each epoch, not static)
Longer training with larger batches
Result: Significant improvements across all benchmarks while using the same architecture

Interview insight: RoBERTa proved that BERT was undertrained. The architecture was fine - the training recipe needed improvement.
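
The difference between static and dynamic masking can be sketched in a few lines. This is an illustration under assumptions: `choose_mask_positions` is a hypothetical helper, and only the selection of positions is shown, not the 80/10/10 replacement.

```python
import random


def choose_mask_positions(num_tokens, rng, mask_prob=0.15):
    """Return the sorted positions selected for prediction."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]


NUM_TOKENS, EPOCHS = 40, 3

# Static masking: positions are chosen once during preprocessing and reused.
static_positions = choose_mask_positions(NUM_TOKENS, random.Random(0))
static_epochs = [static_positions for _ in range(EPOCHS)]

# Dynamic masking: positions are re-sampled every time the sequence is served.
rng = random.Random(0)
dynamic_epochs = [choose_mask_positions(NUM_TOKENS, rng) for _ in range(EPOCHS)]

for epoch in range(EPOCHS):
    print(f"epoch {epoch}: static={static_epochs[epoch]} dynamic={dynamic_epochs[epoch]}")
```

With dynamic masking the model sees a fresh corruption of the same text on each pass, which matters more as training runs for more steps.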

ALBERT (2019)

Key changes:

Factorized embedding parameterization ( $V \times H$ → $V \times E + E \times H$ )
Cross-layer parameter sharing (all layers share the same weights)
Replaced NSP with sentence-order prediction (SOP)
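Result: ALBERT-base has about 89% fewer parameters than BERT-Base (12M vs 108M) at some cost in accuracy, while the much wider ALBERT-xxlarge (235M) outperforms BERT-Large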
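
The saving from the factorized embedding can be checked with quick arithmetic. The sizes below are illustrative assumptions (a 30,000-token vocabulary, hidden size 768, embedding size 128), and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab, hidden, embed=None):
    """Embedding-table parameters, with optional factorization through size `embed`."""
    if embed is None:
        return vocab * hidden                  # V x H
    return vocab * embed + embed * hidden      # V x E + E x H


V, H, E = 30_000, 768, 128  # illustrative sizes
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"V x H:         {full:,}")
print(f"V x E + E x H: {factorized:,}")
print(f"reduction:     {1 - factorized / full:.1%}")
```

The benefit grows with the hidden size, since only the small E x H projection scales with H.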

DeBERTa (2020)

Key changes:

Disentangled attention (separate content and position attention)
Enhanced mask decoder for pre-training
Result: A scaled-up 1.5B-parameter DeBERTa surpassed the human baseline on SuperGLUE, which its authors reported as a first for a single model

Comparison Table

Model	Year	Parameters	MNLI	Key Innovation
BERT-Large	2018	340M	86.7	Bidirectional pre-training
RoBERTa	2019	355M	90.2	Better training recipe
ALBERT-xxlarge	2019	235M	90.8	Parameter sharing
DeBERTa	2020	350M	91.1	Disentangled attention

Part 9 - How BERT Is Used in Production

Google Search

BERT was deployed in Google Search in October 2019, applied to 10% of English queries. It improved the understanding of prepositions and context-dependent queries.

Example: "2019 brazil traveler to usa need a visa" - BERT understands that "to" indicates the traveler is going TO the USA, not FROM the USA.

Practical Deployment Considerations

Challenge	Solution
Latency: BERT is slow (12 layers, 110M params)	Distillation (DistilBERT: 40% smaller, 60% faster, 97% performance)
Long documents: BERT limited to 512 tokens	Chunking, Longformer, or hierarchical approaches
Domain adaptation: General BERT may not work for medical/legal text	Domain-specific pre-training (BioBERT, LegalBERT, FinBERT)
Multilingual: BERT was primarily English	mBERT (104 languages), XLM-RoBERTa
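
For the long-document row, chunking is often done with overlapping windows. The sketch below is one simple way to do it; `chunk_tokens`, the 128-token overlap, and the two reserved slots for [CLS] and [SEP] are assumptions for this example.

```python
def chunk_tokens(tokens, max_len=512, stride=128, num_special=2):
    """Split a long token list into overlapping windows that fit the model.

    num_special reserves room for [CLS] and [SEP]; consecutive windows share
    `stride` tokens so text near a boundary appears with context on both sides.
    """
    window = max_len - num_special
    if not 0 <= stride < window:
        raise ValueError("stride must be smaller than the usable window")
    step = window - stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


doc = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])
```

Per-chunk predictions then need to be merged, for example by max or mean pooling over chunk scores for classification.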

# Example: Using BERT for classification (Hugging Face)

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# 1. Load pre-trained model and tokenizer
# Note: the classification head is newly initialized here, so predictions are
# meaningless until the model has been fine-tuned on labeled data.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.eval()

# 2. Tokenize input
inputs = tokenizer(
    "This movie was absolutely fantastic!",
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
)
# Input structure:
# {
#     'input_ids': tensor([[101, 2023, 3185, 2001, 7078, 10392, 999, 102]]),
#     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]]),
#     'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]])
# }
# 101 = [CLS], 102 = [SEP]

# 3. Forward pass (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # (1, 2)
prediction = torch.argmax(logits, dim=-1)  # 0 or 1

Part 10 - Why BERT Still Matters in 2024+

Despite the rise of GPT-style models, BERT-style models remain widely used:

Search and retrieval: Bi-encoder models for dense retrieval (based on BERT) are the backbone of modern search.
Classification at scale: When you need to classify millions of documents, a fine-tuned BERT is faster and cheaper than prompting GPT-4.
Named entity recognition: BERT-based NER remains the production standard.
Sentence embeddings: Models like SBERT (Sentence-BERT) produce high-quality sentence embeddings using BERT architecture.
Reranking: Cross-encoders (BERT with two sentences as input) are used to rerank search results.

60-Second Answer

"BERT remains relevant because many production NLP tasks are understanding tasks - classification, search, NER, information extraction - where bidirectional context and fine-tuning are more efficient and cost-effective than prompting large autoregressive models. A fine-tuned BERT model can process thousands of documents per second, while a GPT-4 API call takes hundreds of milliseconds. For high-volume, latency-sensitive applications, BERT-style models are still the right choice."

Practice Problems

Problem 1: Masking Rate Analysis

If BERT masked 50% of tokens instead of 15%, what would happen? Consider both training efficiency and model quality.

Hint

At 50% masking: (1) More prediction signal per example (potentially faster convergence per step), but (2) much less context to predict from (harder predictions, possibly too noisy), (3) worse train-test mismatch (the model sees 50% masked tokens during pre-training but 0% during fine-tuning), and (4) each token's representation is informed by only half the context. The original paper chose 15% heuristically; later work (for example Wettig et al., "Should You Mask 15% in Masked Language Modeling?") reports that higher rates such as 40% can match or beat 15% for larger models, so the optimum depends on model size and masking strategy.

Problem 2: NSP Replacement

Design a sentence-level pre-training task that is more useful than NSP but still requires no labeled data.

Hint

ALBERT's sentence-order prediction (SOP) is one answer: given two consecutive sentences, predict whether they are in the correct order. This is harder than NSP (both sentences come from the same document, so topic matching does not help) and directly tests discourse coherence understanding.
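
A minimal sketch of how SOP training pairs could be generated from unlabeled text. The helper name and the label strings "InOrder" and "Swapped" are made up for this illustration.

```python
import random


def make_sop_example(sentence_1, sentence_2, rng):
    """Given two consecutive sentences, return (a, b, label) for SOP."""
    if rng.random() < 0.5:
        return sentence_1, sentence_2, "InOrder"
    return sentence_2, sentence_1, "Swapped"  # same pair, reversed


rng = random.Random(0)
first = "She opened the door."
second = "Then she stepped inside."
for _ in range(4):
    print(make_sop_example(first, second, rng))
```

Both classes use the same two sentences, so the only signal left for the model is their order.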

Problem 3: BERT for Generation

Can BERT generate text? If so, how? What are the limitations compared to GPT-style generation?

Hint

BERT can generate text using iterative masking: start with all [MASK] tokens, predict the most confident token, fix it, and repeat. This is called "mask-predict" or "non-autoregressive generation." It is parallel (can predict multiple tokens simultaneously) but lower quality because tokens are predicted somewhat independently, without the strict left-to-right conditioning of autoregressive models.

Problem 4: Tokenization Impact

A user reports that BERT performs poorly on tweets. One issue is that the WordPiece vocabulary was built on Wikipedia and Books. How would you fix this?

Hint

Options: (1) Build a new subword vocabulary on tweet data and pre-train from scratch (BERTweet took this route, using a BPE vocabulary and the RoBERTa training recipe), (2) use a larger subword vocabulary that includes common tweet tokens, (3) pre-process tweets to expand abbreviations and emojis, or (4) continue pre-training (adaptive pre-training) on tweet data while keeping the existing vocabulary. Option 4 is usually the best cost/performance tradeoff.

Problem 5: Efficiency Comparison

Calculate the theoretical training efficiency difference between BERT (15% masking) and GPT (100% token prediction) for the same dataset size.

Hint

For each training example of length $n$ : GPT makes $n$ predictions (one per position), BERT makes $0.15n$ predictions. So GPT gets $\sim 6.7\times$ more prediction signal per forward pass. However, each BERT prediction uses bidirectional context (richer), while each GPT prediction uses only left context. The net effect: BERT typically needs more training steps to converge (the two losses are not directly comparable), but produces better representations for understanding tasks.

Interview Cheat Sheet

Question	Key Points
"What is BERT?"	Bidirectional Transformer encoder, pre-trained with MLM + NSP, fine-tuned for downstream tasks
"Why bidirectional?"	Understanding tasks need context from both sides. "bank" needs to see "river" on the right.
"Why 15% masking?"	Tradeoff: enough signal per example, enough context for prediction, manageable train-test mismatch
"Explain the 80/10/10 strategy"	Mitigates train-test mismatch. Random replacement forces representation of all tokens. Keep-original bridges to fine-tuning.
"Why was NSP removed?"	Too easy (topic matching), confounds topic/coherence, RoBERTa showed it hurts
"BERT vs GPT?"	BERT: bidirectional, better for understanding. GPT: unidirectional, better for generation. Fundamental tradeoff.
"How is BERT fine-tuned?"	Small learning rate (2e-5), 2-4 epochs, task-specific head on [CLS] or token representations
"BERT for production?"	Distillation (DistilBERT), domain pre-training, 512 token limit → chunking
"What is WordPiece?"	Subword tokenization that handles OOV words by decomposing into known subwords
"BERT's main limitation?"	512 token limit, only ~15% of tokens give training signal per pass, [MASK] train-test mismatch, not suited to text generation

Spaced Repetition Checkpoints

Day 0 (Today)

Understand MLM and the 80/10/10 strategy
Understand why bidirectional context matters
Draw BERT's architecture from memory

Day 3

Explain BERT vs GPT tradeoffs from memory
Describe the pre-training + fine-tuning paradigm
Explain why NSP was removed in RoBERTa

Day 7

Practice a 10-minute BERT presentation
Cite specific benchmark results
Explain fine-tuning for 3 different task types

Day 14

Compare BERT, RoBERTa, ALBERT, DeBERTa
Discuss production deployment challenges
Do a mock paper discussion on BERT

Day 21

Full mock interview covering BERT and Transformer
Handle surprise follow-up questions
Connect BERT to modern LLM developments

Next Steps

You now understand both the Transformer architecture and how BERT adapted its encoder for bidirectional pre-training. Next, explore the other side of the coin with Chapter 5: GPT Series - how the Transformer decoder was scaled from GPT-1's 117M parameters to GPT-4's rumored trillion+, and how the paradigm shifted from fine-tuning to in-context learning.
