BERT - Bidirectional Pre-Training That Redefined NLP
Reading time: ~40 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Data Scientist
The Real Interview Moment
You are in a Google NLP team interview. The senior researcher asks: "BERT uses masked language modeling. Why not just use a standard language model like GPT? What does bidirectionality buy you, and what does it cost?" You start to answer, and she immediately follows up: "Why does BERT mask 15% of tokens? Why not 10% or 50%? And why the 80/10/10 split for how masked tokens are handled?"
These are not trivia questions. They test whether you understand the fundamental design tradeoffs in pre-training: bidirectional context is strictly more informative than unidirectional context for understanding tasks, but the masking approach creates a train-test mismatch that must be carefully managed. Every number in the BERT paper reflects a deliberate engineering decision, and interviewers expect you to explain the reasoning.
What You Will Master
- Explain why bidirectional context matters and what it costs
- Describe the masked language modeling (MLM) objective in detail
- Explain the 80/10/10 masking strategy and its rationale
- Describe the next sentence prediction (NSP) task and why it was later removed
- Compare BERT's architecture to the original Transformer and GPT
- Explain the pre-training + fine-tuning paradigm
- Discuss WordPiece tokenization and special tokens
- Cite BERT's results on GLUE, SQuAD, and other benchmarks
- Compare BERT to its successors (RoBERTa, ALBERT, DeBERTa)
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Explain why bidirectional context helps | | | | | | ___ |
| Describe masked language modeling | | | | | | ___ |
| Explain the 80/10/10 strategy | | | | | | ___ |
| Explain next sentence prediction | | | | | | ___ |
| Draw BERT's architecture | | | | | | ___ |
| Describe WordPiece tokenization | | | | | | ___ |
| Explain pre-training + fine-tuning | | | | | | ___ |
| Compare BERT vs GPT | | | | | | ___ |
| Cite BERT's benchmark results | | | | | | ___ |
| Discuss BERT's limitations and successors | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - The Problem: Why Bidirectional?
The Limitation of Left-to-Right Language Models
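Before BERT, the dominant approach for pre-training language representations was left-to-right language modeling (as used in GPT-1), which factorizes the probability of a sequence into next-token predictions:
$$P(x_1, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})$$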
This is powerful for generation - you can sample the next word given previous words. But for understanding tasks (classification, question answering, named entity recognition), it is fundamentally limited: the representation of each token can only see what came before it.
Consider the sentence: "The bank of the river was steep."
A left-to-right model processing the word "bank" can only see "The" - it cannot see "river" which disambiguates the meaning. The representation of "bank" must be committed before the model knows whether it means a financial institution or a riverbank.
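The restriction can be made concrete with a small sketch of which positions a token is allowed to attend to. The helper name `visible_context` is an assumption introduced here for illustration; it is not taken from the BERT or GPT code.

```python
def visible_context(tokens, index, bidirectional):
    """Return the tokens that position `index` can attend to (excluding itself)."""
    if bidirectional:
        # Encoder-style attention: every other position is visible.
        return tokens[:index] + tokens[index + 1:]
    # Causal (left-to-right) attention: only earlier positions are visible.
    return tokens[:index]


tokens = ["The", "bank", "of", "the", "river", "was", "steep", "."]
bank = tokens.index("bank")

print(visible_context(tokens, bank, bidirectional=False))  # ['The']
print(visible_context(tokens, bank, bidirectional=True))
# ['The', 'of', 'the', 'river', 'was', 'steep', '.']
```

Only the bidirectional view contains "river", the word that resolves the ambiguity.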
Why Not Just Use a Bidirectional RNN?
ELMo (Peters et al., 2018) tried a different approach: train a forward LSTM and a backward LSTM independently, then concatenate their representations. But this is "shallowly bidirectional" - the forward and backward representations are computed independently and only concatenated at the end. They never attend to each other during computation.
BERT achieves "deeply bidirectional" representations by using a Transformer encoder where every layer's self-attention can attend to all positions in both directions simultaneously.
"BERT's key innovation is deeply bidirectional pre-training using the Transformer encoder. Previous language model approaches - including GPT-1 - could only condition on left context because predicting the next word requires not seeing it. BERT solves this with masked language modeling: randomly mask 15% of tokens and predict them, which allows the remaining 85% to attend in both directions. This produces richer representations for understanding tasks like classification and question answering, where context from both sides of a word matters."
Part 2 - Masked Language Modeling (MLM)
The Core Idea
Since a standard language model cannot be bidirectional (seeing the word you are trying to predict would be cheating), BERT introduces a different pre-training objective: randomly mask some input tokens and predict them.
This is essentially a fill-in-the-blank task, analogous to the Cloze task in linguistics.
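Written as a loss, the objective can be sketched as follows. The notation is an assumption introduced here for illustration: $x$ is the original sequence, $\tilde{x}$ is the corrupted input fed to the model, and $M$ is the set of positions selected for prediction.

```latex
\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log P_\theta\left(x_i \mid \tilde{x}\right)
```

Only the positions in $M$ contribute to the loss. Every other position serves purely as context, which may come from the left or the right of the target.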
The 15% Masking Rate
BERT randomly selects 15% of tokens for prediction. Why 15%?
Too low (e.g., 5%): Each training example provides very little signal. Training would need many more steps to converge because you only get gradients from 5% of positions.
Too high (e.g., 50%): Too little context remains for the model to make meaningful predictions. The task becomes more like denoising than language understanding. Also, the train-test distribution mismatch worsens (50% of tokens are masked in training, 0% at test time).
15% is a tradeoff: Enough signal per example for efficient training, enough context for meaningful predictions, and a manageable train-test mismatch.
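The tradeoff is easy to quantify for a single sequence. The sketch below is plain arithmetic on a 512-token input; `masking_budget` is a hypothetical helper name, and the numbers are expectations, not measurements.

```python
def masking_budget(seq_len, mask_rate):
    """Expected number of predicted positions and of untouched context positions."""
    predicted = seq_len * mask_rate
    context = seq_len - predicted
    return predicted, context


for rate in (0.05, 0.15, 0.50):
    predicted, context = masking_budget(512, rate)
    print(f"rate={rate:.0%}: ~{predicted:.0f} predicted, ~{context:.0f} left as context")
```

At 5% only about 26 positions produce a gradient. At 50% half the sequence is missing from the context each prediction relies on.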
The 80/10/10 Strategy
Of the 15% of tokens selected for prediction, BERT does not always replace them with [MASK]:
| Action | Percentage | Example ("my dog is hairy") | Rationale |
|---|---|---|---|
| Replace with [MASK] | 80% | "my dog is [MASK]" | Standard masking - forces prediction from context |
| Replace with random token | 10% | "my dog is apple" | Forces the model to maintain a representation of every token, since any token might be wrong |
| Keep original token | 10% | "my dog is hairy" | Reduces the train-test mismatch because the model sees real tokens during fine-tuning, not [MASK] tokens |
import random


def bert_masking(tokens, vocab, mask_prob=0.15):
    """
    Apply BERT's masking strategy.

    Returns:
        masked_tokens: tokens with masking applied
        labels: original token at masked positions, -100 elsewhere
    """
    masked_tokens = tokens.copy()
    labels = [-100] * len(tokens)  # -100 = ignore in loss
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            labels[i] = tokens[i]  # Store original for loss computation
            rand = random.random()
            if rand < 0.8:
                # 80%: replace with [MASK]
                masked_tokens[i] = "[MASK]"
            elif rand < 0.9:
                # 10%: replace with random token
                masked_tokens[i] = random.choice(vocab)
            else:
                # 10%: keep original
                pass  # masked_tokens[i] stays the same
    return masked_tokens, labels


# Example
vocab = ["the", "a", "dog", "cat", "is", "was", "big", "small", "hairy", "my"]
tokens = ["my", "dog", "is", "hairy"]

# Run multiple times to see different masking patterns
for trial in range(5):
    masked, labels = bert_masking(tokens, vocab)
    print(f"Trial {trial+1}: {masked} | Labels: {labels}")
Do not say "BERT replaces 15% of tokens with [MASK]." This is incomplete. BERT selects 15% of tokens for prediction, but only 80% of those are replaced with [MASK]. The remaining 20% are either replaced with a random token (10%) or kept as-is (10%). This nuance matters because it addresses the train-test mismatch: [MASK] never appears during fine-tuning.
Why This Matters: The Train-Test Mismatch
The [MASK] token is used during pre-training but never appears during fine-tuning or inference. If 100% of selected tokens were replaced with [MASK], the model would learn to extract information only when it sees [MASK], which is a distribution it never encounters downstream.
The 10% random replacement and 10% keep-original strategies mitigate this by teaching the model that it needs to maintain useful representations of all tokens, not just [MASK] positions.
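It helps to translate the 80/10/10 split into fractions of the whole input, since that is what the model actually sees. The sketch below is a worked calculation only; the names `SPLIT` and `share_of_all_tokens` are assumptions made for this example.

```python
SELECT_RATE = 0.15  # fraction of tokens selected for prediction

# How selected tokens are corrupted, as fractions of the selected set.
SPLIT = {"[MASK]": 0.8, "random token": 0.1, "unchanged": 0.1}


def share_of_all_tokens(select_rate, split):
    """Convert per-selection fractions into fractions of all input tokens."""
    return {action: select_rate * frac for action, frac in split.items()}


shares = share_of_all_tokens(SELECT_RATE, SPLIT)
for action, share in shares.items():
    print(f"{action}: {share:.1%} of all input tokens")
```

So roughly 12% of input tokens are `[MASK]`, 1.5% are random replacements, and 1.5% are prediction targets that look exactly like ordinary tokens.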
If asked "What is the downside of masked language modeling compared to autoregressive LM?" and you cannot answer, it is a red flag. The main downside is that BERT only predicts 15% of tokens per forward pass, while an autoregressive LM predicts all tokens. This means BERT needs roughly 1/0.15 ≈ 6.7x more data to see the same number of prediction tasks, making it less sample-efficient for pre-training.
Part 3 - Next Sentence Prediction (NSP)
The Task
BERT's second pre-training objective is next sentence prediction (NSP). Given two sentences A and B:
- 50% of the time, B is the actual next sentence after A (label: `IsNext`)
- 50% of the time, B is a random sentence from the corpus (label: `NotNext`)
The model predicts whether B follows A using the [CLS] token representation.
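A minimal sketch of how such pairs could be built from raw documents is shown below. It is a simplification under stated assumptions: negatives are drawn from a different document to keep the toy example unambiguous, and `make_nsp_pairs` is a hypothetical helper, not the original preprocessing script.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) examples from a list of documents.

    Each document is a list of sentences. Assumes at least two documents.
    """
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the true next sentence.
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: a sentence drawn from a different document.
                others = [j for j in range(len(documents)) if j != doc_idx]
                other_doc = documents[rng.choice(others)]
                examples.append((doc[i], rng.choice(other_doc), "NotNext"))
    return examples


documents = [
    ["The man went to the store.", "He bought a gallon of milk."],
    ["Penguins are flightless birds.", "They live in the Southern Hemisphere."],
]
for a, b, label in make_nsp_pairs(documents, random.Random(0)):
    print(label, "|", a, "->", b)
```

No labels are required: the document order itself supplies the supervision.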
Why NSP Was Included
The authors hypothesized that many downstream tasks (question answering, natural language inference) require understanding the relationship between two sentences, which a word-level objective like MLM does not directly capture.
Why NSP Was Later Removed
RoBERTa (Liu et al., 2019) showed that NSP does not actually help - and may hurt performance. The reasons:
- The task is too easy. Random sentences from different documents are so obviously unrelated that the model does not learn meaningful sentence relationships. It mainly learns topic matching.
- It conflates topic and coherence. Two sentences from the same document are labeled `IsNext` even if they are not truly coherent. Two sentences from different documents on the same topic are labeled `NotNext`.
- MLM alone is sufficient. The bidirectional context in MLM already captures sentence-level relationships implicitly.
| Model | NSP | MNLI | QNLI | SST-2 |
|---|---|---|---|---|
| BERT | Yes | 84.6 | 90.5 | 93.5 |
| RoBERTa | No | 87.6 | 92.8 | 94.8 |
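RoBERTa removed NSP and achieved better results across the board. Because RoBERTa also changed the training data, batch size, and training length, this table alone does not isolate the effect of NSP; it is the RoBERTa paper's controlled ablation that indicates dropping NSP matches or slightly improves downstream performance.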
Part 4 - Architecture Details
BERT's Relation to the Transformer
BERT uses only the encoder portion of the original Transformer architecture. There is no decoder - BERT is not designed for generation.
Model Sizes
| Parameter | BERT-Base | BERT-Large |
|---|---|---|
| Layers ($L$) | 12 | 24 |
| Hidden size ($H$) | 768 | 1024 |
| Attention heads ($A$) | 12 | 16 |
| Feed-forward size | 3072 | 4096 |
| Total parameters | 110M | 340M |
| Training data | 16GB (BooksCorpus + English Wikipedia) | Same |
| Training time | 4 days on 4 Cloud TPUs | 4 days on 16 Cloud TPUs |
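The parameter totals in the table can be roughly reproduced from the layer sizes. The sketch below assumes a standard encoder layout (a bias on every linear layer, LayerNorm after the embeddings and after each sub-layer, and a pooler on `[CLS]`) together with a 30,522-token vocabulary, 512 positions, and 2 segment types. `bert_param_count` is a name introduced here for illustration.

```python
def bert_param_count(layers, hidden, ffn, vocab=30522, max_pos=512, segments=2):
    """Count parameters of a BERT-style encoder (embeddings + layers + pooler)."""
    layer_norm = 2 * hidden                      # scale and bias
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden            # dense layer applied to [CLS]
    return embeddings + layers * per_layer + pooler


print(f"BERT-Base:  {bert_param_count(12, 768, 3072):,}")   # 109,482,240
print(f"BERT-Large: {bert_param_count(24, 1024, 4096):,}")  # 335,141,888
```

Under these assumptions the counts land near 109M and 335M, in line with the rounded 110M and 340M figures usually quoted. The number of attention heads does not appear because splitting the hidden size across heads does not change the parameter count.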
Input Representation
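BERT's input is the sum of three embeddings: a token (WordPiece) embedding, a segment embedding that marks whether the token belongs to sentence A or sentence B, and a learned position embedding.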
Special tokens:
- `[CLS]` - Classification token. Its final hidden state is used for sentence-level tasks. Placed at the start.
- `[SEP]` - Separator token. Marks the boundary between sentences.
- `[MASK]` - Used during MLM pre-training to indicate masked positions.
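To see how the special tokens and the three id sequences fit together, here is a minimal sketch that assembles a sentence-pair input. `build_pair_input` is a hypothetical helper, and the example tokens are assumed to be already split into WordPiece units.

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_pair_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:>2}  segment={seg}  {tok}")
```

Each of the three sequences indexes its own embedding table, and the three resulting vectors are added elementwise to form the input to the first encoder layer.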
WordPiece Tokenization
BERT uses WordPiece tokenization, which breaks rare words into subword units:
# WordPiece tokenization examples
# Illustrative splits only: the exact pieces depend on the learned vocabulary,
# and the real bert-base-uncased tokenizer may split these words differently.
examples = {
    "playing": ["play", "##ing"],
    "unhappiness": ["un", "##hap", "##pi", "##ness"],
    "transformer": ["transform", "##er"],
    "the": ["the"],  # Common words stay as-is
    "embeddings": ["em", "##bed", "##ding", "##s"],
}

# The ## prefix indicates a continuation (not a word start)
# Vocabulary size: 30,522 tokens

# Why WordPiece?
# 1. Handles out-of-vocabulary words (any word can be decomposed)
# 2. Balances vocabulary size vs. sequence length
# 3. Captures morphological structure (un-, -ing, -ness, etc.)
# 4. Shared subwords across languages (multilingual BERT)
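At inference time, WordPiece splits a word with a greedy longest-match-first scan over the vocabulary. The sketch below implements that scan for a single word. The toy vocabulary is made up for this example, and `wordpiece_tokenize` is an illustrative name, not the library function.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"em", "##bed", "##ding", "##s", "play", "##ing", "the"}
print(wordpiece_tokenize("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Because the scan always tries the longest candidate first, a word that is itself in the vocabulary is never split.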
Part 5 - The Pre-Training + Fine-Tuning Paradigm
Why This Was Revolutionary
Before BERT, NLP practitioners had two options:
- Train from scratch on task-specific data (expensive, requires lots of labeled data)
- Use pre-trained word embeddings (Word2Vec, GloVe) as features (limited - only captures word-level semantics)
BERT introduced a third option that was dramatically better: pre-train a deep bidirectional model on unlabeled text, then fine-tune the entire model on task-specific labeled data.
Fine-Tuning for Different Tasks
The same pre-trained BERT model can be fine-tuned for drastically different tasks by adding a minimal task-specific layer:
| Task | Input | Output Layer | Example |
|---|---|---|---|
| Sentence classification | [CLS] sentence [SEP] | Linear on [CLS] → softmax | Sentiment analysis |
| Sentence pair classification | [CLS] sent1 [SEP] sent2 [SEP] | Linear on [CLS] → softmax | Natural language inference |
| Token classification | [CLS] tokens [SEP] | Linear on each token → softmax | Named entity recognition |
| Question answering | [CLS] question [SEP] passage [SEP] | Linear on passage tokens → start/end | SQuAD |
# Sketch of BERT fine-tuning for sentiment analysis (PyTorch + Hugging Face transformers)
import torch
from torch import nn
from transformers import BertModel


class BertForSentimentClassification(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Only new parameter: a linear layer on top of [CLS]
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)  # 768 -> num_classes

    def forward(self, input_ids, attention_mask):
        # Run through pre-trained BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Take the [CLS] token representation (position 0)
        cls_representation = outputs.last_hidden_state[:, 0, :]  # (batch, 768)
        # Classify
        logits = self.classifier(cls_representation)  # (batch, num_classes)
        return logits

    def fine_tune(self, train_data, epochs=3, lr=2e-5):
        # Key: fine-tune ALL parameters, not just the classifier
        # Use a small learning rate to avoid catastrophic forgetting
        optimizer = torch.optim.Adam(self.parameters(), lr=lr)
        self.train()
        for epoch in range(epochs):
            # Each batch is assumed to be a dict of tensors with these keys
            for batch in train_data:
                optimizer.zero_grad()  # clear gradients from the previous step
                logits = self(batch["input_ids"], batch["attention_mask"])
                loss = nn.functional.cross_entropy(logits, batch["labels"])
                loss.backward()
                optimizer.step()
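The question-answering row of the task table works differently from the classification rows: the head produces a start logit and an end logit for every passage token, and the answer is the span with the highest combined score. A minimal sketch of that selection step follows. The logits are made-up numbers, and the `max_answer_len` cap is an assumption added for this example.

```python
def best_answer_span(start_logits, end_logits, max_answer_len=30):
    """Pick (start, end) maximizing start_logit + end_logit with start <= end."""
    best_score, best_span = float("-inf"), None
    for start, s_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best_score:
                best_score, best_span = score, (start, end)
    return best_span


# Toy logits for a 6-token passage (made-up numbers).
start_logits = [0.1, 2.5, 0.3, 0.0, 1.9, -1.0]
end_logits = [0.0, 0.2, 3.1, 0.4, 0.1, 2.0]
print(best_answer_span(start_logits, end_logits))  # (1, 2)
```

Restricting the search to `start <= end` is what rules out nonsensical spans where the highest end logit falls before the highest start logit.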
Key Fine-Tuning Details
- Learning rate: 2e-5 to 5e-5 (much smaller than pre-training to avoid catastrophic forgetting)
- Epochs: 2-4 (more risks overfitting on small datasets)
- Batch size: 16-32
- Max sequence length: 128 or 512 (task-dependent)
At companies that use BERT-like models in production (Google Search, Amazon product search, financial NLP), interviewers may ask practical fine-tuning questions: "How would you fine-tune BERT on a dataset with only 500 labeled examples?" Answer: use a very low learning rate (1e-5), early stopping, data augmentation, and consider few-shot approaches or intermediate fine-tuning on a related task first.
Part 6 - Results and Ablations
GLUE Benchmark Results
BERT-Large achieved new state-of-the-art results on all 8 GLUE tasks:
| Task | Previous SOTA | BERT-Base | BERT-Large | Improvement (BERT-Large over previous SOTA) |
|---|---|---|---|---|
| MNLI | 80.6 | 84.6 | 86.7 | +6.1 |
| QQP | 66.1 | 71.2 | 72.1 | +6.0 |
| QNLI | - | 90.5 | 92.7 | - |
| SST-2 | 93.2 | 93.5 | 94.9 | +1.7 |
| CoLA | 35.0 | 52.1 | 60.5 | +25.5 |
| STS-B | 81.0 | 85.8 | 86.5 | +5.5 |
| MRPC | 84.4 | 88.9 | 89.3 | +4.9 |
| RTE | 61.7 | 66.4 | 70.1 | +8.4 |
SQuAD Results
On the SQuAD question answering benchmark:
- SQuAD 1.1: F1 = 93.2 (ensemble with TriviaQA augmentation; 91.8 for the corresponding single model), surpassing human performance (91.2)
- SQuAD 2.0: F1 = 83.1 (single model)
Ablation Studies
The BERT paper's ablation study answers several important questions:
Does bidirectionality help?
| Model | MNLI | MRPC | SST-2 |
|---|---|---|---|
| BERT-Base (bidirectional) | 84.4 | 88.9 | 93.5 |
| Left-to-right only | 82.1 | 77.5 | 92.1 |
| Left-to-right + right-to-left (concat) | 82.1 | 81.9 | 92.2 |
Yes - deeply bidirectional is significantly better than concatenating two unidirectional models (the ELMo approach).
Does model size help?
| Model | Params | MNLI | MRPC |
|---|---|---|---|
| BERT-Base | 110M | 84.4 | 88.9 |
| BERT-Large | 340M | 86.7 | 89.3 |
Yes - more parameters help, even for relatively small downstream datasets.
Does NSP help?
The paper's ablation shows that removing NSP lowers performance on QNLI and MNLI. However, the RoBERTa authors later questioned this comparison: they suggest the original ablation may have dropped only the NSP loss while keeping the two-segment input format, so the change in objective was confounded with the input format. When RoBERTa trained on contiguous full sentences without NSP, downstream results matched or slightly exceeded the NSP setup.
Part 7 - BERT vs GPT: The Fundamental Tradeoff
This is one of the most commonly asked comparison questions in ML interviews.
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Transformer encoder | Transformer decoder |
| Directionality | Bidirectional | Unidirectional (left-to-right) |
| Pre-training objective | Masked language modeling | Next token prediction |
| Tokens predicted per batch | ~15% of input tokens | 100% of input tokens |
| Training efficiency | Less efficient (15% signal) | More efficient (100% signal) |
| Best for | Understanding (classification, NER, QA) | Generation (text completion, dialogue) |
| Can generate text? | Not naturally (can with tricks) | Yes, natively |
| Can classify text? | Yes, via [CLS] + fine-tuning | Yes, but bidirectional is better |
| Context | Full context for every token | Only left context for each token |
The Key Insight for Interviews
"BERT and GPT made opposite architectural choices on the same tradeoff: bidirectionality gives BERT richer representations for understanding tasks, while autoregressive modeling gives GPT the ability to generate text naturally. Neither is strictly better - the right choice depends on the downstream application. Modern approaches like T5 use encoder-decoder architectures that can do both, and GPT-3 showed that sufficiently large autoregressive models can also perform understanding tasks through in-context learning."
Do not say "BERT is better than GPT" or vice versa. They solve different problems. BERT is better for understanding tasks with fine-tuning. GPT is better for generation and, at sufficient scale, can handle understanding tasks through prompting. Interviewers want to see that you understand the tradeoff, not that you have a favorite.
Part 8 - BERT's Successors
RoBERTa (2019)
Key changes:
- Removed NSP objective
- Trained on 10x more data (160GB vs 16GB)
- Used dynamic masking (different masks each epoch, not static)
- Longer training with larger batches
- Result: Significant improvements across all benchmarks while using the same architecture
Interview insight: RoBERTa proved that BERT was undertrained. The architecture was fine - the training recipe needed improvement.
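The difference between static and dynamic masking can be shown with a toy sketch. This is a simplification that only tracks which positions are selected; `choose_mask_positions` is a hypothetical helper, and the sizes are arbitrary.

```python
import random


def choose_mask_positions(num_tokens, mask_prob, rng):
    """Return the sorted positions selected for prediction."""
    return [i for i in range(num_tokens) if rng.random() < mask_prob]


NUM_TOKENS, MASK_PROB, EPOCHS = 20, 0.15, 3

# Static masking: positions are chosen once during preprocessing and reused.
static_positions = choose_mask_positions(NUM_TOKENS, MASK_PROB, random.Random(0))
static_epochs = [static_positions for _ in range(EPOCHS)]

# Dynamic masking: positions are re-sampled every time the sequence is seen.
rng = random.Random(0)
dynamic_epochs = [choose_mask_positions(NUM_TOKENS, MASK_PROB, rng) for _ in range(EPOCHS)]

print("static: ", static_epochs)
print("dynamic:", dynamic_epochs)
```

With static masking the model is asked the same fill-in-the-blank questions every epoch. Dynamic masking turns each pass over the data into a fresh set of prediction targets.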
ALBERT (2019)
Key changes:
- Factorized embedding parameterization ($O(V \times H) \to O(V \times E + E \times H)$, with embedding size $E \ll H$)
- Cross-layer parameter sharing (all layers share the same weights)
- Replaced NSP with sentence-order prediction (SOP)
- Result: 89% fewer parameters with comparable performance
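The saving from factorizing the embedding matrix is simple arithmetic. In the sketch below, V is the vocabulary size, H the hidden size, and E the smaller embedding size. The specific values are illustrative assumptions, and `embedding_params` is a name made up for this example.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Embedding parameter count, optionally factorized through a smaller size E."""
    if embedding_size is None:
        return vocab_size * hidden_size  # V x H
    # V x E lookup table followed by an E x H projection
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 4096, 128  # illustrative sizes, assumed for this example
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"V x H         = {full:,}")        # 122,880,000
print(f"V x E + E x H = {factorized:,}")  # 4,364,288
print(f"reduction     = {full / factorized:.1f}x")
```

The factorization matters most when H is large, because the unfactorized table grows linearly with the hidden size while the factorized one barely changes.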
DeBERTa (2020)
Key changes:
- Disentangled attention (separate content and position attention)
- Enhanced mask decoder for pre-training
- Result: First model to surpass human performance on SuperGLUE
Comparison Table
| Model | Year | Parameters | MNLI | Key Innovation |
|---|---|---|---|---|
| BERT-Large | 2018 | 340M | 86.7 | Bidirectional pre-training |
| RoBERTa | 2019 | 355M | 90.2 | Better training recipe |
| ALBERT-xxlarge | 2019 | 235M | 90.8 | Parameter sharing |
| DeBERTa | 2020 | 350M | 91.1 | Disentangled attention |
Part 9 - How BERT Is Used in Production
Google Search
BERT was deployed in Google Search in October 2019, applied to 10% of English queries. It improved the understanding of prepositions and context-dependent queries.
Example: "2019 brazil traveler to usa need a visa" - BERT understands that "to" indicates the traveler is going TO the USA, not FROM the USA.
Practical Deployment Considerations
| Challenge | Solution |
|---|---|
| Latency: BERT is slow (12 layers, 110M params) | Distillation (DistilBERT: 40% smaller, 60% faster, 97% performance) |
| Long documents: BERT limited to 512 tokens | Chunking, Longformer, or hierarchical approaches |
| Domain adaptation: General BERT may not work for medical/legal text | Domain-specific pre-training (BioBERT, LegalBERT, FinBERT) |
| Multilingual: BERT was primarily English | mBERT (104 languages), XLM-RoBERTa |
# Example: Using BERT for classification (Hugging Face)
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# 1. Load pre-trained model and tokenizer
# Note: the classification head is newly initialized, so predictions are
# not meaningful until the model has been fine-tuned.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.eval()

# 2. Tokenize input
inputs = tokenizer(
    "This movie was absolutely fantastic!",
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128,
)
# Input structure:
# {
#   'input_ids': tensor([[101, 2023, 3185, 2001, 7078, 10392, 999, 102]]),
#   'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]]),
#   'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]])
# }
# 101 = [CLS], 102 = [SEP]

# 3. Forward pass (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # (1, 2)
prediction = torch.argmax(logits, dim=-1)  # 0 or 1
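For the long-document row in the deployment table, chunking is usually the first thing to try. The sketch below splits a token-id sequence into overlapping windows that each fit the 512-token limit. The function name, the stride of 128, and the stand-in ids are assumptions for illustration, while 101 and 102 are the `[CLS]` and `[SEP]` ids mentioned above.

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, cls_id=101, sep_id=102):
    """Split a long id sequence into overlapping windows that fit BERT's limit.

    Each window is wrapped as [CLS] ... [SEP]; consecutive windows share
    `stride` tokens so content cut at a boundary appears whole in one window.
    """
    body_len = max_len - 2  # reserve room for [CLS] and [SEP]
    step = body_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + body_len]
        chunks.append([cls_id] + window + [sep_id])
        if start + body_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


long_doc = list(range(1000, 2200))  # stand-in for 1,200 token ids
chunks = chunk_token_ids(long_doc)
print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 438]
```

Each chunk is scored independently, and the per-chunk outputs are then combined, for example by taking the maximum or the mean of the logits.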
Part 10 - Why BERT Still Matters in 2024+
Despite the rise of GPT-style models, BERT-style models remain widely used:
- Search and retrieval: Bi-encoder models for dense retrieval (based on BERT) are the backbone of modern search.
- Classification at scale: When you need to classify millions of documents, a fine-tuned BERT is faster and cheaper than prompting GPT-4.
- Named entity recognition: BERT-based NER remains the production standard.
- Sentence embeddings: Models like SBERT (Sentence-BERT) produce high-quality sentence embeddings using BERT architecture.
- Reranking: Cross-encoders (BERT with two sentences as input) are used to rerank search results.
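The retrieval items in this list rest on one idea: a bi-encoder embeds documents ahead of time, so ranking at query time reduces to a similarity computation. The sketch below uses made-up 3-dimensional vectors in place of real BERT embeddings. Every value in it is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "embeddings"; a real bi-encoder would produce these by encoding
# each text independently with a BERT-style model.
doc_vectors = {
    "visa rules for travelers": [0.9, 0.1, 0.0],
    "best pasta recipes": [0.0, 0.2, 0.9],
    "passport renewal steps": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

# Documents are embedded offline, so ranking is a cheap similarity lookup.
ranked = sorted(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]), reverse=True)
print(ranked)
```

A cross-encoder would then rerank the top few results by feeding each query and document jointly as `[CLS] query [SEP] document [SEP]`, which is more accurate but far too slow to run over the whole corpus.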
"BERT remains relevant because many production NLP tasks are understanding tasks - classification, search, NER, information extraction - where bidirectional context and fine-tuning are more efficient and cost-effective than prompting large autoregressive models. A fine-tuned BERT model can process thousands of documents per second, while a GPT-4 API call takes hundreds of milliseconds. For high-volume, latency-sensitive applications, BERT-style models are still the right choice."
Practice Problems
Problem 1: Masking Rate Analysis
If BERT masked 50% of tokens instead of 15%, what would happen? Consider both training efficiency and model quality.
Hint
At 50% masking: (1) More prediction signal per example (potentially faster convergence per step), but (2) much less context to predict from (harder predictions, possibly too noisy), (3) worse train-test mismatch (the model sees 50% masked tokens during pre-training but 0% during fine-tuning), and (4) each token's representation is informed by only half the context. The original paper settled on 15% as a workable context/signal tradeoff, though later work (e.g. Wettig et al., 2022) reports that larger models can benefit from higher rates such as 40%, so 15% is best treated as a sensible default rather than a proven optimum.
Problem 2: NSP Replacement
Design a sentence-level pre-training task that is more useful than NSP but still requires no labeled data.
Hint
ALBERT's sentence-order prediction (SOP) is one answer: given two consecutive sentences, predict whether they are in the correct order. This is harder than NSP (both sentences come from the same document, so topic matching does not help) and directly tests discourse coherence understanding.
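A sketch of how sentence-order examples could be generated from unlabeled text follows. The label names `InOrder` and `Swapped` and the helper `make_sop_examples` are assumptions made for this illustration.

```python
import random


def make_sop_examples(document, rng):
    """Build sentence-order examples from consecutive sentences of one document."""
    examples = []
    for first, second in zip(document, document[1:]):
        if rng.random() < 0.5:
            examples.append((first, second, "InOrder"))
        else:
            examples.append((second, first, "Swapped"))  # same pair, reversed
    return examples


document = [
    "She opened the door.",
    "The room was dark.",
    "She reached for the light switch.",
]
for a, b, label in make_sop_examples(document, random.Random(0)):
    print(label, "|", a, "->", b)
```

Both classes use the same two sentences, so topic overlap carries no information about the label and the model has to rely on discourse order.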
Problem 3: BERT for Generation
Can BERT generate text? If so, how? What are the limitations compared to GPT-style generation?
Hint
BERT can generate text using iterative masking: start with all [MASK] tokens, predict the most confident token, fix it, and repeat. This is called "mask-predict" or "non-autoregressive generation." It is parallel (can predict multiple tokens simultaneously) but lower quality because tokens are predicted somewhat independently, without the strict left-to-right conditioning of autoregressive models.
Problem 4: Tokenization Impact
A user reports that BERT performs poorly on tweets. One issue is that the WordPiece vocabulary was built on Wikipedia and Books. How would you fix this?
Hint
Options: (1) Build a new subword vocabulary on tweet data and pre-train from scratch (BERTweet took this route, using a BPE vocabulary and the RoBERTa training recipe), (2) use a larger subword vocabulary that includes common tweet tokens, (3) pre-process tweets to expand abbreviations and emojis, or (4) continue pre-training (adaptive pre-training) on tweet data while keeping the existing vocabulary. Option 4 is usually the best cost/performance tradeoff.
Problem 5: Efficiency Comparison
Calculate the theoretical training efficiency difference between BERT (15% masking) and GPT (100% token prediction) for the same dataset size.
Hint
For each training example of length $n$: GPT makes $n$ predictions (one per position), BERT makes about $0.15n$ predictions. So GPT gets roughly $1/0.15 \approx 6.7\times$ more prediction signal per forward pass. However, each BERT prediction uses bidirectional context (richer), while each GPT prediction uses only left context. The net effect: BERT typically needs more compute to match GPT's pre-training loss, but produces better representations for understanding tasks.
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| "What is BERT?" | Bidirectional Transformer encoder, pre-trained with MLM + NSP, fine-tuned for downstream tasks |
| "Why bidirectional?" | Understanding tasks need context from both sides. "bank" needs to see "river" on the right. |
| "Why 15% masking?" | Tradeoff: enough signal per example, enough context for prediction, manageable train-test mismatch |
| "Explain the 80/10/10 strategy" | Mitigates train-test mismatch. Random replacement forces representation of all tokens. Keep-original bridges to fine-tuning. |
| "Why was NSP removed?" | Too easy (topic matching), confounds topic/coherence, RoBERTa showed it hurts |
| "BERT vs GPT?" | BERT: bidirectional, better for understanding. GPT: unidirectional, better for generation. Fundamental tradeoff. |
| "How is BERT fine-tuned?" | Small learning rate (2e-5), 2-4 epochs, task-specific head on [CLS] or token representations |
| "BERT for production?" | Distillation (DistilBERT), domain pre-training, 512 token limit → chunking |
| "What is WordPiece?" | Subword tokenization that handles OOV words by decomposing into known subwords |
| "BERT's main limitation?" | 512 token limit, slow for generation, 15% prediction efficiency, not suitable for generative tasks |
Spaced Repetition Checkpoints
Day 0 (Today)
- Understand MLM and the 80/10/10 strategy
- Understand why bidirectional context matters
- Draw BERT's architecture from memory
Day 3
- Explain BERT vs GPT tradeoffs from memory
- Describe the pre-training + fine-tuning paradigm
- Explain why NSP was removed in RoBERTa
Day 7
- Practice a 10-minute BERT presentation
- Cite specific benchmark results
- Explain fine-tuning for 3 different task types
Day 14
- Compare BERT, RoBERTa, ALBERT, DeBERTa
- Discuss production deployment challenges
- Do a mock paper discussion on BERT
Day 21
- Full mock interview covering BERT and Transformer
- Handle surprise follow-up questions
- Connect BERT to modern LLM developments
Next Steps
You now understand both the Transformer architecture and how BERT adapted its encoder for bidirectional pre-training. Next, explore the other side of the coin with Chapter 5: GPT Series - how the Transformer decoder was scaled from GPT-1's 117M parameters to GPT-4's rumored trillion+, and how the paradigm shifted from fine-tuning to in-context learning.
