Understand how BERT learns bidirectional language representations using masked language modeling, its architecture, and how to fine-tune it for downstream tasks.

Masked Language Modeling and BERT

The Classification Problem

It is early 2019. A team at a large e-commerce company needs to build a system that can classify customer support tickets - 47 categories, millions of tickets per week. They try a convolutional network trained on labeled tickets. It works, but requires 50,000 labeled examples to perform decently. Labeling costs $2 per ticket. That is $100,000 just to train the classifier.

A new engineer on the team suggests trying something different. She has been reading about BERT - a model released by Google in October 2018 that was pretrained on the entirety of Wikipedia and BookCorpus. The model already knows, after pretraining, what words tend to appear in customer service contexts, what "refund" and "shipping" and "damaged" mean in relation to each other. Fine-tuning it requires only a linear layer on top.

She runs a quick experiment. With just 500 labeled examples and two hours of fine-tuning on a single GPU, the BERT-based classifier achieves higher accuracy than the CNN trained on 50,000 examples. The reason is simple: BERT already knows a tremendous amount about language. The task-specific fine-tuning is just the last mile.

This was not an unusual result. BERT shattered NLP benchmarks across 11 tasks when it was released - achieving state-of-the-art on GLUE, SQuAD 1.1, SQuAD 2.0, and SWAG simultaneously. The secret was not its architecture (it was basically half of the original transformer) or its size (the base model had 110M parameters, about the same as the original GPT). The secret was the training objective: masked language modeling.

Why This Exists: The Problem with Left-to-Right Models

Before BERT, the dominant approach for pretraining was left-to-right language modeling - predicting each token from only the tokens before it. ELMo (Peters et al., 2018) ran two separate LSTMs (one forward, one backward) and concatenated their representations. But these representations were concatenated, not jointly computed: each LSTM conditioned on only one side of the context.

The problem: when you are trying to understand a word in context, you need both sides simultaneously. Consider the sentence:

"The bank can guarantee deposits will eventually cover future tuition costs because it invests in a safe portfolio of government bonds."

To understand whether "bank" means a financial institution or a riverbank, you need the words that come after it ("deposits", "invests", "government bonds"). A left-to-right model encoding "bank" has seen only "The". ELMo's bidirectional LSTM saw both directions, but its forward and backward passes each built their representation from one side only, so "bank" could never attend to its left and right context jointly during representation learning.

BERT's insight: train a transformer that is fundamentally bidirectional by masking tokens and predicting them from full context. The model cannot "see" the masked token, so it must use both left and right context to predict it. This forces the model to build representations that are genuinely bidirectional.

Historical Context: The BERT Paper

BERT - Bidirectional Encoder Representations from Transformers - was published by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google AI Language in October 2018 (Devlin et al., 2018).

The paper's key contributions:

Demonstrated that a pretrained bidirectional transformer outperforms unidirectional and shallowly bidirectional pretrained models
Showed that the same pretrained model can be fine-tuned for wildly different tasks with minimal task-specific architecture changes
Established the masked language modeling objective as highly effective for learning bidirectional representations
Released BERT-Base and BERT-Large, making the pretrained weights publicly available

The timing was important. The original transformer (Vaswani et al., 2017) had just shown that attention was all you needed. GPT (Radford et al., 2018) had demonstrated that pretraining a transformer decoder with causal language modeling (CLM) transferred well to downstream tasks. BERT asked: what if instead of using half the transformer (just the decoder), we used the other half (just the encoder), and trained it with a different objective that unlocks bidirectionality?

The 15% Masking Strategy

The BERT masking procedure is more nuanced than simply replacing 15% of tokens with [MASK]. Here is the exact procedure:

For each training sequence:

Select 15% of token positions at random (excluding [CLS] and [SEP] special tokens)
For each selected position:
- 80% of the time: replace the token with [MASK]
- 10% of the time: replace with a random token from the vocabulary
- 10% of the time: keep the original token unchanged

Why not just replace all selected tokens with [MASK]?

If every selected token were replaced with [MASK], the model would learn to only produce useful representations for [MASK] tokens. At fine-tuning time, there are no [MASK] tokens - the model sees real tokens in all positions. This creates a train/test mismatch (a distribution shift between pretraining and fine-tuning).

The 10% random replacement forces the model to maintain a good representation for every token, even if it is not masked - because the model cannot know which tokens might have been randomly replaced. The 10% unchanged case forces the model to attend to the context to determine whether a token is correct or was randomly substituted.

The cost of this scheme: only 15% of positions contribute to the loss at each step, and 15% x 80% = 12% of all tokens are actually replaced with [MASK]. The model is trained to predict every selected position, but sees [MASK] at only 80% of them. Because BERT gets a training signal from just 15% of input tokens per step, it is less sample-efficient than CLM, which trains on every token.
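The selection and 80/10/10 replacement rules above can be sketched in plain Python. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, the toy vocabulary, and the choice to select each position independently with probability 0.15 (rather than picking an exact 15% of positions) are assumptions made for the example.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where no prediction is required."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; most other positions are skipped too.
        if tok in (CLS, SEP) or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Toy data (hypothetical vocabulary) to check the proportions empirically.
rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = [CLS] + [rng.choice(vocab) for _ in range(20000)] + [SEP]

corrupted, labels = mask_tokens(tokens, vocab, rng=rng)
selected = [i for i, label in enumerate(labels) if label is not None]
masked = sum(corrupted[i] == MASK for i in selected)

print(f"share of positions selected:   {len(selected) / len(tokens):.3f}")
print(f"[MASK] share of selected ones: {masked / len(selected):.3f}")
```

On a long enough sequence the two printed shares should land near 0.15 and 0.80. Their product is the 12% figure from the text.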

BERT Architecture

BERT comes in two sizes, both using the original transformer encoder (no decoder):

Parameter	BERT-Base	BERT-Large
Transformer layers	12	24
Attention heads	12	16
Hidden dimension	768	1024
Feed-forward dimension	3072	4096
Total parameters	110M	340M
Pretraining data	16GB	16GB

The architecture is pure transformer encoder - all layers are bidirectional. The model processes the full input sequence in one forward pass, with each token attending to every other token.

Special Tokens:

BERT structures its input with two special tokens:

[CLS] - prepended to every input. Its final hidden state is used as the aggregate sequence representation for classification tasks. After fine-tuning, the [CLS] representation "absorbs" information about the entire sequence.
[SEP] - separates sentence pairs and marks the end of a sequence. For single-sentence inputs, it appears at the end. For sentence-pair inputs, it separates the two sentences.

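Input representation = Token embeddings + Segment embeddings + Position embeddings. The segment embeddings distinguish Sentence A from Sentence B for next sentence prediction (NSP), BERT's second pretraining objective, and for sentence-pair tasks: every token in segment A adds one learned vector, every token in segment B adds another. All three embedding tables are learned, not handcrafted.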
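To make the input layout concrete, here is a small sketch of how the token sequence, segment ids, and position ids line up for a single sentence or a sentence pair. The helper name `build_pair_input` is hypothetical. In a real model each id indexes a learned embedding table and the three looked-up vectors are summed.

```python
def build_pair_input(tokens_a, tokens_b=None):
    """Lay out [CLS] A [SEP] (B [SEP]) with matching segment and position ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # segment A, including [CLS] and the first [SEP]
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment B, including the final [SEP]
    position_ids = list(range(len(tokens)))       # 0, 1, 2, ... one per token
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_pair_input(["is", "it", "raining"], ["yes"])
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:2d}  segment={seg}  {tok}")
```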

The WordPiece Tokenizer

BERT uses WordPiece tokenization, which handles unknown words by breaking them into subword units.

WordPiece builds a vocabulary by iteratively merging the most frequent pair of adjacent units, with a twist: it maximizes the likelihood of the training data under the language model, rather than just merging by raw frequency (that is BPE).

"unaffable" becomes ["un", "##aff", "##able"] (## prefix means continuation)
"tokenization" becomes ["token", "##ization"]
"ChatGPT" (unseen during pretraining) might become ["Chat", "##G", "##PT"]

BERT-Base uses a vocabulary of 30,522 tokens for English. The [UNK] token is rare because almost any text can be broken into known subword units.
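At tokenization time, WordPiece splits a word greedily: it repeatedly takes the longest prefix of the remaining characters that is in the vocabulary. The sketch below illustrates that longest-match-first rule with a tiny hypothetical vocabulary. The function name `wordpiece_tokenize` is an assumption, and it covers only the splitting of a single word, not vocabulary training or text normalization.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece units."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word maps to [UNK]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the examples above.
toy_vocab = {"un", "##aff", "##able", "token", "##ization"}

print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```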

How BERT is Used for Downstream Tasks

The key insight of BERT: fine-tune, do not extract features. Run the full BERT model on your task-specific input and add a minimal task-specific layer on top. The entire network is fine-tuned together.

For classification: take the final hidden state of [CLS] (a vector of size 768 for BERT-Base), pass it through a linear layer mapping to the number of classes, and compute cross-entropy loss. Fine-tune the entire network.

For token classification (NER, POS tagging): take the final hidden state of each token, pass each through a linear layer mapping to the number of entity/tag types.

For extractive QA: represent the task as predicting start and end positions of the answer span in the passage. Train two linear layers (one for start, one for end) on top of the token representations.
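For extractive QA, the two linear layers produce a start score and an end score for every token, and the predicted answer is the span that maximizes their sum with the start at or before the end. The sketch below shows only that decoding step on hand-written scores. The function name `best_span`, the `max_answer_len` cap, and the example numbers are assumptions, and a real system would also restrict candidates to passage tokens.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end), score) maximizing start_scores[start] + end_scores[end], start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Hand-written scores standing in for the outputs of the two linear layers.
tokens = ["[CLS]", "who", "wrote", "emma", "[SEP]", "jane", "austen", "wrote", "emma", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 4.0, 1.0, 0.2, 0.0, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.5, 3.5, 0.3, 0.0, 0.0]

(span_start, span_end), score = best_span(start_scores, end_scores)
print(tokens[span_start:span_end + 1], score)
```

Searching over valid pairs matters because taking the argmax of each score list independently can return an end position that comes before the start.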

RoBERTa: What BERT Got Wrong

Liu et al. (2019) published RoBERTa (Robustly optimized BERT approach), which retrained BERT with several key changes and reported substantially better results. The changes reveal where BERT's original training recipe was sub-optimal:

Remove NSP: Next Sentence Prediction was removed, and the model was trained on full contiguous sequences instead of sentence pairs. In the RoBERTa experiments this matched or slightly improved downstream performance.
Dynamic masking: BERT used static masking - the mask pattern was fixed when the data was preprocessed, so the model saw the same masks repeatedly. RoBERTa generates a new mask each time a sequence is fed to the model, so the same sentence is seen with different masks across epochs. More training signal.
Larger batch sizes: BERT used batch size 256. RoBERTa used batch sizes of 2,000 to 8,000. Larger batches with correspondingly higher learning rates produced better models.
More data and longer training: BERT trained on 16GB of text (Wikipedia + BookCorpus). RoBERTa added CC-News, OpenWebText, and Stories, totaling 160GB - roughly 10x the data - and processed far more sequences overall.
Longer sequences: RoBERTa trained only on full 512-token sequences, whereas BERT trained mostly on 128-token sequences and switched to 512 late in pretraining.

The result: RoBERTa-Base matches or exceeds BERT-Large on many tasks, despite having the same architecture as BERT-Base.

note

The lesson from RoBERTa: the original BERT was undertrained. The architecture was fine - the training recipe was not optimal. This is a common pattern in deep learning research: architectural innovations often turn out to matter less than training recipe improvements.

Code: MLM Training with HuggingFace

"""
BERT-style Masked Language Modeling training.

This shows:
1. How to prepare data for MLM
2. The DataCollatorForLanguageModeling for automatic masking
3. A full training loop
4. Fill-mask inference
"""

import torch
import math
from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset


# ---- 1. Prepare data ----
def prepare_mlm_dataset(texts, tokenizer, max_length=128):
    """Tokenize a list of texts for MLM training."""
    def tokenize(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=max_length,
            padding="max_length",
            return_special_tokens_mask=True,
        )
    dataset = Dataset.from_dict({"text": texts})
    return dataset.map(tokenize, batched=True, remove_columns=["text"])


# ---- 2. Load model and tokenizer ----
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# ---- 3. Data collator handles masking automatically ----
# Applies the 80/10/10 split internally
# Sets labels=-100 for non-masked positions
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

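# ---- 4. Training arguments ----
training_args = TrainingArguments(
    output_dir="./bert-mlm-output",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_steps=100,
    evaluation_strategy="epoch",  # renamed eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),  # mixed precision needs a CUDA GPU
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
)


# ---- 4b. Training loop via Trainer ----
def train_mlm(train_texts, eval_texts):
    """Continue MLM training on your own lists of strings; returns eval perplexity."""
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=prepare_mlm_dataset(train_texts, tokenizer),
        eval_dataset=prepare_mlm_dataset(eval_texts, tokenizer),
        data_collator=data_collator,
    )
    trainer.train()
    # exp(mean cross-entropy over masked positions) = perplexity
    return math.exp(trainer.evaluate()["eval_loss"])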

# ---- 5. Fill-mask inference ----
def fill_mask(text_with_mask, model, tokenizer, top_k=5):
    """Use trained MLM to predict masked tokens."""
    inputs = tokenizer(text_with_mask, return_tensors="pt")

    mask_token_index = (
        inputs["input_ids"] == tokenizer.mask_token_id
    ).nonzero(as_tuple=True)[1]

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    results = []
    for idx in mask_token_index:
        mask_probs = torch.softmax(logits[0, idx, :], dim=-1)  # convert logits to probabilities
        top_tokens = torch.topk(mask_probs, top_k)
        predictions = [
            {"token": tokenizer.decode([tid]).strip(), "score": score.item()}
            for score, tid in zip(top_tokens.values, top_tokens.indices)
        ]
        results.append(predictions)
    return results


# Demo
model.eval()
text = "The [MASK] sat on the mat."
preds = fill_mask(text, model, tokenizer)
print(f"Input: {text}")
for p in preds[0]:
    print(f"  '{p['token']}': {p['score']:.4f}")

Fine-tuning BERT for Classification

"""
Fine-tune BERT for text classification.
Standard approach: add linear head on [CLS], fine-tune the entire network.
"""

import torch
import torch.nn as nn
import numpy as np
from transformers import (
    BertModel,
    BertForSequenceClassification,
    BertTokenizer,
    TrainingArguments,
    Trainer,
)
from sklearn.metrics import accuracy_score, f1_score


class BertClassifier(nn.Module):
    """
    Explicit BERT classifier showing the internals.
    BertForSequenceClassification does this automatically.
    """
    def __init__(self, model_name, num_classes, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        # Map [CLS] hidden state (768-dim) to class logits
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # [CLS] is always at position 0
        cls_output = outputs.last_hidden_state[:, 0, :]  # (batch, 768)
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)             # (batch, num_classes)

        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)

        return {"loss": loss, "logits": logits}


# Using built-in (recommended in practice)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Fine-tuning hyperparameters in the ranges suggested by the BERT paper
training_args = TrainingArguments(
    output_dir="./bert-classifier",
    num_train_epochs=3,          # The paper suggests 2-4 epochs
    per_device_train_batch_size=16,
    learning_rate=2e-5,          # Try 2e-5, 3e-5, or 5e-5
    warmup_ratio=0.06,           # 6% of steps for learning rate warmup
    weight_decay=0.01,
    evaluation_strategy="epoch", # renamed eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # matches the "f1" key returned by compute_metrics
)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted"),
    }

Production Engineering Notes

Domain-Specific BERT Variants

For domain-specific applications, use a domain-pretrained BERT variant:

BioBERT (Lee et al., 2019) - pretrained on PubMed abstracts and PMC full texts
FinBERT - pretrained on financial news and SEC filings
LegalBERT - pretrained on legal documents
ClinicalBERT - pretrained on clinical notes

These often outperform general BERT on domain-specific tasks because the vocabulary distributions and semantic relationships differ from general-domain text.

Pooling Strategies for Sentence Embeddings

For sentence embeddings using BERT, the [CLS] token is a common choice but not always optimal. Alternatives:

Mean pooling: average all token representations (excluding [PAD]). Often better than [CLS] for similarity.
Max pooling: take the element-wise max across all token representations.
Layer combination: average representations from the last N layers.

SBERT (Reimers and Gurevych, 2019) showed that fine-tuning BERT with a siamese network for sentence similarity using mean pooling significantly outperforms raw [CLS] for semantic similarity.
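The mean and max pooling strategies listed above differ only in how they reduce the per-token vectors to one vector, and both must skip padding positions. The sketch below uses plain Python lists in place of tensors to show the arithmetic. The function names and the toy vectors are assumptions for illustration.

```python
def _real_tokens(token_vectors, attention_mask):
    """Keep only vectors whose attention mask is 1 (i.e. drop [PAD] positions)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return kept


def mean_pool(token_vectors, attention_mask):
    """Element-wise average over non-padding token vectors."""
    kept = _real_tokens(token_vectors, attention_mask)
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(len(kept[0]))]


def max_pool(token_vectors, attention_mask):
    """Element-wise maximum over non-padding token vectors."""
    kept = _real_tokens(token_vectors, attention_mask)
    return [max(vec[d] for vec in kept) for d in range(len(kept[0]))]


# Two real tokens plus one padding position with large values that must be ignored.
vectors = [[1.0, 4.0], [3.0, 0.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(mean_pool(vectors, mask))
print(max_pool(vectors, mask))
```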

Long Document Handling

BERT's maximum sequence length is 512 tokens. For longer documents:

Truncation: take the first 512 tokens. Works well when the answer/classification signal is near the beginning.
Sliding window: run BERT over overlapping windows of 512 tokens and aggregate predictions.
Longformer/BigBird: use sparse attention to handle 4096+ tokens.
Hierarchical: encode sentences separately, then encode sentence representations.
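The sliding-window option can be sketched as a function that cuts a long list of token ids into overlapping chunks. The name `sliding_windows` and the default sizes are assumptions: a window of 510 content tokens leaves room for [CLS] and [SEP] within the 512-token limit, and a stride of half the window gives 50% overlap. How per-window predictions are then aggregated (mean, max, voting) is a separate design choice not shown here.

```python
def sliding_windows(token_ids, window=510, stride=255):
    """Split token_ids into overlapping chunks of at most `window` tokens."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if not token_ids:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


# A stand-in "document" of 1200 token ids.
ids = list(range(1200))
chunks = sliding_windows(ids)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: tokens {chunk[0]}..{chunk[-1]} ({len(chunk)} tokens)")
```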

warning

Fine-tuning learning rate selection

BERT is sensitive to learning rate. Too high (above 1e-4) and the pretrained knowledge is destroyed - catastrophic forgetting. Too low (below 1e-6) and fine-tuning does not converge in 3 epochs. The recommended range is 1e-5 to 5e-5. Always use warmup (typically 6-10% of total steps) - starting with the full learning rate on pretrained weights causes early instability.

Common Mistakes

danger

Not unfreezing all layers during fine-tuning

A common mistake from other transfer learning contexts (like ImageNet) is to freeze the base model and only train the head initially. With BERT, this almost always underperforms. BERT fine-tuning requires updating the pretrained weights to adapt the representations to the target task. Fine-tune the entire model - the learning rate is small enough to make gentle updates rather than destroying the pretrained knowledge.

danger

Using raw BERT representations for semantic similarity

Raw BERT representations are poor sentence embeddings for semantic similarity. They were optimized for masked token prediction, not sentence-level similarity. Cosine similarity between semantically similar sentences using raw BERT embeddings is often not meaningfully higher than between unrelated sentences. Either fine-tune on a similarity task or use SBERT.

warning

Assuming BERT outputs from different layers are equivalent

Earlier BERT layers contain more syntactic features (POS, dependency structure). Later layers contain more semantic features. For token classification tasks like NER, some work finds averaging the last 4 layers beats using only the last layer. Experiment with which layer combination works best for your specific task.

tip

Dynamic padding for inference efficiency

During inference, pad each batch to the longest sequence in that batch, not the global maximum of 512. If your inputs average 80 tokens but you pad everything to 512, you process roughly 6x more tokens than necessary. Use padding="longest" in the tokenizer call during inference.

Interview Q&A

Q1: Why does BERT replace some masked tokens with random words instead of always using [MASK]?

The 80/10/10 split addresses a train/test mismatch. During fine-tuning, the model never sees [MASK] tokens - all positions have real tokens. If the model was trained to only produce useful representations at [MASK] positions, it would not learn to maintain good representations at non-masked positions. The 10% random replacement forces the model to build useful representations for every token, because any token might be wrong. The 10% unchanged case provides a similar forcing function - the model must rely on context to understand every token, not just the marked positions.

Q2: What is the difference between BERT and RoBERTa, and when would you choose RoBERTa?

RoBERTa is BERT with a better training recipe: no NSP, dynamic masking, larger batches, 10x more data, longer training. The architecture is identical. In practice, prefer RoBERTa over BERT for new projects - it outperforms BERT on most downstream benchmarks. Choose RoBERTa-Base as your default BERT-family starting point. Use domain-specific variants (BioBERT, FinBERT) when working in specialized domains where vocabulary and semantic relationships differ from general-domain text.

Q3: How does BERT handle sentence pair tasks like natural language inference?

BERT concatenates the two sentences with a [SEP] separator token: [CLS] sentence_A [SEP] sentence_B [SEP]. The model processes this concatenated sequence. Each token can attend to all tokens in both sentences (bidirectional attention), so BERT can model cross-sentence relationships directly. For NLI (classify relationship as entailment/neutral/contradiction), the [CLS] representation is passed to a 3-class linear layer. This concatenation approach is simple but requires both sentences to fit within 512 tokens combined.

Q4: What has replaced BERT as the dominant NLP architecture?

Instruction-tuned causal language models (GPT-4, Claude, LLaMA) have largely replaced BERT-family models for most NLP tasks. With few-shot or zero-shot prompting, a 7B causal LM can perform classification, NER, QA, and summarization without any fine-tuning - just by describing the task in the prompt. However, for embedding tasks (semantic search, clustering, similarity), BERT-family models packaged as sentence transformers (SBERT, E5, GTE) remain competitive because causal LMs are computationally expensive to use as encoders. The current landscape: causal LMs for generation and instruction-following, sentence transformers (BERT-family) for dense embeddings.

Q5: What is catastrophic forgetting in the context of BERT fine-tuning?

Catastrophic forgetting is when training on a new task overwrites what a model previously learned. With BERT, this risk comes from fine-tuning too aggressively - too high a learning rate or too many epochs can overwrite the pretrained representations. Mitigations include: using a small learning rate (1e-5 to 5e-5) to make gentle updates; warmup so the first steps are even smaller; fine-tuning for only about 3 epochs (more usually hurts); layer-wise learning rate decay with lower rates for earlier layers. Parameter-efficient approaches such as LoRA (Lesson 07) keep the pretrained weights frozen and train only small adapter matrices, which greatly reduces the risk of catastrophic forgetting.
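Layer-wise learning rate decay, mentioned above, assigns the full learning rate to the top encoder layer and multiplies it by a constant factor for each layer below. The sketch below only computes the per-layer rates. The function name and the decay value of 0.9 are assumptions, and in practice each rate would be attached to that layer's parameters as a separate optimizer parameter group.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.9):
    """Return one learning rate per encoder layer, lowest layer first.

    The top layer trains at base_lr; each layer below it is scaled by `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


rates = layerwise_learning_rates(base_lr=2e-5, num_layers=12, decay=0.9)
for layer in (0, 5, 11):
    print(f"layer {layer:2d}: lr = {rates[layer]:.2e}")
```

Lower layers hold the most general features, so giving them smaller updates keeps them closer to their pretrained values.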

Advanced BERT Variants

BERT spawned an entire family of encoder models, each fixing specific limitations. Understanding the landscape helps you choose the right backbone for your use case.

DeBERTa - Disentangled Attention (2021)

Microsoft's DeBERTa (He et al., 2021) introduced disentangled attention: instead of representing each token with a single vector, it uses two vectors - one for content, one for position. The attention score between token $i$ and $j$ could combine four content/position terms, of which DeBERTa keeps three:

$\text{Attention}(i,j) = \text{Content}(i) \cdot \text{Content}(j) + \text{Content}(i) \cdot \text{Position}(j) + \text{Position}(i) \cdot \text{Content}(j)$

The fourth term (position-to-position) is dropped because, with relative position embeddings, it adds little information. Keeping content and position separate lets the model weigh what a token is and where it sits relative to other tokens as distinct signals. DeBERTa models set state-of-the-art results on GLUE and SuperGLUE when they were released, outperforming BERT and RoBERTa models of comparable size.

from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification
import torch

# DeBERTa-v3 is a strong BERT-family baseline for classification
model_name = "microsoft/deberta-v3-base"
tokenizer = DebertaV2Tokenizer.from_pretrained(model_name)
model = DebertaV2ForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
)

def classify_text(text: str, label_names: list[str]) -> dict:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    predicted = probs.argmax().item()
    return {
        "label": label_names[predicted],
        "confidence": probs[predicted].item(),
        "all_probs": {label_names[i]: p.item() for i, p in enumerate(probs)},
    }

result = classify_text(
    "The product quality is excellent but shipping was very slow.",
    label_names=["negative", "positive"],
)
print(result)
# Note: the classification head is randomly initialized until you fine-tune it,
# so the label and confidence printed here are arbitrary.

ELECTRA - Replaced Token Detection (2020)

Clark et al. (2020) at Google Brain asked: what if instead of predicting the masked token, we train the model to detect which tokens were replaced by a generator?

The ELECTRA architecture has two components:

Generator: a small MLM (usually 1/4 the size of the discriminator) that fills in masked tokens - but it generates plausible replacements, not necessarily the correct ones.
Discriminator: a full-size model trained to classify every single token as "original" or "replaced."

Original:             The chef prepared the delicious meal .
Masked input:         The chef [MASK] the delicious [MASK] .
Generator output:     The chef cooked the delicious meal .   (plausible, but "cooked" differs from the original)
Discriminator labels: [original] [original] [replaced] [original] [original] [original] [original]

The key insight: the discriminator gets a training signal from every token, not just the roughly 15% that were masked. This denser signal makes ELECTRA far more compute-efficient than MLM pretraining; Clark et al. report results comparable to RoBERTa and XLNet with less than a quarter of their compute. After pretraining, discard the generator and use the discriminator as your encoder - at equal compute, its representations transfer better than BERT's.
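The discriminator's targets follow directly from comparing the generator's output with the original sequence: a position is labeled "replaced" only when the token differs, so a generator guess that happens to match the original still counts as "original". A minimal sketch of that labeling step follows. The function name `rtd_labels` and the example sentence are assumptions.

```python
def rtd_labels(original_tokens, generator_tokens):
    """Return 1 where the generator's token differs from the original, else 0."""
    if len(original_tokens) != len(generator_tokens):
        raise ValueError("sequences must have the same length")
    return [0 if orig == gen else 1 for orig, gen in zip(original_tokens, generator_tokens)]


original = ["The", "chef", "prepared", "the", "delicious", "meal", "."]
# Positions 2 and 5 were masked; the generator filled in "cooked" and "meal".
generated = ["The", "chef", "cooked", "the", "delicious", "meal", "."]

labels = rtd_labels(original, generated)
for tok, label in zip(generated, labels):
    print(f"{tok:10s} {'replaced' if label else 'original'}")
```

Every position receives a label, which is where the discriminator's denser training signal comes from.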

from transformers import ElectraTokenizer, ElectraForSequenceClassification

# ELECTRA-small (~14M parameters) approaches BERT-base quality at a fraction of the size
tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "google/electra-small-discriminator",
    num_labels=3,
)

Sentence Transformers - BERT for Dense Embeddings

Standard BERT was never designed to produce good sentence embeddings. If you average the token embeddings from the last layer (mean pooling), you get surprisingly poor results for semantic similarity tasks. The problem: BERT's training objective (MLM + NSP) does not require the pooled [CLS] representation to encode the sentence's overall meaning in a way that is similar to other semantically similar sentences.

Sentence-BERT (Reimers & Gurevych, 2019) fixed this with a siamese architecture. Two BERT encoders share weights. For a pair of sentences, each is encoded independently, mean-pooled, and then the cosine similarity is trained to match human-labeled semantic similarity scores (0–5 scale).

from sentence_transformers import SentenceTransformer
import numpy as np

# Modern sentence transformers - much better than raw BERT for embeddings
model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # Strong open-source model; check the MTEB leaderboard for current rankings

def compute_semantic_similarity(texts: list[str], queries: list[str]) -> np.ndarray:
    """Compute cosine similarity between queries and corpus texts."""
    corpus_embeddings = model.encode(texts, normalize_embeddings=True)
    query_embeddings = model.encode(queries, normalize_embeddings=True)
    # With normalized embeddings, dot product = cosine similarity
    return query_embeddings @ corpus_embeddings.T

corpus = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Python is a programming language.",
    "Machine learning requires large datasets.",
]
queries = ["Where was the cat?", "What is Python?"]

similarities = compute_semantic_similarity(corpus, queries)
for i, query in enumerate(queries):
    best_idx = similarities[i].argmax()
    print(f"Query: {query}")
    print(f"Best match: {corpus[best_idx]} (score: {similarities[i, best_idx]:.3f})")
    print()

Strong open-source embedding models (rankings shift quickly, so check the MTEB leaderboard):

BGE (BAAI/bge-large-en-v1.5): Strong MTEB scores for English, good balance of speed and quality
E5 (intfloat/e5-large-v2): Microsoft's embedding model, strong retrieval performance
GTE (thenlper/gte-large): Alibaba's English embedding model, strong retrieval performance
Nomic Embed (nomic-ai/nomic-embed-text-v1): Open-source, Apache 2.0 license, 8192 context

For RAG pipelines, use a sentence transformer from the MTEB leaderboard rather than raw BERT token embeddings - the difference in retrieval quality is usually substantial.

Production Deployment Patterns

Efficient Batch Inference

A common BERT inference bottleneck is wasted padding. When sentences in a batch have very different lengths, the short ones are padded to the length of the longest, which wastes compute.

from transformers import AutoTokenizer, AutoModel
import torch
from torch.nn.utils.rnn import pad_sequence

def batch_encode_with_dynamic_padding(
    texts: list[str],
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    batch_size: int = 64,
) -> torch.Tensor:
    """Encode texts with dynamic padding per batch - more efficient than fixed-length padding."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    all_embeddings = []

    # Sort texts by length to minimize padding waste (bucket batching)
    indexed_texts = sorted(enumerate(texts), key=lambda x: len(x[1]), reverse=True)
    sorted_texts = [t for _, t in indexed_texts]
    original_indices = [i for i, _ in indexed_texts]

    for i in range(0, len(sorted_texts), batch_size):
        batch = sorted_texts[i:i + batch_size]
        # Tokenize and pad only to the longest sequence within this batch
        encoded = tokenizer(
            batch,
            padding=True,          # Pad to max in THIS batch, not global max
            truncation=True,
            max_length=512,
            return_tensors="pt",
        )
        with torch.no_grad():
            output = model(**encoded)
        # Mean pooling with attention mask
        mask = encoded["attention_mask"].unsqueeze(-1).float()
        embeddings = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        all_embeddings.append(embeddings)

    # Reassemble in original order
    embeddings_sorted = torch.cat(all_embeddings, dim=0)
    embeddings_original = torch.empty_like(embeddings_sorted)
    for new_idx, orig_idx in enumerate(original_indices):
        embeddings_original[orig_idx] = embeddings_sorted[new_idx]

    return embeddings_original

ONNX Export for Production Speed

Converting BERT to ONNX format often gives a 2–4x inference speedup on CPU (the exact gain depends on hardware, sequence length, and batch size) and opens the door to TensorRT acceleration on GPU:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification
import numpy as np

def export_bert_to_onnx(model_name: str, save_dir: str):
    """Export fine-tuned BERT to ONNX for fast inference."""
    # Use HuggingFace Optimum for ONNX export
    model = ORTModelForSequenceClassification.from_pretrained(
        model_name,
        export=True,  # Triggers ONNX export on first load
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    print(f"ONNX model saved to {save_dir}")
    return model, tokenizer

def run_onnx_inference(save_dir: str, texts: list[str]) -> np.ndarray:
    """Run inference with the exported ONNX model."""
    model = ORTModelForSequenceClassification.from_pretrained(save_dir)
    tokenizer = AutoTokenizer.from_pretrained(save_dir)

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)
    probs = outputs.logits.softmax(dim=-1).numpy()
    return probs

tip

Choosing the right BERT-family model for your task

| Task | Recommended Model | Why |
| --- | --- | --- |
| Text classification | `microsoft/deberta-v3-base` | Among the strongest base-size encoders on GLUE |
| Fast classification (latency-critical) | `google/electra-small-discriminator` | 14M params, near BERT-base quality |
| Semantic search / RAG | `BAAI/bge-large-en-v1.5` | Strong MTEB retrieval scores |
| Multilingual tasks | `intfloat/multilingual-e5-large` | 100+ languages |
| NER (named entity recognition) | `dslim/bert-large-NER` | Fine-tuned on CoNLL-2003 |
| Question answering | `deepset/roberta-base-squad2` | Fine-tuned on SQuAD 2.0 |

Common Mistakes

danger

Using [CLS] embedding from stock BERT as a sentence embedding

The [CLS] token from stock (not sentence-transformer fine-tuned) BERT is a poor sentence embedding. Its only sentence-level training signal is next sentence prediction, which does not train [CLS] to encode overall meaning in a way that makes cosine similarity useful. If you need sentence embeddings, use a sentence transformer model (SBERT, BGE, E5). Using raw BERT [CLS] vectors for semantic search often gives results no better than BM25 (keyword matching), and sometimes worse.

warning

Forgetting that BERT's max sequence length is 512 tokens

If your documents are long (legal contracts, research papers, medical records), you must handle length. Options: (1) truncate to 512 and accept information loss - only valid if the relevant content is always at the start; (2) sliding window: encode overlapping chunks of up to 512 tokens and aggregate the per-chunk representations or predictions (max or mean pooling); (3) use a long-context model instead (Longformer with 4096 context, BigBird with 4096, or a causal LM with 32K+ context for generation tasks). Option 2 is a common choice for classification over long documents.
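
The sliding-window option can be sketched in a few lines of pure Python. This is an illustration under assumptions, not a reference implementation: `sliding_windows` and `mean_pool` are hypothetical helper names, the window of 510 leaves room for [CLS] and [SEP], and the stride of 384 is an arbitrary choice that gives 126 tokens of overlap.

```python
def sliding_windows(token_ids, window=510, stride=384):
    """Split a token-id list into overlapping windows.

    Consecutive windows overlap by (window - stride) tokens.
    """
    if stride <= 0 or stride > window:
        raise ValueError("stride must be in 1..window")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window reached the end of the document
        start += stride
    return chunks


def mean_pool(chunk_scores):
    """Average per-chunk class scores into one document-level score vector."""
    n = len(chunk_scores)
    return [sum(col) / n for col in zip(*chunk_scores)]


doc = list(range(1200))          # stand-in for 1,200 token ids
chunks = sliding_windows(doc)    # three windows covering the whole document
```

Each chunk would then be wrapped with the special tokens, run through the classifier, and the per-chunk scores combined with `mean_pool` (or a max over chunks when a single strongly matching passage should decide the label).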

warning

Using a learning rate too high for fine-tuning

BERT fine-tuning is notoriously sensitive to learning rate. With lr=1e-3 (a common default when training networks from scratch), fine-tuning typically diverges or collapses to predicting a single class, because the first updates overwrite the pretrained representations. Use a peak lr between 1e-5 and 5e-5 (the BERT paper searched over 2e-5, 3e-5, and 5e-5). Add a warmup schedule: linearly increase from 0 to the peak lr over roughly the first 6–10% of training steps, then linearly decay to 0. Without warmup, the first few batches can cause large gradient updates that destabilize the pretrained layers.
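
To make the schedule concrete, here is a small sketch of linear warmup followed by linear decay. The function name `linear_warmup_decay` and its defaults are illustrative assumptions; in a real training loop you would normally rely on the scheduler provided by your framework (for example `get_linear_schedule_with_warmup` in `transformers`).

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_ratio=0.06):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up from 0 so the first updates are small
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# With 1,000 total steps, warmup covers the first 60 steps
schedule = [linear_warmup_decay(s, 1000) for s in range(1001)]
```

The learning rate starts at 0, reaches the peak of 2e-5 at step 60, and returns to 0 at the final step.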

Key Takeaways

BERT introduced a training objective (masked language modeling) that produces richer bidirectional representations than earlier approaches such as left-to-right language models or ELMo's concatenation of separately trained directions, because each prediction requires integrating context from both directions simultaneously. This led to a step-change improvement on nearly every NLP benchmark in 2018.

The BERT family has largely been superseded by instruction-tuned causal LMs for task completion, but encoder models remain the dominant choice for embedding tasks: semantic search, clustering, anomaly detection, and retrieval-augmented generation. The sentence transformer ecosystem (BGE, E5, GTE) built on top of BERT's bidirectional representations is a core component of nearly every modern RAG pipeline deployed in production today.

When you next see a vector database, a semantic search endpoint, or an embedding API, there is a very high probability that a BERT-family encoder is sitting underneath it - quietly producing the representations that make it work.

Interview Q&A (Extended)

Q6: What is the difference between BERT and DeBERTa, and when would you choose DeBERTa?

DeBERTa (He et al., 2021) improves on BERT with two main innovations: (1) Disentangled attention - each token is represented by two vectors (content and relative position), and the attention score is the sum of content-to-content, content-to-position, and position-to-content terms; the fourth possible term, position-to-position, is dropped. This lets the model weigh what a token is separately from where it sits relative to other tokens. (2) Enhanced mask decoder - DeBERTa injects absolute position information only just before the final softmax layer for masked token prediction, keeping the transformer layers fully relative. DeBERTa-v3 additionally replaces MLM with ELECTRA-style replaced token detection. In practice, DeBERTa-v3 typically outperforms RoBERTa by 1–3 points on GLUE. Choose DeBERTa when classification quality is the primary concern and inference speed is secondary. For latency-critical production (serving thousands of requests per second), the smaller ELECTRA-small or MiniLM is often a better trade-off.
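
As a sketch in the DeBERTa paper's notation, the three retained terms can be written out explicitly. Here $Q^c, K^c, V^c$ are projections of the content vectors, $Q^r, K^r$ are projections of the shared relative-position embeddings, $\delta(i,j)$ is the clipped relative distance from token $i$ to token $j$, and $d$ is the per-head hidden size:

$$
\tilde{A}_{i,j} =
\underbrace{Q^c_i \, {K^c_j}^{\top}}_{\text{content-to-content}}
+ \underbrace{Q^c_i \, {K^r_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
+ \underbrace{K^c_j \, {Q^r_{\delta(j,i)}}^{\top}}_{\text{position-to-content}},
\qquad
H_o = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V^c
$$

The scaling factor is $\sqrt{3d}$ rather than the usual $\sqrt{d}$ because three terms are summed.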

Q7: How does BERT fine-tuning work for token classification tasks like NER?

For Named Entity Recognition with BERT, every input token needs a label (not just [CLS]). The architecture adds a linear classification head on top of BERT's last layer hidden states - a single linear layer, applied at every token position, that maps from hidden_size (768) to num_labels. BERT is then fine-tuned end-to-end on labeled token sequences. One subtlety: WordPiece tokenization splits words into subwords (for example "running" → ["run", "##ning"]). You must decide how to handle subword labels - the standard approach is to assign the label to the first subword token and use -100 (the ignore index) for continuation tokens (those starting with ##), so they do not contribute to the loss. The model learns to produce entity predictions on first subwords, which are then re-assembled to word-level predictions at inference time.
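
The first-subword labeling rule can be sketched as follows. The `word_ids` list mirrors what HuggingFace fast tokenizers return from `word_ids()`, but it is hard-coded here so the example runs without downloading a model; the tokenization shown and the label ids are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Map word-level labels onto subword tokens.

    word_ids: for each token, the index of the word it came from, or None
              for special tokens such as [CLS] and [SEP].
    word_labels: one integer label per word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token: no loss
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword: no loss
        previous_word = word_id
    return aligned


# Hypothetical tokenization of "Angela Merkel visited Paris":
# [CLS] angela mer ##kel visited paris [SEP]
word_ids = [None, 0, 1, 1, 2, 3, None]
word_labels = [1, 2, 0, 3]  # assumed label map: B-PER=1, I-PER=2, O=0, B-LOC=3
token_labels = align_labels(word_ids, word_labels)
```

Only the first subword of "Merkel" carries a label; the continuation piece and the special tokens are masked out of the loss.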

Q8: What has happened to BERT in the era of large instruction-tuned models?

BERT's role has shifted rather than disappeared. For most NLP classification tasks (sentiment analysis, topic classification, intent detection), a fine-tuned BERT-family model remains the most efficient choice - orders of magnitude cheaper to serve than a 7B+ LLM, with competitive quality on well-defined classification problems. The areas where BERT has been genuinely displaced: (1) text generation tasks where CLMs are the only option; (2) zero-shot or few-shot tasks where LLMs with in-context learning outperform fine-tuned BERT without any labeled data; (3) complex reasoning tasks (multi-hop QA, math) where larger models have fundamentally better capabilities. The clearest BERT stronghold remains dense retrieval - sentence transformers (SBERT, BGE, E5) built on BERT architecture are still the first-choice encoding layer for RAG systems, because CLMs are expensive to run as encoders for millions of documents.

Q9: Why does BERT use WordPiece tokenization and how does it handle unknown words?

WordPiece (Schuster and Nakajima, 2012) is a subword tokenization algorithm that splits rare words into subword units. Starting with individual characters, it repeatedly merges the pair that maximizes the language model likelihood on the training corpus (similar to BPE but using likelihood rather than frequency). The result: common words like "the" and "is" get their own tokens, while rare words like "uncharacteristically" are split into known subwords: ["un", "##character", "##istic", "##ally"]. The ## prefix marks continuation subwords. Benefits: (1) almost no unknown words - any word can be decomposed down to single characters if needed, so [UNK] appears only when a word contains a character that is missing from the vocabulary; (2) compact vocabulary (30,000 tokens for BERT) that covers English well; (3) morphological awareness - the model can generalize across "run"/"running"/"runs" when they share the root subword. Modern tokenizers (BPE-based: GPT-2, LLaMA, Mistral) use similar principles but without the ## prefix convention.
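
The paragraph above describes how the vocabulary is built. At tokenization time, BERT-style WordPiece applies a greedy longest-match-first split to each word. A minimal sketch of that split, using a toy vocabulary invented for this example (not BERT's real vocabulary):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]        # no piece fits: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary for illustration only
toy_vocab = {"un", "##character", "##istic", "##ally", "run", "##ning", "the"}
pieces = wordpiece_tokenize("uncharacteristically", toy_vocab)
```

With this toy vocabulary the word splits into the same four pieces quoted above, while a word with no matching pieces, such as "xyz", falls back to [UNK].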

Q10: What is the impact of pretraining corpus composition on BERT's downstream performance?

BERT was pretrained on BooksCorpus (800M words) and English Wikipedia (2,500M words). This corpus choice has several consequences: (1) The model is biased toward formal, written English - it performs better on news articles and Wikipedia than on social media or informal text; (2) Strong performance on factual Q&A because Wikipedia training provides encyclopedic knowledge; (3) Poor performance on code, mathematics, and specialized scientific notation - these were not in the pretraining data. RoBERTa expanded to CC-News, OpenWebText, and Stories (160GB total), improving general language understanding. Domain-specific BERT variants address this directly: BioBERT trains on PubMed abstracts, LegalBERT on legal documents, FinBERT on financial news. If your application domain differs significantly from Wikipedia/books, consider a domain-specific variant or continue pretraining a general BERT on your domain corpus.

Continued Pretraining on Domain Data

When a general BERT-family model underperforms on your domain, continued pretraining (also called domain-adaptive pretraining, DAPT - Gururangan et al., 2020) can close the gap without training from scratch.

from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset

def domain_adaptive_pretraining(
    base_model: str,
    domain_corpus_files: list[str],   # List of text file paths
    output_dir: str,
    mlm_probability: float = 0.15,
    num_train_epochs: int = 3,
    learning_rate: float = 5e-5,      # Lower than original pretraining
) -> str:
    """
    Continue pretraining a BERT-family model on domain-specific text.
    The model learns the vocabulary, terminology, and writing style of the domain
    while retaining its general language understanding.
    """
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForMaskedLM.from_pretrained(base_model)

    # Load domain corpus
    all_texts = []
    for filepath in domain_corpus_files:
        with open(filepath, encoding="utf-8") as f:
            texts = f.read().split("\n\n")  # Split on paragraph breaks
            all_texts.extend([t.strip() for t in texts if len(t.strip()) > 50])

    print(f"Loaded {len(all_texts)} domain text segments")

    # Tokenize
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=512,
            padding=False,  # DataCollator handles padding dynamically
        )

    dataset = Dataset.from_dict({"text": all_texts})
    tokenized = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

    # Data collator creates masked inputs dynamically
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=True,
        mlm_probability=mlm_probability,
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=32,
        learning_rate=learning_rate,
        weight_decay=0.01,
        warmup_ratio=0.06,
        lr_scheduler_type="linear",
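        bf16=True,              # Needs bf16-capable hardware; otherwise try fp16=True
        logging_steps=100,
        save_strategy="epoch",
        # Evaluation is off by default, so the argument is omitted here; its name
        # differs across transformers versions (evaluation_strategy vs eval_strategy)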
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized,
        data_collator=data_collator,
    )

    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Domain-adapted model saved to {output_dir}")
    return output_dir

The DAPT finding (Gururangan et al., 2020): continued pretraining on as little as 12,000 domain documents gives 2–5% downstream task improvement. The benefit is largest for specialized domains (biomedical, legal, computer science) where vocabulary and writing conventions differ most from general web text. After DAPT, fine-tune on your labeled task data as usual - the domain-adapted model provides a better starting point.
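
The `DataCollatorForLanguageModeling` used above draws a fresh mask every time a batch is built, so the same sentence is masked differently across epochs. A rough pure-Python sketch of the selection rule it applies (15% of tokens selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged) is shown below. The function name `mask_tokens` is hypothetical, and the ids 101, 102, and 103 are the [CLS], [SEP], and [MASK] ids of `bert-base-uncased`, used here only as example values.

```python
import random

IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence using the 80/10/10 rule."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue                  # not selected: no loss at this position
        labels[i] = tok               # the model must recover the original id
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id       # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in place
    return inputs, labels


ids = [101] + list(range(1000, 1200)) + [102]  # stand-in token ids
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30522,
                             special_ids={101, 102}, rng=random.Random(0))
```

Calling `mask_tokens` again with a different random state produces a different mask over the same text, which is what "dynamic" masking means in practice.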

