Comparing encoder-only, decoder-only, and encoder-decoder transformer architectures - when to use each and why decoder-only won.

Encoder vs Decoder vs Encoder-Decoder

Reading time: ~35 min · Interview relevance: Essential · Target roles: ML Engineer, AI Engineer, Research Engineer

The Architecture War of 2018-2020

When the transformer paper dropped in 2017, everyone agreed it was important. Nobody agreed on how to use it.

Google's first major follow-up was BERT (2018) - encoder-only, bidirectional, pretrained by masking random tokens and predicting them. It set new state-of-the-art results on 11 NLP tasks at once. The NLP community concluded: encoder-only transformers are the way to go for understanding tasks.

OpenAI's GPT (2018) took the opposite approach - decoder-only, autoregressive, trained to predict the next token. It performed well on generation tasks but was dismissed by many as less rigorous than BERT's bidirectional context. GPT-2 (2019) was impressive but still largely viewed as a curiosity. "Sure, it generates text - but BERT understands text."

Then Google released T5 (2019) - encoder-decoder, framing every NLP task as text-to-text. It argued that the full transformer architecture was most principled. Translation? Text-to-text. Classification? Text-to-text ("positive" or "negative"). Question answering? Text-to-text.

For two years, the three architectures were competitive on different benchmark clusters. Then GPT-3 (2020) changed the debate. At 175 billion parameters, the decoder-only architecture, with no task-specific fine-tuning, matched or came close to specialized fine-tuned models on many tasks via few-shot prompting alone. The community's conclusion shifted: at scale, decoder-only is sufficient for nearly everything.

Today the frontier models are decoder-only: LLaMA and Gemini are documented as such, and GPT-4 and Claude are widely understood to be, though their architectures are not fully published. But BERT-style encoders are still used extensively for embeddings, classification, and retrieval, and T5-style encoder-decoders remain a strong choice for structured sequence-to-sequence tasks. All three architectures are alive and serving different roles.

Understanding when to use each is a practical engineering skill, not just academic knowledge.

The Three Architectures

Encoder-Only: BERT and Its Descendants

Bidirectional attention: Every token attends to every other token - both past and future. When encoding "The bank near the river," the token "bank" can see "river" to its right while computing its representation.
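
To make the "bank" example concrete, here is a minimal sketch of which tokens a position may read under each attention pattern. The helper name `visible_tokens` is illustrative and not taken from any library.

```python
def visible_tokens(tokens, index, causal):
    """Return the tokens that position `index` may attend to."""
    if causal:
        return tokens[: index + 1]   # itself and everything to its left
    return list(tokens)              # bidirectional: the whole sequence


tokens = ["The", "bank", "near", "the", "river"]
bank = tokens.index("bank")

print(visible_tokens(tokens, bank, causal=False))  # includes "river"
print(visible_tokens(tokens, bank, causal=True))   # stops at "bank"
```

Under the bidirectional pattern, "river" is available when the representation of "bank" is computed; under a causal pattern it is not.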

Pretraining: Masked Language Modeling (MLM). Randomly select 15% of tokens, corrupt them (in BERT, mostly by replacing them with a [MASK] token), and train the model to predict the originals. The model must use both left and right context to fill in each position - this forces bidirectional understanding.
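
The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: it selects each position independently with probability 0.15 and then applies the 80/10/10 replacement rule described in the BERT paper. The function name `mlm_mask` and the toy vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"


def mlm_mask(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking sketch.

    Returns (corrupted_tokens, labels). labels[i] holds the original token
    at selected positions and None elsewhere (no loss is computed there).
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position not selected
        labels[i] = tok                       # model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


tokens = "the bank near the river was closed for the winter season".split()
corrupted, labels = mlm_mask(tokens, vocab=sorted(set(tokens)))
print(corrupted)
print(labels)
```

Only positions with a non-`None` label contribute to the loss, which is why MLM supervises a small fraction of each sequence.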

What it's good at:

Classification (sentiment, NLI, toxicity detection)
Named entity recognition (sequence labeling)
Extractive question answering (find the answer span)
Sentence embeddings for retrieval and semantic search
Regression tasks

What it cannot do:

Generate text - there's no autoregressive next-token prediction
Tasks requiring generation of novel sequences

Famous models:

BERT-base: 12 layers, $d_{model}=768$, 12 heads, ~110M parameters
BERT-large: 24 layers, $d_{model}=1024$, 16 heads, ~340M parameters
RoBERTa (Facebook, 2019): BERT with better training (more data, no NSP, dynamic masking)
DeBERTa (Microsoft, 2020): BERT with disentangled attention (separating content and position)
ModernBERT (2024): BERT with 8K context, Flash Attention, RoPE - state of the art encoder
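
The BERT-base and BERT-large parameter counts above can be roughly reproduced from the layer count and $d_{model}$. The sketch below assumes the published BERT configuration values that are not stated in this lesson: a 30,522-token WordPiece vocabulary, 512 learned positions, 2 segment types, a feed-forward width of $4 \times d_{model}$, and a pooler layer with no task head. The function name is illustrative.

```python
def bert_param_count(layers, d_model, vocab=30522, max_pos=512, type_vocab=2):
    """Count parameters of a BERT encoder with a pooler and no task head."""
    d_ff = 4 * d_model                              # assumed feed-forward width
    layer_norm = 2 * d_model                        # scale + bias
    embeddings = (vocab + max_pos + type_vocab) * d_model + layer_norm
    attention = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    per_layer = attention + ffn + 2 * layer_norm
    pooler = d_model * d_model + d_model
    return embeddings + layers * per_layer + pooler


for name, layers, d_model in [("BERT-base", 12, 768), ("BERT-large", 24, 1024)]:
    print(f"{name}: {bert_param_count(layers, d_model) / 1e6:.1f}M parameters")
```

Under these assumptions the totals come out near 109M and 335M, in line with the rounded figures quoted in the list.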

Still used today: Embedding models (for example, those in the sentence-transformers library), retrieval-augmented generation (the retriever half), zero-shot classification, NER. Despite GPT's dominance for generation, encoder-only models are often the right choice for tasks requiring dense embeddings or classification - they're faster and cheaper.

Decoder-Only: GPT and the Modern LLM

Causal (unidirectional) attention: Each token can only attend to previous tokens, never future ones. This is enforced by a causal mask - a lower-triangular boolean matrix.

Why causal masking? Decoder-only models are trained to predict the next token given all previous tokens: $P(x_t \mid x_1, x_2, \ldots, x_{t-1})$. This is autoregressive generation. If the model could see future tokens during training, it would simply copy them - no learning would happen.

Pretraining: Next-token prediction (Language Modeling). The model processes the full training sequence and predicts each token given all prior tokens. This is more data-efficient than MLM: every position supplies a prediction target, versus only the masked 15% under MLM.
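
As a small illustration of that objective, the sketch below turns one sequence into its (context, target) training pairs and compares the number of supervised positions with MLM for a 512-token sequence. The function name `next_token_pairs` is an assumption made for this example.

```python
def next_token_pairs(tokens):
    """Every position t >= 1 supplies one training example: prefix -> token t."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]


tokens = ["The", "bank", "near", "the", "river"]
for context, target in next_token_pairs(tokens):
    print(f"{' '.join(context):<20} -> {target}")

# Supervised positions for one sequence of length n:
n = 512
causal_lm_targets = n - 1        # every token except the first
mlm_targets = round(0.15 * n)    # only the masked ~15%
print(causal_lm_targets, mlm_targets)
```

In a real model all of these pairs are computed in one forward pass; the causal mask is what keeps each prediction from seeing its own target.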

The causal mask:

For a sequence of length $n$, the mask is a lower-triangular matrix where position $(i, j)$ is True if $j \leq i$ (token $i$ can attend to token $j$):

Position:  0  1  2  3
Token 0:   T  F  F  F   (can only see itself)
Token 1:   T  T  F  F   (sees tokens 0 and 1)
Token 2:   T  T  T  F   (sees tokens 0, 1, 2)
Token 3:   T  T  T  T   (sees tokens 0, 1, 2, 3)

Famous models:

GPT-2: 12-48 layers, 117M to 1.5B parameters
GPT-3: 96 layers, $d_{model}=12288$, 96 heads, 175B parameters
LLaMA-2: 32 to 80 layers, 7B to 70B parameters
Claude (Anthropic): architecture not published; generally assumed decoder-only
Gemini (Google): decoder-only
GPT-4 (OpenAI): reportedly MoE decoder-only
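
The GPT-3 figures in this list can be sanity-checked with a common rule of thumb: each decoder layer holds about $12 \cdot d_{model}^2$ weights (roughly $4d^2$ for the attention projections and $8d^2$ for a feed-forward block of width $4d$). The sketch below is only that approximation; it assumes the 4x feed-forward width and ignores embeddings, biases, and layer norms.

```python
def approx_decoder_params(layers, d_model):
    """Approximate non-embedding parameters: ~12 * d_model^2 per layer."""
    return 12 * layers * d_model ** 2


gpt3 = approx_decoder_params(layers=96, d_model=12288)
print(f"{gpt3 / 1e9:.1f}B non-embedding parameters")
```

The estimate lands just under 174B, consistent with the quoted 175B once embeddings are added.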

Why decoder-only won: At sufficient scale, the autoregressive training objective is rich enough to learn a very wide range of tasks. Give the model the task description in the prompt, and generation produces the answer. No separate encoder is needed: the decoder processes all prompt tokens in one parallel pass (still under the causal mask, so each prompt token sees only the tokens before it), then generates autoregressively.

The Causal Mask: Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math


def create_causal_mask(seq_len: int, device: torch.device = None) -> torch.Tensor:
    """
    Create a causal attention mask.
    Returns: (1, 1, seq_len, seq_len) boolean tensor.
    True = position is allowed to attend to.

    Token i can attend to token j iff j <= i.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
    return mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq, seq) for broadcasting


def causal_self_attention(
    x: torch.Tensor,
    W_qkv: torch.Tensor,
    W_o: torch.Tensor,
    num_heads: int,
) -> torch.Tensor:
    """
    Causal self-attention for decoder-only models.
    Every position can only attend to previous positions.
    """
    batch, seq, d_model = x.shape
    d_k = d_model // num_heads

    # Project to Q, K, V in one shot (efficient)
    qkv = x @ W_qkv  # (batch, seq, 3*d_model)
    Q, K, V = qkv.chunk(3, dim=-1)  # each (batch, seq, d_model)

    # Reshape for multi-head: (batch, heads, seq, d_k)
    def split_heads(t):
        return t.view(batch, seq, num_heads, d_k).transpose(1, 2)

    Q, K, V = split_heads(Q), split_heads(K), split_heads(V)

    # Scaled dot-product scores
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, heads, seq, seq)

    # Apply causal mask: -inf for future positions
    causal_mask = create_causal_mask(seq, device=x.device)  # (1, 1, seq, seq)
    scores = scores.masked_fill(~causal_mask, float('-inf'))

    # Softmax + weighted sum
    attn = F.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)  # Guards fully masked rows (NaN after softmax); a pure causal mask never produces one
    context = attn @ V  # (batch, heads, seq, d_k)

    # Concatenate heads
    context = context.transpose(1, 2).contiguous().view(batch, seq, d_model)
    return context @ W_o


# Show the causal mask visually
mask = create_causal_mask(6)
print("Causal mask (True = can attend):")
print(mask[0, 0].int())
# tensor([[1, 0, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1]], dtype=torch.int32)

Encoder-Decoder: T5 and Structured Generation

Structure: A full encoder stack + a full decoder stack, connected by cross-attention.

How it works:

The encoder reads the input (source) with bidirectional attention - full context
The encoder produces a sequence of contextualized representations, one per input token
The decoder generates output autoregressively: each decoder layer applies causal self-attention over the tokens generated so far, then cross-attention over the encoder output
Cross-attention: decoder queries, encoder keys and values - "what part of the source should I look at while generating this output token?"

Pretraining (T5): "Span corruption" - mask random spans of text (not individual tokens) and train the model to output the removed spans. This trains both the encoder (to understand masked input) and decoder (to generate the missing spans).
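
A minimal sketch of the span-corruption format follows. It uses the example sentence from the T5 paper and the `<extra_id_N>` sentinel naming found in the Hugging Face T5 tokenizer. Unlike real T5 pretraining, the spans here are passed in by hand rather than sampled at random, and the function name `span_corrupt` is illustrative.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; return (input, target).

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    """
    inp, tgt, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp += tokens[cursor:start] + [sentinel]   # keep text, drop the span
        tgt += [sentinel] + tokens[start:end]      # target restores the span
        cursor = end
    inp += tokens[cursor:]
    tgt.append(f"<extra_id_{len(spans)}>")         # final sentinel closes the target
    return inp, tgt


tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inp))  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(tgt))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder reads the first string with bidirectional attention and the decoder is trained to emit the second, so the target stays short even for long inputs.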

What it's good at:

Machine translation (the original application)
Summarization (source: article, target: summary)
Data-to-text (source: structured data, target: description)
Document Q&A (source: document + question, target: answer)
Tasks with a clear input sequence and a clear output sequence

Famous models:

T5 (Google, 2019): 60M to 11B parameters
BART (Facebook, 2019): Encoder-decoder with denoising pretraining
mT5: Multilingual T5
Flan-T5: T5 fine-tuned on instruction following

When to use encoder-decoder vs decoder-only: For production applications, decoder-only with prompting is almost always easier to deploy (one model, one inference path). Encoder-decoder is justified when: (1) you have a very structured source-to-target task, (2) you need the most efficient use of parameters for seq2seq, or (3) your task benefits from clear separation of understanding (encoder) and generation (decoder).

Cross-Attention: The Bridge in Encoder-Decoder

Cross-attention in the decoder is where the "translation" happens. It uses:

Queries (Q): from the decoder (what does the decoder currently need?)
Keys (K) and Values (V): from the encoder output (what did the source say?)

class CrossAttention(nn.Module):
    """
    Cross-attention: decoder attends to encoder output.
    Used in encoder-decoder models (T5, BART, original transformer).
    """

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads

        # Note: Q comes from decoder, K and V from encoder
        self.W_q = nn.Linear(d_model, d_model, bias=False)  # decoder queries
        self.W_k = nn.Linear(d_model, d_model, bias=False)  # encoder keys
        self.W_v = nn.Linear(d_model, d_model, bias=False)  # encoder values
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(
        self,
        decoder_hidden: torch.Tensor,    # (batch, seq_dec, d_model)
        encoder_output: torch.Tensor,    # (batch, seq_enc, d_model)
        encoder_mask: torch.Tensor = None,  # (batch, 1, 1, seq_enc)
    ) -> torch.Tensor:
        batch = decoder_hidden.size(0)

        def split_heads(x):
            seq = x.size(1)
            return x.view(batch, seq, self.num_heads, self.d_k).transpose(1, 2)

        Q = split_heads(self.W_q(decoder_hidden))    # (batch, h, seq_dec, d_k)
        K = split_heads(self.W_k(encoder_output))    # (batch, h, seq_enc, d_k)
        V = split_heads(self.W_v(encoder_output))    # (batch, h, seq_enc, d_k)

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)

        if encoder_mask is not None:
            scores = scores.masked_fill(encoder_mask == 0, float('-inf'))

        attn = F.softmax(scores, dim=-1)
        attn = torch.nan_to_num(attn)

        context = attn @ V  # (batch, h, seq_dec, d_k)
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch, -1, self.num_heads * self.d_k)
        return self.W_o(context)  # (batch, seq_dec, d_model)


# Test cross-attention
batch, seq_enc, seq_dec, d_model, heads = 2, 12, 8, 64, 8

cross_attn = CrossAttention(d_model, heads)
encoder_out = torch.randn(batch, seq_enc, d_model)
decoder_hidden = torch.randn(batch, seq_dec, d_model)

output = cross_attn(decoder_hidden, encoder_out)
print(f"Cross-attention output: {output.shape}")  # torch.Size([2, 8, 64])
print(f"Source seq len: {seq_enc}, Target seq len: {seq_dec}")

Architecture Decision Guide

Architecture Comparison Table

| Feature | Encoder-Only | Decoder-Only | Encoder-Decoder |
| --- | --- | --- | --- |
| Attention type | Bidirectional | Causal (unidirectional) | Bidirectional in encoder, causal in decoder |
| Pretraining task | MLM (masked tokens) | Next-token prediction | Span corruption / denoising |
| Output type | Representations | Tokens (autoregressive) | Tokens (autoregressive) |
| Use case | Understanding, classification | Generation, completion | Sequence-to-sequence |
| Example models | BERT, RoBERTa, DeBERTa | GPT-3/4, LLaMA, Claude | T5, BART |
| Inference speed | Fast (no KV cache needed) | Needs KV cache | Needs KV cache |
| Parameter efficiency | High (for understanding tasks) | Medium | High (for seq2seq) |
| Scales to LLM size? | Less common | Yes (175B+) | Yes (11B+) |

Why Decoder-Only Won at Scale

Three reasons decoder-only architecture dominates frontier models:

1. Denser training signal: Next-token prediction produces a loss term at every position, so 100% of training tokens are supervised. MLM supervises only the roughly 15% that are masked. At the scale of 1T tokens, this efficiency gap compounds dramatically.

2. Prompting subsumes all tasks: A decoder-only model with a large context window can simulate any task via prompting:

Classification: "Sentiment of this text: [text]. Answer: [positive/negative]" → generate the label
Translation: "Translate to French: [English text]\nFrench:" → generate the translation
Extraction: "Extract named entities from: [text]\nEntities:" → generate the list

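Encoder-only models cannot generate at all. Encoder-decoder models can be prompted in the same way (T5's text-to-text framing does exactly this), but every example has to be split into a source and a target, which fits open-ended dialogue and long continuations less naturally.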
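
The prompt patterns listed above can be written as plain templates. This is only a sketch of the idea: the `PROMPTS` table and `build_prompt` helper are hypothetical names, and the resulting string would be sent to whatever decoder-only model is in use.

```python
PROMPTS = {
    "classification": "Sentiment of this text: {text}\nAnswer:",
    "translation": "Translate to French: {text}\nFrench:",
    "extraction": "Extract named entities from: {text}\nEntities:",
}


def build_prompt(task, text):
    """Render a task as a text prefix; the model's continuation is the answer."""
    return PROMPTS[task].format(text=text)


print(build_prompt("translation", "The bank is closed."))
```

One model and one inference path serve all three tasks; only the prefix changes.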

3. KV caching enables efficient generation: The causal attention pattern means that during autoregressive generation, you only need to compute attention for the new token against cached keys/values. The decoder-only architecture is naturally optimized for this.
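
The following sketch shows why the cache is safe: under causal attention, processing tokens one at a time against a growing key/value cache gives exactly the same outputs as recomputing full causal attention. It is a single-head toy in plain Python with the Q, K, V projections assumed already applied; all names are illustrative.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                                   # for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def full_causal_attention(qs, ks, vs):
    """Recompute every position from scratch: position t sees 0..t."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


def cached_generation(qs, ks, vs):
    """Process one token at a time, appending its K/V to a growing cache."""
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)          # earlier cache entries are never recomputed
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs


# Toy per-token Q, K, V vectors.
qs = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.9], [0.7, 0.7]]
ks = [[0.2, 0.1], [-0.3, 0.8], [0.6, 0.4], [0.0, -0.5]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]

full = full_causal_attention(qs, ks, vs)
cached = cached_generation(qs, ks, vs)
print(full == cached)
```

With bidirectional attention the earlier outputs would change whenever a token is appended, so nothing could be cached.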

Where encoder-only wins: Tasks that require dense embeddings or classification at high throughput. Embedding models (used in RAG, semantic search) are still very often encoder-only, because bidirectional context suits similarity tasks and small encoders are cheaper to run. Decoder-based embedders exist but need contrastive fine-tuning.

Production Engineering Notes

Decoder-Only for Classification

A common question: should you use BERT or GPT-4 for text classification?

BERT-class models: 110M-340M parameters, specialized for classification, fast, cheap
GPT-class models: 7B-70B parameters, general purpose, slower, more expensive

For high-volume classification: use BERT. Fine-tuned BERT often exceeds 90% accuracy on standard benchmarks such as binary sentiment classification (SST-2), for a tiny fraction of the inference cost.

For complex, nuanced classification where accuracy matters more than cost (medical diagnosis, legal classification): decoder-only with few-shot prompting may win - but evaluate carefully.
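
To put a rough number on the cost gap, the sketch below uses the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token. It ignores the attention term that grows with sequence length and any tokens the larger model generates, and the 256-token document length is a hypothetical value.

```python
def forward_flops(params, tokens):
    """Rule of thumb: about 2 FLOPs per parameter per processed token."""
    return 2 * params * tokens


tokens_per_doc = 256                                  # hypothetical average length
bert_base = forward_flops(110e6, tokens_per_doc)      # BERT-base, ~110M parameters
llm_7b = forward_flops(7e9, tokens_per_doc)           # smallest GPT-class size above
print(f"7B model: ~{llm_7b / bert_base:.0f}x the compute per document")
```

Even the smallest model in the GPT-class range costs tens of times more per document, before counting generated tokens.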

Encoders and Decoders in RAG

In RAG (Retrieval-Augmented Generation), the retriever and generator can use different architectures:

Retriever: Encoder-only (bi-encoder for fast retrieval - encode query and document separately)
Generator: Decoder-only (for flexible generation with retrieved context)

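Using a decoder-only model as a retriever is possible in two ways. It can be fine-tuned contrastively and its hidden states pooled into an embedding (as E5-mistral does), which keeps the fast bi-encoder setup. Or it can act as a cross-encoder that scores the concatenated query and document, which is slower because no document representation can be precomputed.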
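
The bi-encoder pattern can be sketched without any model at all. In the example below, `embed` is a toy stand-in (a hashed bag-of-words vector) for a trained encoder, so the similarity scores are only illustrative; what matters is the shape of the pipeline: documents are embedded once offline, and only the query is embedded at request time.

```python
import math


def embed(text, dim=64):
    """Toy stand-in for an encoder: hashed bag-of-words, L2-normalised.
    A real system would call a trained embedding model here."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0   # deterministic toy hash
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


docs = [
    "the river bank flooded after heavy rain",
    "the central bank raised interest rates",
    "transformers use attention instead of recurrence",
]
doc_index = [embed(d) for d in docs]      # computed once, offline


def retrieve(query, k=1):
    """Embed the query and return the k documents with the highest dot product."""
    q = embed(query)                      # only the query is encoded online
    ranked = sorted(range(len(docs)), key=lambda i: dot(q, doc_index[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]


print(retrieve("attention in transformers"))
```

A cross-encoder would instead run the model once per (query, document) pair, which is why it is usually reserved for re-ranking a short candidate list.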

Common Mistakes

:::danger Using decoder-only for dense embeddings

GPT-style decoder-only models produce anisotropic embeddings (all token representations cluster in a narrow cone of the embedding space). They are poor at producing similarity-preserving embeddings without special fine-tuning (contrastive learning). If you need sentence embeddings for semantic search, use a BERT-class model or a specifically fine-tuned decoder model (e.g., E5-mistral).

:::
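
One simple way to check for the "narrow cone" problem is to measure the average cosine similarity between embeddings of unrelated inputs. The sketch below uses synthetic vectors rather than real model outputs: random Gaussian vectors play the well-spread case, and the same vectors with a shared offset simulate a cone. The function name is illustrative.

```python
import math
import random


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs.
    Near 0: directions are spread out. Near 1: they crowd into a narrow cone."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


rng = random.Random(0)
isotropic = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(50)]
# Simulate a narrow cone by adding the same large offset to every vector.
cone = [[x + 5.0 for x in vec] for vec in isotropic]

print(f"isotropic: {mean_pairwise_cosine(isotropic):.2f}")
print(f"cone:      {mean_pairwise_cosine(cone):.2f}")
```

If unrelated sentences already score close to 1, cosine similarity has little room left to separate relevant from irrelevant documents.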

:::warning Bidirectional attention for generation tasks

If you accidentally use bidirectional attention (no causal mask) in a generation model, the model can "cheat" during training by looking at future tokens. Training loss will be very low (it's an easy task), but generation will fail - the model has no autoregressive capability since it was never trained for it. Always verify the causal mask is applied in decoder self-attention layers.

:::
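
A cheap way to verify the mask is a perturbation test: change the last token and confirm that outputs at earlier positions do not move. The sketch below applies the idea to a toy single-head attention in plain Python (with Q = K = V and no projections); with a real model the same check would be run on its logits. All names are illustrative.

```python
import math


def self_attention(xs, causal):
    """Single-head attention with Q = K = V = xs (projections omitted for brevity)."""
    out = []
    for t, q in enumerate(xs):
        keys = xs[: t + 1] if causal else xs     # causal: only positions 0..t
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) / total
                    for d in range(len(q))])
    return out


def leaks_future(attention_fn, xs):
    """Perturb the last token; any change at earlier positions means a leak."""
    perturbed = [list(x) for x in xs]
    perturbed[-1] = [v + 10.0 for v in perturbed[-1]]
    before, after = attention_fn(xs), attention_fn(perturbed)
    return any(b != a for b, a in zip(before[:-1], after[:-1]))


xs = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.8], [0.1, 0.1]]
print(leaks_future(lambda x: self_attention(x, causal=True), xs))    # False
print(leaks_future(lambda x: self_attention(x, causal=False), xs))   # True
```

A test like this catches a missing or mis-broadcast mask before a long training run produces a suspiciously low loss.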

:::tip Prefix-LM: the hybrid approach

Some models use a "prefix-LM" approach: bidirectional attention over the prompt (input), causal attention over the generated response. This gives the model full context over the input and autoregressive generation for the output. The T5 paper evaluates it as an alternative to the encoder-decoder, and UniLM uses the same masking pattern. It's a practical hybrid that can outperform pure decoder-only on tasks with long, structured inputs.

:::
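
The prefix-LM pattern is a small change to the causal mask: every position may see the whole prefix, and positions after the prefix remain causal. A minimal sketch, with an illustrative function name:

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in prefix_lm_mask(seq_len=5, prefix_len=3):
    print(" ".join("T" if allowed else "F" for allowed in row))
# T T T F F
# T T T F F
# T T T F F
# T T T T F
# T T T T T
```

Setting `prefix_len=0` recovers the ordinary causal mask, and `prefix_len=seq_len` gives fully bidirectional attention.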

Interview Q&A

Q1: What is the difference between encoder-only and decoder-only transformers? What is each good for?

Answer: The fundamental difference is the attention pattern:

Encoder-only uses bidirectional (full) self-attention - every token can attend to every other token in the sequence. This gives each token's representation full context from both left and right. Excellent for understanding tasks: classification, embeddings, NER, extractive QA.

Decoder-only uses causal (unidirectional) self-attention - each token can only attend to previous tokens. This enables autoregressive generation: given tokens $1, \ldots, t-1$, predict token $t$. Excellent for generation: text completion, chatbots, code generation, reasoning.

Key practical differences:

Encoder-only cannot generate text (no autoregressive mechanism)
Decoder-only can technically do both via prompting, but without contrastive fine-tuning it produces lower-quality embeddings for retrieval
Encoder-only inference is faster (no KV cache needed, single forward pass)
Decoder-only training is more data-efficient (uses all tokens, not 15%)

Q2: Why does decoder-only architecture dominate large language models today?

Answer: Three converging factors:

Training efficiency: Next-token prediction trains on 100% of tokens. MLM (BERT-style) trains on 15%. Over 1T+ training tokens, this gap is enormous.
Versatility at scale: A large decoder-only model can perform a very wide range of tasks via prompting - no architecture-specific fine-tuning needed. GPT-3 demonstrated that at 175B parameters, few-shot prompting can match or approach task-specific fine-tuned models on many benchmarks.
Inference optimization: Autoregressive generation with KV caching is well-understood and efficiently optimized. The causal attention pattern maps cleanly to incremental generation - each new token just attends to the cached previous K, V tensors.

The encoder-decoder's separate source/target processing made sense for translation-specific tasks but became a disadvantage when the goal shifted to general-purpose instruction following. The decoder-only model is simpler, more unified, and scales better.

Q3: Explain cross-attention in an encoder-decoder model. How is it different from self-attention?

Answer: Cross-attention is mechanically identical to self-attention, but the source of Q, K, V is different:

Self-attention: Q, K, V all come from the same sequence. "How should each token in sequence A relate to other tokens in A?"
Cross-attention: Q comes from the decoder (target sequence), K and V come from the encoder output (source sequence). "How should each decoder token relate to source tokens?"

In the decoder's cross-attention:

The decoder's current hidden state becomes the query - "what am I currently generating and what do I need from the source?"
The encoder's output provides keys - "which source positions are relevant?"
The encoder's output provides values - "what information should I retrieve from those positions?"

For translation: when generating the French word for "bank", the decoder's query attends to "bank" and its surrounding context (for example, "river") in the English source encoding, which lets it choose between "banque" and "rive". The attention weights act as an alignment mechanism - the model learns which source tokens to look at for each target token.

Q4: A startup asks you to build a Q&A system over a 1 million document corpus. Which architecture would you choose for the retriever and which for the generator?

Answer: Classic RAG (Retrieval-Augmented Generation) architecture:

Retriever: Encoder-only bi-encoder (e.g., sentence-transformers, E5, GTE)

Reasoning: You need to embed 1 million documents into vectors and then embed each query to find the most similar documents by dot product. This requires dense, similarity-preserving embeddings. Encoder-only models (BERT-class) produce better-calibrated embeddings for this purpose. You'd pre-compute and index all document embeddings (using FAISS, pgvector, etc.) - retrieval at query time is a fast vector similarity search.

Generator: Decoder-only (e.g., LLaMA-2, Mistral, Claude)

Reasoning: Given the retrieved documents and the question in the prompt, you need to generate a natural language answer. Decoder-only models are the state of the art for this. The retrieved context is concatenated into the prompt.

Production considerations:

Retriever should be fast - the query embedding is on the critical path
Use approximate nearest neighbor search (FAISS IVF) for 1M documents
Re-ranking stage: a cross-encoder (also encoder-only) can re-score the top-50 retrieved docs more accurately before passing the top-5 to the generator
Generator context window determines how many retrieved chunks you can use
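
The last point can be sketched as a greedy packing step between the re-ranker and the generator. The helper below is hypothetical: it assumes the chunks arrive sorted best-first and uses a whitespace word count as a crude stand-in for the generator's real tokenizer.

```python
def pack_context(ranked_chunks, token_budget, count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked chunks that fit the generator's budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:              # assumed sorted best-first
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue                         # skip chunks that would overflow
        selected.append(chunk)
        used += cost
    return selected, used


ranked = [
    "Refunds are issued within 14 days of a return request.",
    "Our offices are closed on public holidays.",
    "Return requests must include the original order number.",
]
chunks, used = pack_context(ranked, token_budget=18)
print(used, chunks)
```

In practice the budget is the context window minus the space reserved for the question, the instructions, and the generated answer.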

Q5: BERT performs bidirectional attention. Why is this a problem for text generation?

Answer: Bidirectional attention allows each position to see future tokens. For text generation (predicting the next token), this creates an impossible task during training:

If we train a bidirectional model to predict token $t$, it can simply look at token $t$ (which is in the input) and copy it. The model achieves near-zero loss without learning anything meaningful.

More fundamentally: autoregressive generation requires a model that, given only tokens $1, \ldots, t-1$, predicts token $t$. A bidirectional model has seen all tokens $1, \ldots, T$ during training. At generation time, the future tokens don't exist - the model is being asked to operate in a mode it was never trained for.

There is no straightforward way to generate autoregressively with a purely bidirectional model without retraining.

BERT's pretraining task (masked LM) is different: predict specific masked tokens given both left and right context. This is not the same as next-token prediction. A BERT model can fill in a mask, but it was never trained to produce text left to right, one token at a time.

This is the fundamental reason encoder-only models are used for understanding tasks (they excel with full context) and decoder-only models are used for generation (they're trained for it).
