LLaMA Family Architecture

A Tuesday Morning at 2 AM

The on-call engineer at a mid-sized fintech gets paged. A customer-facing AI assistant has gone down. The model vendor - a closed API - is experiencing an outage. There is no fallback. Fifty thousand users see a spinner where their financial summaries should be. The outage lasts six hours.

Three months later, that same engineer is presenting to the CTO. The proposal: move the entire AI stack to self-hosted open-source models. The CTO pushes back. "Which model? Trained by who? On what data? Are we sure it's safe?" The engineer pulls up a slide with a single word: LLaMA.

That conversation - some version of it - happened at hundreds of companies between 2023 and 2024. Meta's LLaMA models were the first serious open-source language models that could be deployed in production, iterated on, fine-tuned, quantized, and run without paying per-token fees to a closed API. They changed the economics of AI deployment.

But LLaMA is not one model. It is a family of architectures, each version introducing specific engineering improvements that solve concrete problems. Understanding why each choice was made - RoPE instead of learned position embeddings, SwiGLU instead of GELU, RMSNorm instead of LayerNorm, GQA instead of MHA - is the difference between a practitioner who can deploy and debug these systems and one who treats them as black boxes.

This lesson covers the full LLaMA lineage: what changed, why it changed, and what it means for you when you are choosing, fine-tuning, or deploying a model in production. By the end, you will be able to look at any transformer architecture spec and understand not just what it does but why it was chosen over the alternatives.


Why This Exists

The Problem Before LLaMA

Before February 2023, if you wanted to run a serious language model, you had two options. You could use a closed API (GPT-4, Claude, PaLM) and accept that you had no control over the model, no ability to fine-tune it on your data, and complete dependency on a vendor's uptime and pricing. Or you could train your own model from scratch, which required hundreds of millions of dollars in compute and a research team most companies cannot afford.

There were open models before LLaMA. BLOOM (176B parameters) was open. GPT-J, GPT-NeoX - these existed. But they were either too large to run efficiently, trained on data of questionable quality, or architecturally outdated. None of them could hold their own against GPT-3.5 in a real benchmark.

The deeper problem was not just model quality. It was the absence of a reference architecture that practitioners could trust, study, and build on. The research community needed a well-engineered baseline that was small enough to experiment with but competitive enough to matter.

What LLaMA Solved

Meta's LLaMA paper (Touvron et al., 2023) did something that seems obvious in retrospect. It asked: what is the best-performing model you can train on publicly available data, at a compute budget a research lab can afford, optimized for inference efficiency rather than raw parameter count?

That framing - inference efficiency, not just benchmark performance - was the shift. GPT-3 was 175B parameters. LLaMA 7B and 13B matched it on many benchmarks while being 10-25x smaller. The reason was training compute: LLaMA was trained on far more tokens than most models at that size. A 7B model trained on 1 trillion tokens can outperform a 13B model trained on 300 billion tokens, because the smaller model has seen more data and converged further.


Historical Context

The LLaMA 1 Paper (February 2023)

Hugo Touvron and colleagues at Meta AI released LLaMA 1 in February 2023. The paper was initially restricted to researchers under a noncommercial license, but the weights leaked within a week and spread across the internet. Within days, people were running LLaMA 7B on laptops with 8GB of RAM using 4-bit quantization.

The "aha moment" in the LLaMA 1 paper was the training token argument. The authors showed that for a fixed inference budget (i.e., you want the fastest possible model at inference time), you should train a smaller model on more data rather than training a larger model on less data. This was a direct challenge to the Chinchilla scaling law interpretation that had dominated 2022.

Chinchilla (Hoffmann et al., 2022) had shown that models were being undertrained - you should train roughly 20 tokens per parameter for compute-optimal training. But "compute-optimal" means optimal for the training run, not for inference. If you are going to run a model billions of times, you want the smallest possible model that can achieve your target quality, even if it costs more to train.

LLaMA 1's insight: train a 7B model on 1T tokens (far more than Chinchilla-optimal for that size), and you get a model that is cheap to serve but highly capable.

LLaMA 2 (July 2023)

LLaMA 2 arrived five months later with a more permissive license (commercial use allowed for most companies), a longer context (4096 tokens vs 2048), and the introduction of Grouped Query Attention (GQA) at the 70B scale. It also shipped with RLHF-tuned chat variants; Meta later released Llama Guard for safety filtering alongside the family.

LLaMA 3 (April 2024) and LLaMA 3.1 (July 2024)

LLaMA 3 was a step change. The vocabulary expanded from 32,000 to 128,256 tokens (based on tiktoken's cl100k_base tokenizer), context grew to 8,192 tokens for the base release and 128,000 tokens for LLaMA 3.1. GQA was applied at all model sizes (not just 70B), multilingual capability was added, and the training data was expanded to 15 trillion tokens of high-quality text.

LLaMA 3.1 405B became the first open-source model widely considered competitive with GPT-4.

LLaMA 3.2 and 3.3 (Late 2024)

LLaMA 3.2 introduced multimodal variants (11B and 90B vision models) and lightweight models (1B and 3B) optimized for on-device deployment. LLaMA 3.3 released a 70B model that matched the performance of LLaMA 3.1 405B at a fraction of the compute cost, representing continued efficiency improvements.


Core Architecture Concepts

The Transformer Foundation

Every LLaMA model is a decoder-only transformer: it processes tokens left-to-right, attends to all previous tokens, and predicts the next token. This is the same basic architecture as GPT-2, GPT-3, and most modern LLMs. The architectural decisions that differentiate LLaMA from these predecessors are in the details: how positions are encoded, how attention is computed, how layers are normalized, and what activation function is used in the feed-forward network.

Let's examine each choice in turn.


RoPE: Rotary Position Embeddings

The problem with learned position embeddings

In earlier GPT-style transformers, position was encoded by adding a learned embedding vector to each token: the token at position 1 gets embedding $e_1$, the token at position 512 gets embedding $e_{512}$. The model learns what these mean during training.

The problem: if you trained on sequences up to length 2048, position embeddings 2049 through 4096 never appeared during training. The model has no idea what to do with them. Context length is hard-capped at training length with no principled way to extrapolate.

What RoPE does

RoPE (Su et al., 2021) encodes position differently. Instead of adding a position embedding to the token representation, it rotates the query and key vectors before computing attention. For a query vector at position $m$ and a key vector at position $n$, the attention score depends only on the relative position $m - n$ through the rotation.

Formally, for a 2D subspace, RoPE applies a rotation matrix:

$$R_m = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

where $\theta$ is a fixed frequency for each dimension pair (inspired by sinusoidal embeddings). The key property is that the dot product $q_m \cdot k_n$ depends only on $m - n$ (relative distance), not on the absolute positions.

Why this matters for extrapolation

Because position is encoded as rotation, the model learns to attend based on how far apart tokens are, not where they are in absolute terms. This means you can extend context at inference time beyond training length using techniques like RoPE scaling or NTK-aware scaling. LLaMA 3.1 was trained with a 128k context using RoPE scaling - impossible with learned absolute position embeddings.

Intuition check: imagine a piece of music. Whether you play it starting at measure 1 or measure 100, the relationships between notes are the same. RoPE lets the model reason about "token A is 50 positions before token B" without caring whether A is at position 5 or position 5005.
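To make the relative-position property concrete, here is a minimal sketch (names and values are illustrative, not taken from the LLaMA codebase): it rotates a 2D query and key by angles proportional to their positions and shows that the dot product depends only on the position difference.

import math
import torch

def rotate_2d(vec: torch.Tensor, pos: int, theta: float = 0.1) -> torch.Tensor:
    """Rotate a 2D vector by pos * theta radians (one RoPE frequency pair)."""
    angle = pos * theta
    rot = torch.tensor([[math.cos(angle), -math.sin(angle)],
                        [math.sin(angle),  math.cos(angle)]])
    return rot @ vec

q = torch.tensor([1.0, 0.5])
k = torch.tensor([0.3, 0.8])

# Same relative distance (50), different absolute positions
score_a = rotate_2d(q, 100) @ rotate_2d(k, 150)
score_b = rotate_2d(q, 200) @ rotate_2d(k, 250)
print(score_a.item(), score_b.item())  # identical up to float error: only m - n matters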


SwiGLU Activation Function

The feed-forward layer in standard transformers

Every transformer layer has two sub-layers: multi-head attention, and a feed-forward network (FFN). The FFN is:

$$\text{FFN}(x) = \text{activation}(xW_1 + b_1)\,W_2 + b_2$$

Original transformers used ReLU. GPT-2 and GPT-3 used GELU (Gaussian Error Linear Unit). LLaMA uses SwiGLU.

What SwiGLU is

SwiGLU (Shazeer, 2020) is a gated linear unit variant. Instead of one projection and one activation, it uses two parallel projections and multiplies them:

$$\text{SwiGLU}(x, W, V, b, c) = \text{Swish}(xW + b) \otimes (xV + c)$$

where $\text{Swish}(x) = x \cdot \sigma(x)$ (the sigmoid-weighted linear unit) and $\otimes$ is elementwise multiplication.

In LLaMA's implementation (no bias terms):

$$\text{FFN}_{\text{SwiGLU}}(x) = (\text{Swish}(xW_1) \otimes xW_3)\,W_2$$

This requires three weight matrices instead of two, so to keep the parameter count roughly equivalent, LLaMA scales the hidden dimension down to $2/3$ of the standard $4d$ (roughly $8d/3$, before rounding).

Why SwiGLU improves performance

The gating mechanism allows the network to selectively suppress certain features. Think of it as the FFN learning both "what is this feature" and "how much should I use it." Empirically, SwiGLU consistently outperforms ReLU and GELU on language modeling at fixed parameter count, which is why every major LLM since 2022 has adopted it.
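A minimal sketch of a SwiGLU feed-forward block, assuming LLaMA-3-8B-like dimensions (4096 hidden, 14336 intermediate); the module and variable names are illustrative rather than copied from any particular codebase:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: Swish(x W1) * (x W3), projected back down by W2. No biases."""
    def __init__(self, dim: int = 4096, hidden_dim: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.silu is the Swish activation: x * sigmoid(x)
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

ffn = SwiGLUFFN()
x = torch.randn(2, 16, 4096)   # (batch, seq_len, dim)
print(ffn(x).shape)            # torch.Size([2, 16, 4096])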


RMSNorm: Root Mean Square Layer Normalization

The problem with LayerNorm

Standard Layer Normalization (Ba et al., 2016) normalizes activations by subtracting the mean and dividing by the standard deviation, then applies learned scale and shift parameters:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \cdot \gamma + \beta$$

where $\mu$ and $\sigma$ are computed over the feature dimension. This involves computing both the mean and the variance - two passes through the activations.

RMSNorm simplifies this

RMSNorm (Zhang and Sennrich, 2019) drops the mean-centering and only normalizes by root mean square:

$$\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma$$

where $\text{RMS}(x) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$

No mean subtraction, no shift parameter $\beta$. Just divide by the RMS, then scale.

The engineering argument

RMSNorm is 7-15% faster than LayerNorm in practice (it eliminates the mean computation and the bias parameter). The empirical quality difference is negligible. For a model running billions of forward passes, 10% faster normalization compounds into meaningful cost savings.

LLaMA also applies normalization before the attention and FFN sub-layers (Pre-LN rather than Post-LN), which improves training stability for large models.


Grouped Query Attention (GQA)

Multi-Head Attention and the KV cache problem

In standard Multi-Head Attention (MHA), every attention head has its own Query, Key, and Value weight matrices. During inference, you maintain a Key-Value cache (KV cache) that stores the K and V tensors for all previous tokens so you do not recompute them.

For a model with $h$ heads, hidden dimension $d$, and sequence length $s$:

$$\text{KV cache size} = 2 \times h \times \frac{d}{h} \times s \times \text{num\_layers} \times \text{bytes\_per\_element}$$

For a 70B-scale model using full MHA (64 heads, head dimension 128, 80 layers) at 8192 sequence length in FP16, this is about 21.5 GB per sequence. Serve a batch of just eight concurrent requests and the KV cache alone exceeds 170 GB - more than the ~140 GB of FP16 model weights.

Multi-Query Attention (MQA)

Shazeer (2019) proposed Multi-Query Attention: use a single K and V head shared across all Q heads. This reduces the KV cache by a factor of $h$ (the number of heads). The problem: quality degrades for large models.

GQA: the middle ground

GQA (Ainslie et al., 2023) divides the $h$ query heads into $g$ groups, each group sharing one K and one V head. For LLaMA 3 70B with 64 query heads and 8 KV heads, $g = 8$, so the KV cache is $8/64 = 1/8$ the size of MHA - an 8x reduction.

The quality tradeoff: GQA with 8 KV heads is nearly indistinguishable from MHA at the 70B scale. The memory savings are enormous.

$$\text{GQA KV cache} = \frac{1}{h/g} \times \text{MHA KV cache}$$

For LLaMA 3 8B: 32 query heads and 8 KV heads, so 4 query heads per group and a 4x reduction.
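A minimal sketch of the GQA mechanism (dimensions are illustrative and this is not LLaMA's actual implementation): each group of query heads attends against a single shared K/V head, which is expanded with repeat_interleave before the standard attention computation.

import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 128
n_q_heads, n_kv_heads = 32, 8                 # LLaMA-3-8B-like head counts
group_size = n_q_heads // n_kv_heads          # 4 query heads share one KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # cached: 4x smaller than MHA
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand each KV head across its query group; only the small K/V tensors are cached
k_expanded = k.repeat_interleave(group_size, dim=1)      # (batch, 32, seq, head_dim)
v_expanded = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k_expanded, v_expanded, is_causal=True)
print(out.shape)  # torch.Size([2, 32, 16, 128])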


LLaMA Version Comparison

| Feature | LLaMA 1 | LLaMA 2 | LLaMA 3 | LLaMA 3.1 |
|---|---|---|---|---|
| Sizes | 7B, 13B, 33B, 65B | 7B, 13B, 34B, 70B | 8B, 70B, 405B | 8B, 70B, 405B |
| Context | 2048 | 4096 | 8192 | 128,000 |
| Vocabulary | 32,000 | 32,000 | 128,256 | 128,256 |
| Normalization | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
| Position | RoPE | RoPE | RoPE | RoPE + scaling |
| Activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU |
| GQA | No | 70B only | All sizes | All sizes |
| Training tokens | 1T | 2T | 15T | 15T+ |
| License | Non-commercial | Commercial (most) | Commercial | Commercial |
| Multilingual | No | Limited | Yes (8 languages) | Yes (8 languages) |


Code Examples

Loading a LLaMA Model with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load LLaMA 3.1 8B Instruct
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # BF16 is preferred over FP16 for LLaMA 3
    device_map="auto",                        # Distributes across available GPUs
    attn_implementation="flash_attention_2",  # 2-3x faster attention
)

# LLaMA 3 uses a special chat format
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain grouped query attention in one paragraph."},
]

# Apply the chat template - critical for instruct models
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the new tokens (not the prompt)
response = tokenizer.decode(
    outputs[0][input_ids.shape[-1]:],
    skip_special_tokens=True,
)
print(response)

Inspecting the Architecture Programmatically

from transformers import AutoModelForCausalLM, AutoConfig
import torch

# Load config without downloading weights
config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

print("=== LLaMA 3.1 8B Architecture ===")
print(f"Hidden size: {config.hidden_size}") # 4096
print(f"Intermediate size: {config.intermediate_size}") # 14336 (SwiGLU 3x hidden)
print(f"Num attention heads: {config.num_attention_heads}") # 32 query heads
print(f"Num KV heads: {config.num_key_value_heads}") # 8 KV heads (GQA)
print(f"Num layers: {config.num_hidden_layers}") # 32
print(f"Max position: {config.max_position_embeddings}")# 131072 (128k)
print(f"Vocabulary size: {config.vocab_size}") # 128256
print(f"RoPE theta: {config.rope_theta}") # 500000.0

# Compute KV cache size for 8192 tokens at BF16
num_layers = config.num_hidden_layers # 32
num_kv_heads = config.num_key_value_heads # 8
head_dim = config.hidden_size // config.num_attention_heads # 128
seq_len = 8192
bytes_per_element = 2 # BF16

kv_cache_bytes = (
2 * # K and V
num_layers *
num_kv_heads *
head_dim *
seq_len *
bytes_per_element
)
print(f"\nKV cache at 8192 tokens: {kv_cache_bytes / 1e9:.2f} GB") # ~0.54 GB

# Compare with hypothetical MHA (32 KV heads)
kv_cache_mha = kv_cache_bytes * (config.num_attention_heads / config.num_key_value_heads)
print(f"Equivalent MHA KV cache: {kv_cache_mha / 1e9:.2f} GB") # ~2.15 GB
print(f"GQA savings: {kv_cache_mha / kv_cache_bytes:.1f}x") # 4.0x

Implementing RMSNorm from Scratch

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """
    Root Mean Square Layer Normalization.
    Used in LLaMA instead of standard LayerNorm.
    Faster: no mean subtraction, no bias parameter.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable scale (gamma)
        # No bias (beta) - RMSNorm drops it

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, dim)
        # Compute RMS over the last dimension
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize in float32 for numerical stability, then cast back
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


class LayerNorm(nn.Module):
    """Standard LayerNorm for comparison."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))   # scale
        self.bias = nn.Parameter(torch.zeros(dim))    # shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True, unbiased=False)
        return self.weight * (x - mean) / (var + self.eps).sqrt() + self.bias


# Quick benchmark
import time

dim = 4096
batch_size = 32
seq_len = 512
x = torch.randn(batch_size, seq_len, dim).cuda()

rms_norm = RMSNorm(dim).cuda()
layer_norm = LayerNorm(dim).cuda()

# Warmup
for _ in range(10):
    _ = rms_norm(x)
    _ = layer_norm(x)

torch.cuda.synchronize()

# Time RMSNorm
start = time.time()
for _ in range(1000):
    _ = rms_norm(x)
torch.cuda.synchronize()
rms_time = time.time() - start

# Time LayerNorm
start = time.time()
for _ in range(1000):
    _ = layer_norm(x)
torch.cuda.synchronize()
ln_time = time.time() - start

print(f"RMSNorm:   {rms_time:.3f}s")
print(f"LayerNorm: {ln_time:.3f}s")
print(f"Speedup: {ln_time / rms_time:.2f}x")  # Typically 1.1-1.15x

RoPE Implementation

import torch
import torch.nn.functional as F
from typing import Tuple

def precompute_freqs_cis(dim: int, seq_len: int, theta: float = 500000.0) -> torch.Tensor:
    """
    Precompute the complex rotation frequencies for RoPE.
    LLaMA 3 uses theta=500000 (vs 10000 in original RoPE).
    Higher theta = slower rotation = better long-context behavior.
    """
    # Frequencies: theta^(-2i/dim) for i in 0..dim/2
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # Positions
    t = torch.arange(seq_len)
    # Outer product: (seq_len, dim/2)
    freqs = torch.outer(t, freqs)
    # Convert to complex: e^(i * freq) = cos(freq) + i*sin(freq)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
    return freqs_cis


def apply_rotary_emb(
    xq: torch.Tensor,   # Query: (batch, seq_len, n_heads, head_dim)
    xk: torch.Tensor,   # Key:   (batch, seq_len, n_kv_heads, head_dim)
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Apply rotary position embeddings to Q and K tensors."""
    # Reshape to complex: pairs along the last dim -> complex numbers
    xq_r = xq.float().reshape(*xq.shape[:-1], -1, 2)
    xk_r = xk.float().reshape(*xk.shape[:-1], -1, 2)
    xq_complex = torch.view_as_complex(xq_r)
    xk_complex = torch.view_as_complex(xk_r)

    # freqs_cis shape: (seq_len, head_dim/2) -> broadcast over batch and heads
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)  # (1, seq_len, 1, head_dim/2)

    # Multiply = rotate in the complex plane
    xq_out = torch.view_as_real(xq_complex * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_complex * freqs_cis).flatten(3)

    return xq_out.type_as(xq), xk_out.type_as(xk)


# Demonstrate the relative-position property of RoPE
def demo_rope_relative_distance():
    dim = 64
    seq_len = 512
    freqs = precompute_freqs_cis(dim, seq_len)

    q = torch.randn(1, 1, 1, dim)
    k = torch.randn(1, 1, 1, dim)

    # Rotate q as if at position 100 and k as if at position 150 (relative distance: 50)
    q_rot_100, _ = apply_rotary_emb(q, q, freqs[100:101])
    _, k_rot_150 = apply_rotary_emb(k, k, freqs[150:151])
    score_100_150 = (q_rot_100 * k_rot_150).sum()

    # Same two vectors at positions 200 and 250 (same relative distance: 50)
    q_rot_200, _ = apply_rotary_emb(q, q, freqs[200:201])
    _, k_rot_250 = apply_rotary_emb(k, k, freqs[250:251])
    score_200_250 = (q_rot_200 * k_rot_250).sum()

    print(f"Score at positions (100, 150): {score_100_150:.4f}")
    print(f"Score at positions (200, 250): {score_200_250:.4f}")
    # The scores match - RoPE encodes relative position, not absolute

Quantization for Production Deployment

# 4-bit quantization with bitsandbytes - the standard for deploying LLaMA on consumer hardware
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# BnB 4-bit NF4 config (recommended for LLaMA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 - best quality for 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,   # Compute in BF16
    bnb_4bit_use_double_quant=True,          # Quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LLaMA 3.1 8B in 4-bit NF4: roughly 5-6GB VRAM (vs ~16GB in BF16)
# Quality loss on standard benchmarks: small but measurable
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

# For fine-tuning quantized models, use QLoRA (LoRA on top of the 4-bit base)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # All linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts - with r=16 on all linear layers,
# well under 1% of the 8B parameters are trainable

Production Engineering Notes

Memory Planning

Before deploying any LLaMA model, calculate your memory budget. The formula is:

Total VRAM = Model weights + KV cache + Activation memory + Framework overhead

Model weights (FP16/BF16): params * 2 bytes
Model weights (INT8): params * 1 byte
Model weights (INT4/NF4): params * 0.5 bytes

KV cache: 2 * layers * kv_heads * head_dim * max_seq_len * max_batch_size * 2 bytes

For LLaMA 3.1 8B serving 100 concurrent users at 2048 tokens each in BF16:

  • Weights: 8B * 2 bytes = 16 GB
  • KV cache: 2 * 32 * 8 * 128 * 2048 * 100 * 2 = ~27 GB

That is ~43 GB before activation memory and framework overhead - workable on a single A100 80GB, but double the context to 4096 tokens and you are at the card's limit; at 8192 tokens the KV cache alone (~107 GB) no longer fits. Options: 4-bit quantization, reduce batch size, use continuous batching with shorter average sequences, or use vLLM's PagedAttention.
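A small helper that turns the formula above into numbers; the function name and defaults are illustrative, and it deliberately ignores activation memory and framework overhead.

def llama_memory_budget_gb(
    params_b: float = 8.0,        # model parameters, in billions
    bytes_per_weight: float = 2,  # 2 = BF16, 1 = INT8, 0.5 = INT4/NF4
    layers: int = 32,
    kv_heads: int = 8,
    head_dim: int = 128,
    seq_len: int = 2048,
    batch_size: int = 100,
    kv_bytes: int = 2,            # BF16 KV cache
) -> dict:
    """Rough VRAM planning for a LLaMA-style decoder (weights + KV cache only)."""
    weights = params_b * 1e9 * bytes_per_weight
    kv_cache = 2 * layers * kv_heads * head_dim * seq_len * batch_size * kv_bytes
    return {
        "weights_gb": weights / 1e9,
        "kv_cache_gb": kv_cache / 1e9,
        "total_gb": (weights + kv_cache) / 1e9,
    }

print(llama_memory_budget_gb())              # ~16 GB weights, ~27 GB KV cache
print(llama_memory_budget_gb(seq_len=8192))  # KV cache alone ~107 GB - does not fit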

vLLM for Production Serving

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,        # Number of GPUs
    gpu_memory_utilization=0.90,   # Reserve 10% for activations
    max_model_len=8192,            # Limit context to reduce KV cache
    dtype="bfloat16",
)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512,
)

prompts = [
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nExplain RoPE<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Choosing the Right LLaMA Variant

| Use Case | Recommended Model | Reason |
|---|---|---|
| Laptop / edge deployment | LLaMA 3.2 1B or 3B | Fits in 2-4 GB, designed for on-device |
| Single A100 80GB production | LLaMA 3.1 8B (BF16) or 70B (4-bit) | Best quality per GPU |
| Research and fine-tuning | LLaMA 3.1 8B | GQA, 128k context, most tooling support |
| GPT-4-class tasks | LLaMA 3.1 405B | Only open model at this tier |
| Cost-optimized 70B quality | LLaMA 3.3 70B | Matches 405B at ~6x lower cost |
| Vision tasks | LLaMA 3.2 11B or 90B | Only vision-capable variants |
| Long document processing | LLaMA 3.1 (any size) | 128k context with RoPE scaling |

Common Mistakes

:::danger Using LLaMA 2 Chat Format with LLaMA 3 Models LLaMA 2 and LLaMA 3 use completely different chat templates. LLaMA 2 uses [INST] and [/INST] tags. LLaMA 3 uses a structured header format with <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, and <|eot_id|> tokens.

Using the wrong template causes the model to behave erratically, ignore system prompts, and produce low-quality outputs. Always use tokenizer.apply_chat_template() - never construct the prompt string manually unless you have verified the exact format from the model card. :::
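To see exactly what prompt string the model receives, you can render the chat template without tokenizing - a quick sanity check when debugging formatting issues (the messages below are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain RoPE briefly."},
]

# tokenize=False returns the raw prompt string the model will actually see,
# including the <|begin_of_text|>, <|start_header_id|>, and <|eot_id|> markers
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)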

:::danger Forgetting pad_token_id During Generation LLaMA tokenizers do not define a pad token by default. If you call model.generate() without setting pad_token_id, you will get a warning and undefined behavior during batch generation. Always set pad_token_id=tokenizer.eos_token_id or configure the tokenizer explicitly.

# WRONG - will warn and may produce garbage
output = model.generate(input_ids, max_new_tokens=100)

# CORRECT
output = model.generate(input_ids, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)

:::

:::warning Using FP32 for LLaMA Inference LLaMA models were trained in BF16 mixed precision. Loading them in FP32 doubles your memory usage with no quality benefit and slower throughput. Always use torch_dtype=torch.bfloat16 (preferred) or torch_dtype=torch.float16. FP32 is appropriate only for quantization-aware fine-tuning of very small models. :::

:::warning Underestimating KV Cache Memory New practitioners focus on model weight memory and forget the KV cache. At long contexts with large batches, the KV cache can exceed the model weights in size. LLaMA 3.1 70B at the full 128k context in BF16 requires roughly 43 GB of KV cache per sequence - a batch of four already exceeds the ~140 GB of BF16 weights. Always pre-calculate your KV cache budget before deployment. :::

:::warning Confusing Base and Instruct Models LLaMA base models (no -Instruct suffix) are pretrained but not instruction-tuned. They will complete text, not follow instructions. A base model given "Tell me about Paris" will continue the sentence as a document, not answer the question. Always use -Instruct variants for chat or instruction-following applications. :::


Interview Q&A

Q1: Why does LLaMA use RoPE instead of learned absolute position embeddings? What is the practical impact?

Answer:

Learned absolute position embeddings assign a unique trainable vector to each position index. This creates two problems. First, the model never sees positions beyond its training context length, so extrapolation to longer sequences is unprincipled - the model may hallucinate or produce garbage for out-of-distribution positions. Second, the attention score between tokens depends on their absolute positions, not their relative distance, which is less natural for language (the relationship between "cat" and "sat" should be the same whether they appear at positions 5-6 or 500-501).

RoPE encodes position as a rotation applied to the query and key vectors. The rotation angle is proportional to position, and the dot product between a rotated query and rotated key naturally depends only on the difference in their rotation angles - i.e., their relative distance. This gives RoPE a built-in relative position bias without requiring explicit relative position matrices.

The practical impact: with RoPE, you can extend context at inference time using techniques like RoPE scaling (multiply the rotation frequencies by a constant) or NTK-aware scaling (adjust frequencies based on the extended sequence length). LLaMA 3.1 uses RoPE theta of 500,000 (vs the original 10,000) and was trained at 128k context - this would be impossible with learned absolute embeddings.

In production, this means LLaMA 3.1 8B can process entire legal contracts, codebases, or research papers in a single context window - a capability directly enabled by RoPE's extrapolation properties.


Q2: What is Grouped Query Attention, and why was it introduced in LLaMA? Calculate the KV cache savings for LLaMA 3 70B.

Answer:

Standard Multi-Head Attention (MHA) gives each of the $h$ attention heads its own set of Key and Value weight matrices. During autoregressive decoding, you cache the K and V tensors for all past tokens to avoid recomputation (the KV cache). For large models, this cache is enormous.

Grouped Query Attention (GQA, Ainslie et al. 2023) divides the $h$ query heads into $g$ groups, with each group sharing one K head and one V head. This reduces the KV cache by a factor of $h/g$.

For LLaMA 3 70B: 64 query heads, 8 KV heads, 80 layers, head dimension 128, BF16 (2 bytes).

MHA KV cache per token: $2 \times 80 \times 64 \times 128 \times 2$ bytes $\approx 2.62$ MB.
GQA KV cache per token: $2 \times 80 \times 8 \times 128 \times 2$ bytes $\approx 0.33$ MB.

At 8192 tokens: MHA requires ~21.5 GB just for KV cache; GQA requires ~2.7 GB. That is an 8x reduction, freeing ~19 GB of VRAM that can be used for larger batch sizes or longer contexts.
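A quick check of that arithmetic (values hard-coded from the LLaMA 3 70B configuration described above):

layers, q_heads, kv_heads, head_dim, bytes_bf16 = 80, 64, 8, 128, 2

mha_per_token = 2 * layers * q_heads * head_dim * bytes_bf16   # K and V, all heads
gqa_per_token = 2 * layers * kv_heads * head_dim * bytes_bf16  # K and V, shared heads

seq_len = 8192
print(mha_per_token / 1e6, "MB/token ->", mha_per_token * seq_len / 1e9, "GB")  # ~2.62 MB -> ~21.5 GB
print(gqa_per_token / 1e6, "MB/token ->", gqa_per_token * seq_len / 1e9, "GB")  # ~0.33 MB -> ~2.7 GB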

The quality cost is minimal at the 70B scale. GQA with 8 KV heads matches MHA within ~0.5% on standard benchmarks. This is the reason GQA was adopted universally - the memory savings are massive and the quality penalty is negligible.


Q3: Explain the SwiGLU activation function. Why does it require three weight matrices instead of two, and how does LLaMA compensate to maintain parameter count?

Answer:

Standard FFN in transformers: $\text{FFN}(x) = \text{activation}(xW_1)W_2$. Two weight matrices: $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d}$.

SwiGLU replaces this with: $\text{FFN}(x) = (\text{Swish}(xW_1) \otimes xW_3)W_2$

Now there are three matrices: $W_1, W_3 \in \mathbb{R}^{d \times d_{ff}}$ (projections into the FFN space) and $W_2 \in \mathbb{R}^{d_{ff} \times d}$ (projection back). The Swish gate modulates which features from the second projection pass through.

To keep the total parameter count equal to a standard FFN with hidden dimension $4d$, LLaMA scales $d_{ff}$ down by $2/3$:

Standard FFN: $2 \times d \times 4d = 8d^2$ parameters.
SwiGLU FFN: $3 \times d \times \frac{8d}{3} = 8d^2$ parameters (the same).

For LLaMA 3 8B: $d = 4096$, so a standard FFN would use $d_{ff} = 16384$. The SwiGLU baseline $\frac{8 \times 4096}{3} \approx 10923$ is then scaled by an FFN-dimension multiplier of 1.3 and rounded up to a hardware-friendly multiple, giving the actual $d_{ff} = 14336$.
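A sketch of that dimension calculation, following the approach used in Meta's reference implementation; the multiplier and rounding values shown here are assumptions tied to the 8B configuration described above.

def llama_ffn_dim(dim: int, ffn_dim_multiplier: float = 1.3, multiple_of: int = 1024) -> int:
    hidden_dim = 4 * dim                  # standard transformer FFN width
    hidden_dim = int(2 * hidden_dim / 3)  # scale by 2/3 to pay for the third SwiGLU matrix
    hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to a multiple for efficient GPU kernels
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

print(llama_ffn_dim(4096))  # 14336 - matches LLaMA 3 8B's intermediate_size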

Why SwiGLU is better: the gating mechanism allows the network to multiplicatively suppress features. This is a form of dynamic feature selection that pure additive activations cannot achieve. Empirically, SwiGLU consistently improves perplexity by 0.5-1 points at fixed parameter count, which is why it was adopted by PaLM, LLaMA, Mistral, and most post-2022 LLMs.


Q4: What changed between LLaMA 1 and LLaMA 3 in terms of vocabulary size, and why does this matter?

Answer:

LLaMA 1 and 2 used a SentencePiece BPE tokenizer with a vocabulary of 32,000 tokens. LLaMA 3 switched to tiktoken (the tokenizer used in OpenAI's models) with 128,256 tokens.

The vocabulary size matters for several reasons:

  1. Tokenization efficiency: A larger vocabulary means more words are represented as single tokens rather than split into subwords. "tokenization" might be one token in LLaMA 3 but two tokens in LLaMA 2. More efficient tokenization packs more semantic content into the same context window. LLaMA 3's tokenizer produces noticeably fewer tokens than LLaMA 2's on English text and is substantially more efficient on code and multilingual text (see the comparison sketch after this list).

  2. Code and math performance: Code tokens like __init__, self., return are common enough to warrant single-token representation. With 128k vocabulary, many such patterns are single tokens, reducing sequence length and improving model focus on semantics rather than syntax reconstruction.

  3. Multilingual capability: 32k vocabulary covers English well but struggles with non-Latin scripts. With 128k vocabulary, LLaMA 3 can represent Korean, Arabic, Chinese, and other scripts without extreme fragmentation.

  4. LM head cost: The language model head is a linear layer from the hidden dimension to the vocabulary size. For LLaMA 3 8B: $4096 \times 128256 \approx 525M$ parameters just in the LM head - a non-trivial fraction of total parameters. The smallest LLaMA 3.2 models (1B and 3B) tie the LM head to the input embedding matrix to amortize this cost; the larger models keep them separate.
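A quick way to compare the two tokenizers on your own text; both repos are gated on Hugging Face, so this assumes you have been granted access.

from transformers import AutoTokenizer

text = "def __init__(self, hidden_size=4096): return self.tokenization"

tok_llama2 = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

print("LLaMA 2 tokens:", len(tok_llama2.encode(text)))
print("LLaMA 3 tokens:", len(tok_llama3.encode(text)))
# LLaMA 3's 128k vocabulary typically yields fewer tokens, especially on code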


Q5: When would you choose LLaMA 3.3 70B over LLaMA 3.1 405B for a production deployment?

Answer:

LLaMA 3.3 70B is the right choice in almost all cost-sensitive production deployments where 405B-level quality is acceptable but inference cost is a concern.

The key facts: LLaMA 3.3 70B was trained with improved data recipes and achieves benchmark parity with LLaMA 3.1 405B on most standard benchmarks (MMLU, HumanEval, MATH, etc.). The inference cost difference is roughly 6x (405B requires ~6x more compute per token than 70B at equivalent hardware utilization).

Choose LLaMA 3.3 70B when:

  • You are running a high-volume production API where cost per token matters
  • You have 1-2 A100 80GB GPUs and want to run in BF16 (70B fits, 405B does not)
  • Your task is well-represented by standard benchmarks (instruction following, coding, reasoning, summarization)
  • You need to run multiple fine-tuned variants and cannot afford 405B for each

Choose LLaMA 3.1 405B when:

  • You need the absolute frontier of open-source capability for a task where 70B measurably underperforms (often very long-context tasks, complex multi-step reasoning, or specialized domains)
  • You are creating synthetic training data for smaller models (405B's output quality matters when distilling into 8B or 70B)
  • You are running offline batch inference where throughput matters more than latency
  • You are benchmarking against GPT-4 and need the most comparable open model

In practice, 70B is the right default. Upgrade to 405B only when you have demonstrated with eval data that 70B is insufficient for your specific task.


Q6: Describe the LLaMA training data strategy. Why does "more tokens, smaller model" make sense for a deployed system?

Answer:

The central insight from the LLaMA 1 paper is the distinction between compute-optimal training (minimizing training FLOPs to reach a target loss) and inference-optimal training (minimizing inference FLOPs to reach a target quality threshold).

Chinchilla (Hoffmann et al., 2022) showed that for a given training compute budget $C \approx 6ND$, model size $N$ and training tokens $D$ should be scaled together - roughly $D \approx 20N$ tokens per parameter, so the optimal $N$ grows as $\sqrt{C}$. This is "Chinchilla-optimal": you minimize training loss per training FLOP.

But this optimizes the wrong thing for deployment. If you are going to run a model 10 billion times, you want the cheapest possible model that meets your quality bar. A 7B model costs 7x less to run than a 70B model. So if you can train a 7B model to 70B quality by training it on 10x more tokens, that is a good investment.

LLaMA 1 trained 7B models on 1 trillion tokens (far above Chinchilla-optimal for 7B, which would suggest ~140B tokens). LLaMA 3 scaled this to 15 trillion tokens for 8B and 70B models. The result: LLaMA 3 8B outperforms LLaMA 2 70B despite being 9x smaller, because it has seen far more data.
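The over-training ratio is easy to quantify; here is a back-of-the-envelope sketch using the ~20 tokens-per-parameter rule of thumb (the helper name is illustrative).

def chinchilla_optimal_tokens(params: float, tokens_per_param: float = 20) -> float:
    return params * tokens_per_param

for name, params, actual_tokens in [
    ("LLaMA 1 7B", 7e9, 1e12),
    ("LLaMA 3 8B", 8e9, 15e12),
]:
    optimal = chinchilla_optimal_tokens(params)
    print(f"{name}: Chinchilla-optimal ~{optimal / 1e9:.0f}B tokens, "
          f"actually trained on {actual_tokens / 1e12:.0f}T ({actual_tokens / optimal:.0f}x over)")
# LLaMA 1 7B: ~140B optimal vs 1T actual (~7x); LLaMA 3 8B: ~160B vs 15T (~94x)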

This strategy has implications for fine-tuning too. A heavily pre-trained base model needs less fine-tuning to reach instruction-following capability, because much of the task knowledge is already in the weights. LLaMA 3's instruct variants required proportionally less RLHF compute than LLaMA 2's because the base was more capable.

The practical lesson: when choosing between a larger model trained on less data and a smaller model trained on more data at similar quality, prefer the smaller model. You will save money on every inference call for the lifetime of the deployment.


Summary

The LLaMA family represents a systematic set of engineering choices optimized for inference efficiency without sacrificing training quality:

  • RoPE: Enables long-context extrapolation through relative position encoding via rotation
  • SwiGLU: Improves quality per parameter through gated multiplicative feature selection
  • RMSNorm: Reduces normalization overhead by 10-15% without quality loss
  • GQA: Cuts KV cache memory by 4-8x through key-value head sharing

Each version of LLaMA built on these foundations: LLaMA 2 added commercial licensing and longer context; LLaMA 3 expanded vocabulary and training data by 7x; LLaMA 3.1 brought 128k context through RoPE scaling; LLaMA 3.3 showed that data quality improvements can close the gap between 70B and 405B.

For practitioners, the decision matrix is straightforward: use LLaMA 3.2 1B/3B for edge deployment, LLaMA 3.1 8B or 3.3 70B for most production workloads, and LLaMA 3.1 405B only when you have verified that smaller models fall short on your specific task. Every other choice comes down to the same question LLaMA was built to answer: what is the minimum inference cost that achieves your quality target?

© 2026 EngineersOfAI. All rights reserved.