LoRA and PEFT - Fine-Tuning at a Fraction of the Cost

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, AI Engineer, Research Engineer

The Real Interview Moment

You're in a system design interview at an AI startup. The interviewer describes their product: "We have a 7B parameter base model and need to create specialized versions for 50 different enterprise customers - each with their own domain data, tone, and compliance requirements. Full fine-tuning each variant would cost us $50K in compute per customer and require 50 separate model copies in production. Walk me through how you'd make this economically viable."

The answer they're looking for is LoRA - but not just "use LoRA." They want you to explain why low-rank adaptation works theoretically, how to choose the rank, which layers to target, how to serve multiple adapters efficiently, and what trade-offs you're making. They want to know if you understand the math behind $W = W_0 + BA$ or if you just copied a tutorial.

This is the PEFT interview. Every AI company building on foundation models needs engineers who can fine-tune efficiently. If you can't explain LoRA, you're missing a core competency for modern AI engineering.

What You Will Master

After reading this page, you will be able to:

Explain the low-rank hypothesis and why it makes efficient fine-tuning possible
Derive the LoRA factorization and explain each component
Choose appropriate rank, target layers, and hyperparameters for LoRA
Compare LoRA with other PEFT methods and explain when to use each
Explain QLoRA and why quantization + LoRA is so effective
Implement LoRA fine-tuning using the PEFT library
Design multi-tenant serving architectures with LoRA adapters
Answer every common interview question about parameter-efficient fine-tuning

Part 1 - The Problem: Full Fine-Tuning Doesn't Scale

The Cost of Full Fine-Tuning

When you fine-tune a model, you update all parameters:

$\theta_\text{new} = \theta_\text{pretrained} - \eta \sum_t \nabla_\theta \mathcal{L}(\theta, x_t, y_t)$

For a 7B parameter model in float16:

Resource	Full Fine-Tuning	Why
Model weights	14 GB	7B x 2 bytes (fp16)
Gradients	14 GB	Same size as weights
Optimizer states (AdamW)	56 GB	Two fp32 moments per parameter (8 bytes) = 4x the fp16 weights
Activations	10-50 GB	Depends on batch size, sequence length
Total GPU memory	~100+ GB	Exceeds a single A100 80GB; needs multi-GPU sharding or offloading
Per-customer copy	14 GB storage	Full separate model per variant

For 50 customers: 50 x 14 GB = 700 GB just for model storage, plus 50 separate fine-tuning runs.
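
The table's static entries can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes 1 GB = 1e9 bytes, fp16 weights and gradients, and two fp32 AdamW moments per parameter; it ignores activations and any fp32 master copy of the weights. The function name is illustrative.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Rough static memory per component, in GB (1 GB = 1e9 bytes)."""
    gb = 1e9
    return {
        "weights_fp16": n_params * 2 / gb,            # 2 bytes per parameter
        "gradients_fp16": n_params * 2 / gb,          # same shape as the weights
        "adamw_moments_fp32": n_params * 2 * 4 / gb,  # two fp32 moments per parameter
    }


mem = full_finetune_memory_gb(7e9)
static_total_gb = sum(mem.values())             # excludes activations
storage_50_gb = 50 * mem["weights_fp16"]        # one full copy per customer

print(mem)
print(f"static total: {static_total_gb:.0f} GB, 50 copies: {storage_50_gb:.0f} GB")
```

Adding the 10-50 GB of activations to the static total is what pushes the run past a single 80 GB GPU.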

Instant Rejection

Never say "LoRA trains fewer parameters so it's faster per step." LoRA's forward pass is actually slightly slower than the base model (it has extra computation from the adapter). The savings come from (1) dramatically less memory for optimizer states and gradients, (2) much smaller adapter checkpoints (~10 MB vs 14 GB), and (3) ability to share the base model across customers.

The Low-Rank Hypothesis

The key insight from the LoRA paper (Hu et al., 2021):

The weight updates during fine-tuning have low intrinsic rank.

When you fine-tune a pre-trained model, you're not changing the weights arbitrarily. The updates $\Delta W = W_\text{fine-tuned} - W_\text{pretrained}$ tend to lie in a low-dimensional subspace.

Why? Pre-training already learns a powerful general-purpose representation. Fine-tuning only needs to "steer" this representation toward a specific task - a much simpler transformation than learning from scratch.

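Empirical evidence: Aghajanyan et al. (2021) showed that fine-tuning pre-trained language models still works when the update is restricted to a very low-dimensional subspace (a low "intrinsic dimension"). Hu et al. then found that very small ranks (r = 1-4) were already competitive on the tasks they tested.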

Figure: Full Fine-Tuning vs LoRA

Part 2 - The LoRA Math

Core Formulation

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$ , LoRA represents the update as:

$W = W_0 + \Delta W = W_0 + BA$

where:

$B \in \mathbb{R}^{d \times r}$ - the "up-projection"
$A \in \mathbb{R}^{r \times k}$ - the "down-projection"
$r \ll \min(d, k)$ - the rank

Forward pass:

$h = W_0 x + \frac{\alpha}{r} B A x$

where $\alpha$ is a scaling factor (the "LoRA alpha").
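
A minimal sketch of this forward pass in plain Python, using tiny hand-picked matrices ($d=3$, $k=2$, $r=1$) so the arithmetic can be checked by hand. The helper names are illustrative, not from any library.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """Compute h = W0 x + (alpha / r) * B (A x) without ever forming BA."""
    base = matvec(W0, x)          # frozen path
    low = matvec(A, x)            # down-projection: k -> r
    delta = matvec(B, low)        # up-projection:   r -> d
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


W0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # d x k, frozen
A = [[0.5, -0.5]]                           # r x k, trainable
B = [[2.0], [0.0], [1.0]]                   # d x r, trainable
x = [2.0, 1.0]

h = lora_forward(W0, A, B, x, alpha=2, r=1)
print(h)  # [4.0, 1.0, 4.0]
```

Computing $B(Ax)$ rather than $(BA)x$ keeps the extra cost proportional to $r(d + k)$ instead of $dk$.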

Parameter Savings

For a single weight matrix $W_0 \in \mathbb{R}^{d \times k}$ :

Method	Parameters	Example ( $d=4096$ , $k=4096$ , $r=16$ )
Full fine-tuning	$d \times k$	16,777,216
LoRA	$d \times r + r \times k$	131,072
Savings	$\frac{d \times r + r \times k}{d \times k} = \frac{r(d + k)}{dk}$ ( $= \frac{2r}{d}$ when $d = k$ )	0.78% of full
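
The table's numbers can be checked directly. A small sketch, with an illustrative function name:

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds for one d x k weight matrix."""
    return d * r + r * k   # B is d x r, A is r x k


d, k, r = 4096, 4096, 16
full = d * k
lora = lora_param_count(d, k, r)
fraction = lora / full

print(full, lora, f"{fraction:.2%}")  # 16777216 131072 0.78%
```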

60-Second Answer

"LoRA factorizes the weight update into two small matrices - a down-projection A that compresses the input to a low-rank space, and an up-projection B that maps back to the original dimension. Since the update has low intrinsic rank, this captures most of the fine-tuning effect with fewer than 1% of the parameters. The key trick is that A is initialized randomly and B is initialized to zero, so BA = 0 at the start: the adapted model is exactly the pretrained model and gradually learns the adaptation."

Initialization

This is a critical detail that interviewers love to ask about:

$A$ is initialized randomly (Gaussian in the paper; the PEFT library defaults to Kaiming uniform)
$B$ is initialized to zero

Why? At the start of training, $\Delta W = BA = 0$ , so the model begins as the exact pre-trained model. This ensures training stability - you don't corrupt the pre-trained representations from step 1.
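
A follow-up interviewers sometimes ask is why only one of the two matrices is zero. The sketch below works through the gradients for a toy squared-error loss $\mathcal{L} = \frac{1}{2}\|h - y\|^2$ (the loss and all values are assumptions for illustration). With $B = 0$, the gradient for $A$ is zero on the first step but the gradient for $B$ is not, so training can start; if both were zero, neither would receive a gradient.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]


def transpose(M):
    return [list(col) for col in zip(*M)]


W0 = [[1.0, 2.0], [0.0, 1.0]]   # frozen, d x k
A = [[0.3, -0.7]]               # r x k, stands in for the random init
B = [[0.0], [0.0]]              # d x r, zero init
x, y = [1.0, 2.0], [0.0, 0.0]
scale = 1.0                     # alpha / r

Ax = matvec(A, x)
h = [b + scale * d for b, d in zip(matvec(W0, x), matvec(B, Ax))]
g = [hi - yi for hi, yi in zip(h, y)]          # dL/dh

# dL/dB = scale * g (Ax)^T   and   dL/dA = scale * (B^T g) x^T
grad_B = [[scale * v for v in row] for row in outer(g, Ax)]
grad_A = [[scale * v for v in row] for row in outer(matvec(transpose(B), g), x)]

print("output equals pretrained output:", h == matvec(W0, x))
print("grad_B:", grad_B)
print("grad_A:", grad_A)
```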

The Scaling Factor $\alpha / r$

The update is scaled by $\frac{\alpha}{r}$ :

$h = W_0 x + \frac{\alpha}{r} B A x$

Why divide by $r$ ? When you increase the rank, the magnitude of $BA$ increases (more terms contribute). Dividing by $r$ keeps the update magnitude roughly constant regardless of rank, making it easier to transfer hyperparameters across different rank settings.

Common settings:

$\alpha = r$ (effective scaling = 1) - simplest, works well
$\alpha = 2r$ (effective scaling = 2) - more aggressive adaptation
$\alpha = 16$ , $r = 16$ is a very common default
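
One practical consequence, sketched below: if $\alpha$ is held fixed while the rank changes, the effective scaling changes with it, whereas tying $\alpha$ to $r$ keeps it constant. The function name is illustrative.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Effective multiplier applied to BA."""
    return alpha / r


ranks = (4, 8, 16, 64)
fixed_alpha = {r: lora_scale(16, r) for r in ranks}      # alpha stays at 16
tied_alpha = {r: lora_scale(2 * r, r) for r in ranks}    # alpha = 2r

print(fixed_alpha)  # {4: 4.0, 8: 2.0, 16: 1.0, 64: 0.25}
print(tied_alpha)   # {4: 2.0, 8: 2.0, 16: 2.0, 64: 2.0}
```

When sweeping rank, it is worth deciding up front which of these two conventions the experiment follows.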

Common Trap

Many candidates confuse LoRA alpha ( $\alpha$ ) with the learning rate. They serve different purposes: $\alpha$ scales the magnitude of the LoRA update relative to the pretrained weights, while the learning rate controls the optimizer step size. You typically use a higher learning rate for LoRA parameters (e.g., 1e-4 to 3e-4) compared to full fine-tuning (e.g., 2e-5) because you're only updating a small number of parameters.

Part 3 - Which Layers to Adapt

Layer Selection Strategy

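Not all layers benefit equally from LoRA adaptation. The original paper adapted only the attention projections and found that, for a fixed parameter budget, adapting $W_Q$ and $W_V$ together worked best; later practice extends LoRA to the MLP layers. The table below is a rule-of-thumb summary: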

Layer	LoRA Effectiveness	Reasoning
$W_Q$ (query)	High	Modifies attention patterns
$W_K$ (key)	Medium-High	Affects what tokens attend to
$W_V$ (value)	High	Changes information extraction
$W_O$ (output projection)	Medium	Post-attention transformation
MLP up-projection	Medium-High	Modifies feature computation
MLP down-projection	Medium	Feature combination
Embedding layers	Low	Base vocabulary is sufficient
LayerNorm	Very Low	Already tiny, just tune directly

Typical configurations:

# Conservative - attention only (original LoRA paper)
target_modules = ["q_proj", "v_proj"]

# Standard - all attention projections
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Aggressive - attention + MLP
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "up_proj", "gate_proj", "down_proj"]

Figure: Transformer Block LoRA Target Layers
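
The strings in `target_modules` are matched against the model's module names. The sketch below assumes the behavior for a list of strings is "exact match or the module name ends with the string", and uses Llama-style module names written out by hand for illustration; inspect the actual names of your model before relying on it.

```python
def matches_target(module_name: str, target_modules: list) -> bool:
    """True if the name equals a target or ends with '.<target>'."""
    return any(
        module_name == t or module_name.endswith("." + t)
        for t in target_modules
    )


# Hand-written, Llama-style names for a single layer (illustrative only).
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.input_layernorm",
]

conservative = [n for n in module_names if matches_target(n, ["q_proj", "v_proj"])]
aggressive = [
    n for n in module_names
    if matches_target(n, ["q_proj", "k_proj", "v_proj", "o_proj",
                          "up_proj", "gate_proj", "down_proj"])
]

print(len(conservative), len(aggressive))  # 2 7
```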

Choosing the Rank

Rank ( $r$ )	Parameters (7B model, QV only)	Quality	Use Case
4	~2M (0.03%)	Good for simple tasks	Classification, sentiment
8	~4M (0.06%)	Good general default	Most fine-tuning tasks
16	~8M (0.11%)	Strong	Complex reasoning, code
32	~16M (0.23%)	Near full fine-tuning quality	Domain-heavy adaptation
64	~32M (0.46%)	Diminishing returns	Rarely needed
256	~130M (1.9%)	Approaching full fine-tuning	Almost never used

Company Variation

OpenAI: The fine-tuning API does not publicly document its training method or rank; claims that it uses LoRA with r=8-16 are speculation.
Meta: Llama models are commonly fine-tuned with LoRA. r=8 for most tasks and r=64 for complex instruction following are community rules of thumb rather than official guidance.
Google: Gemma fine-tuning examples commonly use small ranks (roughly 4-16). Layer selection often matters more than rank.
Startups: Often use r=16 with all linear layers as a robust default.

Part 4 - Comparison with Other PEFT Methods

The PEFT Landscape

Figure: PEFT Landscape

Detailed Comparison

Method	How It Works	Trainable Params	Inference Latency	Merging	Quality
Full Fine-Tuning	Update all parameters	100%	Baseline	N/A	Best
LoRA	Low-rank weight updates	0.1-1%	No overhead (merge)	Yes ✅	Near full FT
Adapters	Insert small bottleneck layers	0.5-5%	+5-10% overhead	No ❌	Good
Prefix Tuning	Prepend learnable tokens to keys/values	0.1-1%	+2-5% overhead	No ❌	Moderate
Prompt Tuning	Prepend learnable embeddings to input	<0.1%	Minimal	No ❌	Lower (small models)
BitFit	Only tune bias terms	0.1%	No overhead	Yes ✅	Lower
QLoRA	LoRA + 4-bit quantization	0.1-1%	Minimal	Partial	Near LoRA
DoRA	Decompose weight into magnitude + direction	0.1-1%	No overhead (merge)	Yes ✅	Slightly > LoRA

Why LoRA Won

LoRA has three critical advantages that made it the dominant PEFT method:

No inference latency: After training, merge $\Delta W = BA$ into $W_0$ , producing a standard model with zero additional computation
Composability: Multiple LoRA adapters can be combined, swapped at serving time, or interpolated
No architecture changes: Unlike adapters, LoRA doesn't add new layers, simplifying deployment

60-Second Answer

"LoRA dominates PEFT because of three properties: First, you can merge the adapter into the base weights at inference, so there's zero latency cost. Second, LoRA adapters are small (tens of MB) and composable - you can swap them for different tasks on a single base model. Third, it doesn't change the model architecture, so all existing inference infrastructure works unchanged. By contrast, bottleneck adapter layers add latency, prefix tuning consumes context length, and prompt tuning doesn't work well for smaller models."

Part 5 - QLoRA: The Memory Breakthrough

The Problem QLoRA Solves

Even with LoRA, you still need to load the base model in fp16 - 14 GB for a 7B model. QLoRA (Dettmers et al., 2023) solves this by quantizing the base model to 4-bit:

Setup	Memory for 7B Model
Full fine-tuning (fp16)	~100 GB
LoRA (fp16 base)	~16 GB
QLoRA (4-bit base + LoRA)	~6 GB

How QLoRA Works

NF4 quantization: Quantize base model to 4-bit using NormalFloat4 - a data type designed for normally distributed weights
Double quantization: Quantize the quantization constants themselves (saves ~0.37 bits/param)
Paged optimizers: Use CPU memory as overflow for optimizer states using unified memory
LoRA in fp16/bf16: The LoRA adapters (B, A) are kept in higher precision
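
The "~0.37 bits/param" figure for double quantization follows from the block sizes reported in the QLoRA paper: one fp32 constant per 64 weights without double quantization, versus 8-bit constants per 64 weights plus one fp32 second-level constant per 256 first-level constants. A short sketch of that arithmetic (variable names are illustrative):

```python
BLOCK = 64          # weights per quantization block
SECOND_BLOCK = 256  # first-level constants per second-level block

plain_overhead = 32 / BLOCK                               # fp32 constant per block
double_overhead = 8 / BLOCK + 32 / (BLOCK * SECOND_BLOCK)  # 8-bit constants + fp32 on top
saving = plain_overhead - double_overhead

bits_per_param = 4 + double_overhead
base_weights_gb = 7e9 * bits_per_param / 8 / 1e9   # 7B model, 1 GB = 1e9 bytes

print(f"overhead: {plain_overhead:.3f} -> {double_overhead:.3f} bits/param")
print(f"saving:   {saving:.3f} bits/param")
print(f"7B base weights: ~{base_weights_gb:.2f} GB")
```

The remaining gap to the ~6 GB in the table comes from the LoRA parameters, their optimizer states, and activations.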

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in bf16
    bnb_4bit_use_double_quant=True,       # Double quantization
)

# Load quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "gate_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable-parameter count: roughly 40M for this config, well under 1% of the 7B total

The Forward Pass with QLoRA

Figure: QLoRA Forward Pass

Common Trap

QLoRA does NOT train the quantized weights. The 4-bit weights are frozen. Only the LoRA adapters (in higher precision) receive gradients. The backward pass computes gradients through the dequantized weights and updates only A and B. This is a common confusion point - quantization is for memory efficiency of the base model, not for the trainable parameters.

Part 6 - Advanced LoRA Techniques

LoRA Merging

After training, merge the adapter into the base weights for zero-overhead inference:

$W_\text{merged} = W_0 + \frac{\alpha}{r} BA$

# Merge LoRA weights into base model
model = model.merge_and_unload()

# Now model is a standard model with no LoRA overhead
# Save as regular model
model.save_pretrained("merged_model")
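
To see why merging costs nothing at inference, the sketch below folds $\frac{\alpha}{r} BA$ into $W_0$ for a toy example and checks that the merged matrix gives the same output as the two-path forward pass. It is plain Python with illustrative helper names, not what `merge_and_unload()` does internally.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def merge(W0, A, B, alpha, r):
    """Return W0 + (alpha / r) * B A as a single dense matrix."""
    scale = alpha / r
    rank = len(A)
    return [
        [W0[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(rank))
         for j in range(len(W0[0]))]
        for i in range(len(W0))
    ]


W0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0.5, -0.5]]
B = [[2.0], [0.0], [1.0]]
x = [2.0, 1.0]
alpha, r = 2, 1

# Unmerged: base path plus scaled adapter path.
unmerged = [b + (alpha / r) * d
            for b, d in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Merged: one ordinary matrix-vector product.
W_merged = merge(W0, A, B, alpha, r)
merged = matvec(W_merged, x)

print(W_merged)          # [[3.0, -2.0], [0.0, 1.0], [2.0, 0.0]]
print(unmerged, merged)  # [4.0, 1.0, 4.0] [4.0, 1.0, 4.0]
```

The trade-off is that a merged model is tied to one adapter, so merging suits single-tenant deployment rather than per-request adapter swapping.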

Multi-Adapter Serving

Serve multiple customers with a single base model:

Figure: Multi-Adapter Serving Architecture
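
The core idea can be sketched in a few lines: every request in a batch shares the same $W_0 x$ computation, and only the small adapter term differs per customer. The loop below shows the math only; the customer names and data layout are hypothetical, and a production server would run the shared part as one batched matmul with specialized kernels for the adapter part.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def multi_adapter_forward(W0, adapters, batch):
    """batch: list of (adapter_id, x); adapters: id -> (A, B, scale)."""
    outputs = []
    for adapter_id, x in batch:
        base = matvec(W0, x)                     # shared by every customer
        A, B, scale = adapters[adapter_id]       # per-customer low-rank pair
        delta = matvec(B, matvec(A, x))
        outputs.append([b + scale * d for b, d in zip(base, delta)])
    return outputs


W0 = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "acme":   ([[1.0, 0.0]], [[1.0], [0.0]], 1.0),   # hypothetical customer
    "globex": ([[0.0, 1.0]], [[0.0], [1.0]], 0.5),   # hypothetical customer
}
batch = [("acme", [1.0, 2.0]), ("globex", [1.0, 2.0])]

outputs = multi_adapter_forward(W0, adapters, batch)
print(outputs)  # [[2.0, 2.0], [1.0, 3.0]]
```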

LoRA Composition and Arithmetic

You can combine multiple LoRA adapters:

$W = W_0 + \lambda_1 B_1 A_1 + \lambda_2 B_2 A_2$

This enables:

Task arithmetic: Add writing style adapter + domain knowledge adapter
Interpolation: $\lambda \cdot \Delta W_\text{task1} + (1-\lambda) \cdot \Delta W_\text{task2}$
Negation: Subtract an adapter to remove unwanted behavior
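
A toy sketch of these three operations on explicit $\Delta W$ matrices. The "style" and "domain" adapters are made-up values; in practice, linear arithmetic on adapters is a heuristic and the combined behavior should be evaluated rather than assumed.

```python
def low_rank_delta(A, B):
    """Dense delta W = B A for a d x r matrix B and an r x k matrix A."""
    rank = len(A)
    return [
        [sum(B[i][t] * A[t][j] for t in range(rank)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


def compose(W0, weighted_deltas):
    """Return W0 + sum(lambda_i * delta_i)."""
    W = [row[:] for row in W0]
    for lam, delta in weighted_deltas:
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                W[i][j] += lam * v
    return W


W0 = [[1.0, 1.0], [1.0, 1.0]]
style = low_rank_delta([[1.0, 0.0]], [[1.0], [0.0]])    # [[1, 0], [0, 0]]
domain = low_rank_delta([[0.0, 1.0]], [[0.0], [1.0]])   # [[0, 0], [0, 1]]

blended = compose(W0, [(0.5, style), (0.5, domain)])    # interpolation
negated = compose(W0, [(-1.0, style)])                  # subtract an adapter

print(blended)  # [[1.5, 1.0], [1.0, 1.5]]
print(negated)  # [[0.0, 1.0], [1.0, 1.0]]
```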

DoRA: Weight-Decomposed LoRA

DoRA (2024) decomposes each pretrained weight matrix into a magnitude and a direction, then applies the LoRA update to the direction:

$W = m \cdot \frac{W_0 + BA}{\|W_0 + BA\|_c}$

where $m$ is a learnable magnitude vector and $\|\cdot\|_c$ is the column-wise norm. This achieves slightly better performance than LoRA at the same rank.
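
A small sketch of the formula above, assuming $m$ is initialized to the column norms of $W_0$ and $B$ to zero, so that the reparameterized weight starts out equal to $W_0$. Scaling $m$ afterwards changes column magnitudes without changing their direction. Helper names are illustrative.

```python
import math


def column_norms(W):
    """Euclidean norm of each column of W."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]


def dora_weight(W0, A, B, m):
    """W = m * (W0 + BA) / ||W0 + BA||_c, with the norm taken per column."""
    rank = len(A)
    V = [
        [W0[i][j] + sum(B[i][t] * A[t][j] for t in range(rank))
         for j in range(len(W0[0]))]
        for i in range(len(W0))
    ]
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(norms))] for i in range(len(V))]


W0 = [[3.0, 0.0], [4.0, 2.0]]
A = [[0.1, -0.2]]
B = [[0.0], [0.0]]            # zero init: direction starts at W0
m = column_norms(W0)          # [5.0, 2.0]

W_init = dora_weight(W0, A, B, m)
W_scaled = dora_weight(W0, A, B, [2 * v for v in m])

print(W_init)                  # matches W0
print(column_norms(W_scaled))  # [10.0, 4.0]
```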

Part 7 - Practical Implementation Guide

Complete Fine-Tuning Script

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# 1. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# 2. Quantization config (for QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# 3. Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)

# 4. LoRA config
lora_config = LoraConfig(
    r=16,                    # Rank - start here, increase if quality is insufficient
    lora_alpha=32,           # Alpha = 2 * r is a good default
    target_modules=[         # All linear layers for best quality
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
    lora_dropout=0.05,       # Small dropout for regularization
    bias="none",             # Don't train biases
    task_type="CAUSAL_LM",
)

# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 6. Training arguments
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    learning_rate=2e-4,             # Higher LR for LoRA
    warmup_steps=100,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    optim="paged_adamw_8bit",       # Memory-efficient optimizer
    gradient_checkpointing=True,     # Trade compute for memory
)

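# 7. Load dataset and train
# "your_dataset" is a placeholder; SFTTrainer expects a text column
# (by default "text") or a formatting function
dataset = load_dataset("your_dataset")

# Note: these keyword arguments follow older TRL releases. Newer releases move
# max_seq_length into SFTConfig and rename tokenizer to processing_class,
# so check the version you have installed.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    max_seq_length=2048,
)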

trainer.train()

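# 8. Save only the adapter weights (~40M parameters here, so tens to low
#    hundreds of MB depending on dtype, versus ~14 GB for the full model)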
model.save_pretrained("./lora-adapter")

Hyperparameter Tuning Guide

Hyperparameter	Start	Range	Impact
Rank (r)	16	4-64	Higher = more capacity, more memory
Alpha (α)	2 * r	r to 4*r	Higher = stronger adaptation signal
Learning rate	2e-4	1e-4 to 5e-4	Higher than full FT because fewer params
Dropout	0.05	0-0.1	Higher if overfitting on small datasets
Target modules	All linear	QV only to all linear	More modules = better quality, more params
Epochs	3	1-5	Small datasets overfit within a few epochs; watch validation loss

Part 8 - When LoRA vs Fine-Tuning vs Prompting

Decision Framework

Figure: LoRA Decision Framework

Comparison Table

Factor	Prompt Engineering	LoRA	Full Fine-Tuning
Training cost	$0	$10-100	$1K-100K+
Training time	0	Hours	Days-weeks
Data needed	0-10 examples	100-10K examples	1K-100K+ examples
Quality ceiling	Limited by model	Near full FT	Highest
New knowledge	No	Limited	Yes
Serving cost	Higher (long prompts)	Same as base	Same as base
Iteration speed	Minutes	Hours	Days
Multi-tenant	Easy (different prompts)	Easy (swap adapters)	Hard (separate models)

Part 9 - Practice Problems

Problem 1: LoRA Parameter Calculation

A LLaMA-7B model has 32 transformer layers. Each layer has Q, K, V, O projections (all 4096 x 4096) and an MLP with up (4096 x 11008), gate (4096 x 11008), and down (11008 x 4096) projections. Calculate the total trainable parameters when applying LoRA with rank 16 to all linear layers.

Hint 1 - Direction

For each weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$ , LoRA adds $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$ , so the parameter count is $r \times (d_{in} + d_{out})$ .

Full Answer

Per layer, attention projections (Q, K, V, O):

Each: $r \times (4096 + 4096) = 16 \times 8192 = 131,072$
4 projections: $4 \times 131,072 = 524,288$

Per layer, MLP projections (up, gate, down):

Up/Gate: $r \times (4096 + 11008) = 16 \times 15104 = 241,664$ each
Down: $r \times (11008 + 4096) = 16 \times 15104 = 241,664$
3 projections: $3 \times 241,664 = 724,992$

Per layer total: $524,288 + 724,992 = 1,249,280$

All 32 layers: $32 \times 1,249,280 = 39,976,960 \approx 40M$ parameters

As percentage of 7B: $40M / 7000M \approx 0.57\%$
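
The same calculation as a short script, using the shapes given in the problem statement. Shapes are written as (d_out, d_in); the dictionary layout is just one convenient way to organize it.

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


r, n_layers = 16, 32
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "up_proj": (11008, 4096),
    "gate_proj": (11008, 4096),
    "down_proj": (4096, 11008),
}

per_layer = sum(lora_params(d_out, d_in, r) for d_out, d_in in shapes.values())
total = n_layers * per_layer
percent_of_7b = 100 * total / 7e9

print(per_layer)                 # 1249280
print(total)                     # 39976960
print(f"{percent_of_7b:.2f}%")   # 0.57%
```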

Problem 2: Rank Selection

Your team fine-tuned a 13B model with LoRA rank 4 on a medical QA dataset (5000 examples). The model performs well on general medical questions but fails on rare diseases. Should you increase the rank, add more data, or try a different approach? Justify with theory.

Hint 1 - Direction

Think about what the rank limits: the expressiveness of the weight update. Rare diseases represent tail knowledge that may not be in the pre-training data at all.

Full Answer + Rubric

Analysis: The issue is likely not rank. Rank controls the expressiveness of the adaptation, not the knowledge of the model. If the base model doesn't know about rare diseases, no amount of LoRA adaptation will surface that knowledge - you can't extract information that isn't there.

Recommended approach (in order):

Add more data - Curate examples specifically about rare diseases. LoRA with rank 4 on more targeted data will outperform rank 64 on the same 5000 generic examples.
Continued pre-training + LoRA - First, do continued pre-training on medical literature (including rare disease content) to inject knowledge, then apply LoRA for task adaptation.
RAG - Retrieve rare disease information from a medical knowledge base at inference time. This doesn't require fine-tuning at all.
Only then increase rank - If the model has the knowledge but can't express it, increase rank to 16-32.

Scoring:

Strong Hire: Distinguishes between knowledge (data) and expressiveness (rank), suggests data-first and RAG approaches
Lean Hire: Suggests increasing rank but also mentions data quality
No Hire: Just says "increase the rank to 64"

Problem 3: Multi-Tenant Architecture

Design a serving system for a single 70B base model with 100 LoRA adapters (one per customer). Requirements: <100ms latency, ability to add new customers in minutes, and cost-effective GPU usage.

Hint 1 - Direction

Think about how LoRA adapters interact with batched inference. Can you batch requests across different adapters?

Full Answer + Rubric

Architecture:

Base model: Load once in GPU memory (~140 GB in fp16, which needs multiple GPUs, or ~35 GB in 4-bit). Shared across all requests.
Adapter storage: Store all 100 adapters on fast storage (SSD). Each is ~50 MB. Total: 5 GB - trivially small.
Hot adapter cache: Keep the 20 most active adapters in GPU memory. Each adapter adds ~50 MB, so 20 adapters = 1 GB.
Request routing: Route incoming requests by customer_id to the appropriate adapter. Use an adapter cache with LRU eviction.
Batching strategy: Requests with the same adapter can be batched normally. For mixed-adapter batching, use frameworks like S-LoRA or Punica that support batched LoRA inference by:
- Computing $W_0 x$ for the full batch (shared)
- Computing adapter-specific $BAx$ with custom CUDA kernels that handle different adapters per sequence
Scaling: To meet the <100ms latency target, use tensor parallelism across 2-4 GPUs. For more customers, replicate the base model.
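
The hot adapter cache with LRU eviction can be sketched with the standard library. This is a minimal illustration under stated assumptions: `AdapterCache` and `fake_load` are hypothetical names, the loader stands in for reading adapter weights from SSD onto the GPU, and a real server would also need locking and would avoid evicting adapters that in-flight requests are using.

```python
from collections import OrderedDict


class AdapterCache:
    """Keeps at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # called on a cache miss
        self._cache = OrderedDict()     # customer_id -> adapter, oldest first
        self.misses = 0

    def get(self, customer_id):
        if customer_id in self._cache:
            self._cache.move_to_end(customer_id)   # mark as most recently used
            return self._cache[customer_id]
        self.misses += 1
        adapter = self.load_fn(customer_id)
        self._cache[customer_id] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)        # evict least recently used
        return adapter

    def resident(self):
        return list(self._cache)


def fake_load(customer_id):
    # Placeholder for loading ~50 MB of adapter weights from fast storage.
    return {"customer": customer_id, "size_mb": 50}


cache = AdapterCache(capacity=2, load_fn=fake_load)
for cid in ["a", "b", "a", "c", "b"]:
    cache.get(cid)

print(cache.misses)      # 4
print(cache.resident())  # ['c', 'b']
```

Adding a new customer then only requires making the adapter file visible to the loader; nothing about the base model changes.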

Scoring:

Strong Hire: Describes shared base model, adapter caching, mixed-adapter batching (S-LoRA/Punica), and concrete memory calculations
Lean Hire: Understands shared base model with swappable adapters but doesn't address batching across adapters
No Hire: Suggests loading separate model copies per customer

Part 10 - The Paper in Context

Key Contributions of the LoRA Paper

Low-rank hypothesis for fine-tuning updates - theoretically motivated and empirically validated
Zero initialization via $B=0$ - ensures training stability
No inference latency - merging the adapter into the base weights sets LoRA apart from adapter layers and prefix/prompt tuning
Composability - multiple adapters on one base model

Impact on the Field

Made fine-tuning accessible to individuals and small teams (QLoRA on a single GPU)
Enabled multi-tenant LLM serving for enterprise
Spawned a family of methods: QLoRA, LoRA+, DoRA, AdaLoRA, LoRA-FA
Became the standard fine-tuning approach for the open-source LLM ecosystem

Timeline

Figure: PEFT Evolution Timeline

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Explain LoRA"	Full FT cost → low-rank hypothesis → $W = W_0 + BA$ → init → merging	"LoRA exploits the fact that fine-tuning updates have low intrinsic rank, so we factorize the update into two small matrices"
"Why initialize B to zero?"	Start with zero update → training stability → gradual adaptation	"B=0 means the model starts as the exact pretrained model. We don't corrupt learned representations from step 1."
"LoRA vs full fine-tuning?"	Parameter count → memory → quality comparison → when each wins	"LoRA typically gets close to full FT quality with <1% of the parameters trainable. Full FT is worth it when you have the compute budget and need maximum quality or substantial new knowledge."
"What is QLoRA?"	4-bit base model → NF4 quantization → LoRA in higher precision	"QLoRA quantizes the frozen base model to 4-bit, cutting base-weight memory roughly 4x versus fp16, while keeping LoRA adapters in bf16 for training quality."
"How to choose rank?"	Start r=16 → tune based on task complexity → diminishing returns	"I start with r=16 as a default. For simple classification tasks, r=4-8 suffices. Complex reasoning might need r=32-64."
"Multi-tenant serving?"	Shared base model → adapter cache → S-LoRA batching	"One base model in memory, swap LoRA adapters per request. Use S-LoRA for mixed-adapter batching."

Spaced Repetition Checkpoints

Day 0: Read this page. Write the LoRA equation from memory. Explain why B is initialized to zero.
Day 3: Calculate LoRA parameters for a given model. Explain QLoRA's three innovations without looking.
Day 7: Compare LoRA vs adapters vs prefix tuning. Give three reasons LoRA won.
Day 14: Design a multi-tenant LoRA serving architecture from memory. Include memory calculations.
Day 21: Solve all three practice problems. Time yourself - 5-8 minutes each.

Next Steps

Continue to RLHF Papers to understand how LoRA-adapted models are aligned using RLHF
Review Adam Optimizer for understanding the optimizer used during LoRA training
For system design with LoRA, see ML System Design
