LoRA and QLoRA: fine-tune 70B models on a single GPU by freezing the base model and training only small low-rank adapter matrices - the technique that democratized LLM customization.


LoRA: Fine-Tuning 70B Models on a Single Consumer GPU

It's Q3 2023. A three-person AI team at a legal tech startup has just gotten access to a Llama-2-70B model. Their product manager is ecstatic - this is the model they've been waiting for to power their contract analysis feature. The engineering lead opens a cost calculator. Full fine-tuning of 70B parameters in BF16 requires 140 GB of GPU memory just for weights, plus another 140 GB for gradients, plus at least 280 GB for Adam optimizer states (double that if the states are kept in FP32), plus activations - well over 500 GB total. That's at least seven A100 80GB GPUs, and realistically more once activations are counted. The training run would cost around $50,000 on cloud compute, take two weeks, and require infrastructure they didn't have. The PM's enthusiasm fades as the engineering lead explains the numbers.

Then someone on the team reads the QLoRA paper. Tim Dettmers and colleagues had just shown that a 65B-parameter model could be fine-tuned on a single 48 GB GPU, which put a 70B model within reach of a single A100 80GB. The core technique: freeze the base model entirely (loaded in 4-bit NF4, requiring ~35 GB), and instead of updating the full weight matrices, train tiny "adapter" matrices of rank 16. Where full fine-tuning trains 70 billion parameters, QLoRA trains roughly 80 million - an 875× reduction in trainable parameters. The optimizer states shrink proportionally. The startup ran their first fine-tuning experiment that afternoon.

Six hours later, they had a Llama-2-70B model that had learned to structure contract analysis outputs in their exact format, identify jurisdiction-specific clauses, and flag risky indemnification language - all from 3,000 labeled examples they had curated in-house. The adapter checkpoint weighed 140 MB. This is the story of how LoRA democratized LLM customization, and why understanding it at the implementation level is now table stakes for AI engineers.

The Mathematical Foundation: Why Low Rank Works

Full fine-tuning updates every weight matrix $W \in \mathbb{R}^{d \times k}$ directly:

$W' = W + \Delta W$

The low intrinsic rank hypothesis (Aghajanyan et al., 2020) states that during task-specific fine-tuning, $\Delta W$ has low effective rank. The model doesn't need to change in all $d \times k$ dimensions - only along a small number of important directions. LoRA exploits this by decomposing the update:

$\Delta W = BA \quad \text{where } B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$

The adapted forward pass becomes:

$h = Wx + \frac{\alpha}{r} BAx$

The $\frac{\alpha}{r}$ scaling is meant to keep the update magnitude roughly comparable across rank choices - you can increase $r$ without automatically increasing the update scale, which reduces the need to retune the learning rate each time the rank changes.

Initialization is critical:

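$A$ initialized with small random nonzero values to break symmetry (Gaussian in the original paper; the implementation below uses Kaiming uniform, the default in the PEFT library)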
$B$ initialized to zero - so $\Delta W = BA = 0$ at the start. The model is unchanged at initialization; adaptation happens through training.
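The effect of the zero initialization can be checked numerically. The following is a minimal pure-Python sketch, not the implementation used later in this lesson: the helper names `matvec` and `lora_forward` and the toy dimensions are assumptions chosen for illustration. It evaluates $h = Wx + \frac{\alpha}{r} BAx$ with $B = 0$ and confirms the output equals $Wx$.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x), computed without forming B @ A."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # (d x r) applied to ((r x k) applied to x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, k, r, alpha = 6, 5, 2, 4.0  # toy sizes, hypothetical
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # small random values
B = [[0.0] * r for _ in range(d)]                                   # zeros
x = [random.gauss(0, 1) for _ in range(k)]

h_base = matvec(W, x)
h_lora = lora_forward(W, A, B, x, alpha, r)

# With B = 0 the adapter contributes nothing, so both outputs agree.
unchanged_at_init = all(abs(a - b) < 1e-12 for a, b in zip(h_base, h_lora))
print("model unchanged at init:", unchanged_at_init)
```

Initializing both matrices to zero would also leave the model unchanged, but then the gradient of both $A$ and $B$ would be zero and training could not start; keeping $A$ nonzero lets gradients reach $B$ from the first step.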

For a 7B model layer with $d = k = 4096$ and rank $r = 16$ :

Full weight matrix: $4096 \times 4096 = 16.7M$ parameters
LoRA matrices: $(4096 \times 16) + (16 \times 4096) = 131K$ parameters
Compression: 128× fewer trainable parameters per layer
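The per-layer arithmetic above is easy to reproduce for other shapes. This is a small sketch; the function name `lora_param_counts` is hypothetical and not part of any library.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Compare a full d x k weight matrix against rank-r LoRA factors."""
    full = d * k          # every entry of W is trainable in full fine-tuning
    lora = d * r + r * k  # B is d x r, A is r x k
    return {"full": full, "lora": lora, "ratio": full / lora}

counts = lora_param_counts(d=4096, k=4096, r=16)
print(f"full: {counts['full']:,}  lora: {counts['lora']:,}  ratio: {counts['ratio']:.0f}x")
```

For a square matrix the ratio simplifies to $d / (2r)$, so halving the rank doubles the reduction.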

LoRA Architecture and Implementation

import torch
import torch.nn as nn
import math
from typing import Optional


class LoRALinear(nn.Module):
    """
    Linear layer with LoRA adaptation.

    The base weight W is frozen (requires_grad=False).
    Only lora_A and lora_B are trainable.

    At inference after training, call merge_weights() to fold the adapter
    into the base weight - eliminating the inference overhead.
    """

    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 16,
        lora_alpha: float = 32.0,
        lora_dropout: float = 0.05,
        bias: bool = True,
    ):
        super().__init__()

        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.lora_alpha = lora_alpha
        # Scaling: α/r - controls update magnitude independent of rank
        self.scaling = lora_alpha / rank

        # Frozen base weight
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features),
            requires_grad=False,  # FROZEN - never receives gradients
        )
        self.bias = None
        if bias:
            self.bias = nn.Parameter(
                torch.zeros(out_features),
                requires_grad=False,  # Bias also frozen
            )

        # Trainable LoRA matrices
        # A: initialized with Kaiming uniform (small non-zero for gradient flow)
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        # B: initialized to zero (ensures ΔW = 0 at start, model unchanged)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        self.lora_dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0 else nn.Identity()

        # Standard Kaiming initialization for A
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

        # Track whether weights have been merged
        self._merged = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base computation (frozen)
        result = nn.functional.linear(x, self.weight, self.bias)

        if not self._merged:
            # LoRA adaptation: x → dropout → A → B → scale
            lora_x = self.lora_dropout(x)
            # Shape: (batch, seq, in) → (batch, seq, rank) → (batch, seq, out)
            lora_update = lora_x @ self.lora_A.T @ self.lora_B.T
            result = result + lora_update * self.scaling

        return result

    def merge_weights(self) -> None:
        """
        Fold LoRA adaptation into the base weight for zero-overhead inference.

        ΔW = B @ A × scaling → W' = W + ΔW

        After merging, this layer behaves identically to a standard nn.Linear
        with no additional compute cost.
        """
        if self._merged:
            return

        with torch.no_grad():
            # Compute ΔW = B @ A × (α/r)
            delta_w = (self.lora_B @ self.lora_A) * self.scaling
            self.weight.data += delta_w

        self._merged = True
        print(f"  Merged LoRA: ‖ΔW‖ = {delta_w.norm().item():.4f}, "
              f"‖W‖ = {self.weight.norm().item():.4f}")

    def unmerge_weights(self) -> None:
        """Reverse merge - restore adapter-separated state (e.g., for continued training)."""
        if not self._merged:
            return
        with torch.no_grad():
            delta_w = (self.lora_B @ self.lora_A) * self.scaling
            self.weight.data -= delta_w
        self._merged = False

    @property
    def adapter_params(self) -> int:
        return self.lora_A.numel() + self.lora_B.numel()

    @property
    def total_params(self) -> int:
        bias_params = self.bias.numel() if self.bias is not None else 0
        return self.weight.numel() + bias_params

    @property
    def adapter_compression_ratio(self) -> float:
        """Fraction of total parameters that are trainable."""
        return self.adapter_params / self.total_params


def add_lora_to_model(
    model: nn.Module,
    target_modules: list[str],
    rank: int = 16,
    lora_alpha: float = 32.0,
    lora_dropout: float = 0.05,
    verbose: bool = True,
) -> nn.Module:
    """
    Replace target Linear layers with LoRALinear, freeze everything else.

    Args:
        target_modules: Module name substrings to replace (e.g., ["q_proj", "v_proj"])
        rank: LoRA rank r
        lora_alpha: Scaling constant α
        lora_dropout: Dropout probability on the LoRA input path

    Returns:
        Modified model (in-place)
    """
    # Step 1: Freeze all parameters
    for param in model.parameters():
        param.requires_grad = False

    n_replaced = 0
    skipped = []

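    # Step 2: Replace target Linear layers with LoRALinear
    # (snapshot the module list first, since the loop mutates the module tree)
    for name, module in list(model.named_modules()):
        if not isinstance(module, nn.Linear):
            continue
        if not any(target in name for target in target_modules):
            skipped.append(name)  # Linear layer left frozen without an adapter
            continue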

        # Navigate to parent module
        parts = name.split(".")
        parent = model
        for part in parts[:-1]:
            parent = getattr(parent, part)
        attr_name = parts[-1]

        # Build LoRA layer
        lora_layer = LoRALinear(
            in_features=module.in_features,
            out_features=module.out_features,
            rank=rank,
            lora_alpha=lora_alpha,
            lora_dropout=lora_dropout,
            bias=module.bias is not None,
        )

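        # Match the original layer's device and dtype, then copy base weights
        lora_layer.to(device=module.weight.device, dtype=module.weight.dtype)
        with torch.no_grad():
            lora_layer.weight.data.copy_(module.weight.data)
            if module.bias is not None:
                lora_layer.bias.data.copy_(module.bias.data)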

        setattr(parent, attr_name, lora_layer)
        n_replaced += 1

        if verbose:
            print(f"  LoRA: {name} [{module.in_features}×{module.out_features}] "
                  f"r={rank} → {lora_layer.adapter_params:,} trainable params")

    # Count results
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())

    print(f"\nLoRA summary:")
    print(f"  Replaced {n_replaced} layers | {len(skipped)} skipped")
    print(f"  Trainable: {trainable:,} / {total:,} params ({trainable/total*100:.3f}%)")
    print(f"  Memory for optimizer states: ~{trainable * 8 / 1024**3:.2f} GB (Adam, FP32)")

    return model

QLoRA: Fine-Tuning at 4-Bit Precision

QLoRA (Dettmers et al., 2023) stacks three techniques to achieve the memory efficiency needed to fine-tune 70B models on a single GPU:

NF4 quantization for the frozen base model (4 bits per parameter)
Double quantization to compress the quantization scale constants themselves
Paged optimizer to handle occasional memory spikes by offloading optimizer states to CPU RAM
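To see why double quantization is worth the trouble, it helps to count the storage spent on quantization constants. The sketch below assumes the block sizes described in the QLoRA paper (one scale per 64 weights, and second-level groups of 256 scales); the function name `constant_overhead_bits` is hypothetical.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Bits of quantization-constant storage per model parameter."""
    if not double_quant:
        return 32 / block_size  # one FP32 scale per block of weights
    # 8-bit first-level scales, plus one FP32 second-level scale per group of them
    return 8 / block_size + 32 / (block_size * second_block_size)

plain = constant_overhead_bits()
dq = constant_overhead_bits(double_quant=True)

# Savings for a 70B-parameter model, in decimal GB
saved_gb_70b = 70e9 * (plain - dq) / 8 / 1e9
print(f"plain: {plain:.3f} bits/param, double-quantized: {dq:.3f} bits/param")
print(f"saved on 70B: {saved_gb_70b:.2f} GB")
```

Under these assumptions the constants shrink from about half a bit per parameter to roughly an eighth of a bit, which adds up to a few gigabytes at 70B scale.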

from typing import Optional

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training


def load_model_for_qlora(
    model_name: str,
    rank: int = 16,
    lora_alpha: float = 32.0,
    lora_dropout: float = 0.05,
    target_modules: Optional[list[str]] = None,
    use_double_quant: bool = True,
) -> tuple:
    """
    Load a model with full QLoRA configuration.

    Memory breakdown for Llama-2-70B (rough estimates):
      - Base model weights in NF4: ~35 GB
      - Quantization constants: ~1 GB with double quantization (~4 GB without)
      - LoRA adapters (r=16, 7 module types × 80 layers, ~207M params): ~0.4 GB in BF16
      - Optimizer states (paged AdamW in FP32, adapters only): ~1.7 GB
      - Activations with gradient checkpointing: ~5-8 GB
      Total: ~43-46 GB - fits on A100 80GB

    Args:
        target_modules: Which linear layers get LoRA adapters.
                        None = use all standard projection layers.
    """
    if target_modules is None:
        # For Llama-style models: all projection layers
        target_modules = [
            "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
            "gate_proj", "up_proj", "down_proj",       # MLP (SwiGLU FFN)
        ]

    # 4-bit NF4 quantization configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat4: optimal for normal distributions
        bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 during forward pass
        bnb_4bit_use_double_quant=use_double_quant,  # Quantize scale constants too
    )

    # Load base model in 4-bit
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # Distribute across available GPUs automatically
        trust_remote_code=False,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Critical: pad_token must be set for batch training
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = model.config.eos_token_id

    # Prepare for k-bit training:
    # - Freezes the base model parameters
    # - Enables gradient checkpointing for activation memory savings
    # - Upcasts the remaining non-quantized parameters (e.g., norm layers) to FP32
    #   for stable training (exact behavior varies across PEFT versions)
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True,
    )

    # LoRA configuration
    lora_config = LoraConfig(
        r=rank,
        lora_alpha=lora_alpha,
        target_modules=target_modules,
        lora_dropout=lora_dropout,
        bias="none",             # Don't train bias - rarely helps, adds parameters
        task_type=TaskType.CAUSAL_LM,
        # inference_mode=False is the default during training
    )

    # Inject LoRA adapters into the model
    model = get_peft_model(model, lora_config)

    # Print parameter summary
    model.print_trainable_parameters()
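    # Prints trainable vs. total parameter counts. For Llama-2-7B with r=16 on all
    # seven projection modules, expect roughly 40M trainable parameters. Depending
    # on the PEFT version, the reported total may look smaller than 7B because
    # 4-bit layers pack two weights into each stored byte.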

    return model, tokenizer


def compute_qlora_memory_requirements(
    model_params_b: float,
    rank: int = 16,
    n_layers: int = 32,
    n_lora_modules_per_layer: int = 7,
    hidden_dim: int = 4096,
    seq_len: int = 2048,
    batch_size: int = 4,
) -> dict:
    """
    Estimate GPU memory requirements for QLoRA training.

    Args:
        model_params_b: Total model parameters in billions (e.g., 7.0 for 7B)
        rank: LoRA rank
        n_layers: Number of transformer layers
        n_lora_modules_per_layer: Number of linear modules per layer with adapters
        hidden_dim: Model hidden dimension
        seq_len: Training sequence length
        batch_size: Per-device batch size
    """
    import math

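    # Base model: NF4 weights (4 bits per param) plus plain block-wise quantization
    # constants (one FP32 scale per 64-weight block = 0.5 bits per param)
    base_model_gb = model_params_b * 1e9 * (4 + 0.5) / 8 / 1024**3

    # Double quantization shrinks the constants from ~0.5 to ~0.127 bits per param,
    # saving ~0.373 bits per param (Dettmers et al., 2023)
    double_quant_savings_gb = model_params_b * 1e9 * 0.373 / 8 / 1024**3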

    # LoRA adapter parameters (A and B matrices), stored in BF16 (2 bytes)
    # Each module adds: (r × d_in) + (d_out × r) params
    # Approximate: 2 × rank × hidden_dim per module
    lora_params = n_layers * n_lora_modules_per_layer * 2 * rank * hidden_dim
    lora_gb = lora_params * 2 / 1024**3  # BF16

    # Optimizer states: AdamW keeps two moment estimates per trainable param.
    # paged_adamw_32bit stores both in FP32 → 8 bytes per LoRA param
    # (8-bit optimizer variants would need roughly 2 bytes per param instead)
    optimizer_gb = lora_params * 8 / 1024**3

    # Gradients: only for LoRA params (BF16 = 2 bytes)
    gradient_gb = lora_params * 2 / 1024**3

    # Activations with gradient checkpointing:
    # Instead of storing all N layer activations, store only √N checkpoints
    # Each checkpoint: (batch × seq_len × hidden_dim × 2 bytes)
    n_checkpoints = math.ceil(math.sqrt(n_layers))
    activation_gb = (
        n_checkpoints * batch_size * seq_len * hidden_dim * 2 / 1024**3
    )

    total_gb = (
        base_model_gb
        - double_quant_savings_gb
        + lora_gb
        + optimizer_gb
        + gradient_gb
        + activation_gb
    )

    # Ordered (upper bound in GB, recommendation) pairs; the first match wins.
    # (A dict keyed by the boolean comparisons would collapse to two entries.)
    gpu_tiers = [
        (16, "RTX 4080 (16 GB)"),
        (24, "RTX 3090/4090 (24 GB)"),
        (48, "A6000 (48 GB) or 2× RTX 4090"),
        (80, "A100 80GB (single GPU)"),
    ]
    gpu_rec = next(
        (label for limit, label in gpu_tiers if total_gb < limit),
        f"Multi-GPU setup needed: {math.ceil(total_gb / 80)} × A100 80GB",
    )

    return {
        "base_model_nf4_gb": round(base_model_gb, 2),
        "double_quant_savings_gb": round(double_quant_savings_gb, 2),
        "lora_adapters_bf16_gb": round(lora_gb, 3),
        "optimizer_states_gb": round(optimizer_gb, 3),
        "gradient_gb": round(gradient_gb, 3),
        "activations_gb": round(activation_gb, 2),
        "total_estimated_gb": round(total_gb, 1),
        "lora_trainable_params": f"{lora_params/1e6:.1f}M",
        "recommended_gpu": gpu_rec,
        "vs_full_finetune_fp16_gb": round(model_params_b * 16, 1),
    }


# Example usage:
# 7B model
result_7b = compute_qlora_memory_requirements(7.0, rank=16, n_layers=32, hidden_dim=4096)
# total: ~4 GB by this estimate, which is a floor (it ignores attention/MLP
# intermediates and CUDA overhead); in practice budget for a 16-24 GB card

# 70B model
result_70b = compute_qlora_memory_requirements(70.0, rank=16, n_layers=80, hidden_dim=8192)
# total: roughly 35-40 GB by this estimate → fits on A100 80GB with headroom
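The estimator above approximates every adapted module as `2 × rank × hidden_dim` parameters, which undercounts the wide MLP projections. A more exact count sums `rank × (d_in + d_out)` per module. The sketch below assumes the published Llama-2-70B layer shapes (hidden size 8192, grouped-query attention with 1024-wide K/V projections, MLP width 28672, 80 layers); the names `lora_params_per_layer` and `LLAMA2_70B_SHAPES` are hypothetical.

```python
def lora_params_per_layer(shapes: dict, rank: int) -> int:
    """Sum rank * (d_in + d_out) over every adapted linear module in one layer."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())

# Assumed (d_in, d_out) for each projection in one Llama-2-70B decoder layer
LLAMA2_70B_SHAPES = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),   # grouped-query attention: narrower K/V
    "v_proj": (8192, 1024),
    "o_proj": (8192, 8192),
    "gate_proj": (8192, 28672),
    "up_proj": (8192, 28672),
    "down_proj": (28672, 8192),
}

n_layers, rank = 80, 16
exact_70b = n_layers * lora_params_per_layer(LLAMA2_70B_SHAPES, rank)
approx_70b = n_layers * 7 * 2 * rank * 8192  # the estimator's shortcut

print(f"exact:  {exact_70b / 1e6:.1f}M trainable params")
print(f"approx: {approx_70b / 1e6:.1f}M trainable params")
print(f"adapter size in BF16: {exact_70b * 2 / 1e9:.2f} GB")
```

The gap matters little for total memory, since adapters are a small slice of the budget either way, but it does change the trainable-parameter count you should expect PEFT to report.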

The QLoRA Training Loop: Production Configuration

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import Dataset
import torch


def create_instruction_dataset(
    examples: list[dict],
    tokenizer,
    system_prompt: str = None,
    max_length: int = 2048,
) -> Dataset:
    """
    Format data for instruction fine-tuning using the ChatML format.

    Input format:
        examples = [{"instruction": "...", "input": "...", "output": "..."}]

    Output: tokenized dataset (input_ids and attention_mask only).
    Note: no loss mask is applied here. With DataCollatorForLanguageModeling the
    loss covers prompt and response tokens alike; to train on the response only,
    prompt positions in the labels must be set to -100 in a separate step.
    """
    formatted_texts = []

    for ex in examples:
        # Build prompt using ChatML format (base models whose tokenizers lack the
        # <|im_start|>/<|im_end|> special tokens will split them into ordinary tokens)
        parts = []

        if system_prompt:
            parts.append(f"<|im_start|>system\n{system_prompt}<|im_end|>")

        user_content = ex["instruction"]
        if ex.get("input"):
            user_content += f"\n\n{ex['input']}"

        parts.append(f"<|im_start|>user\n{user_content}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{ex['output']}<|im_end|>")

        formatted_texts.append("\n".join(parts))

    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            max_length=max_length,
            truncation=True,
            padding=False,  # DataCollator handles padding
            return_overflowing_tokens=False,
        )

    raw_dataset = Dataset.from_dict({"text": formatted_texts})
    return raw_dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"],
        desc="Tokenizing",
    )


def train_qlora(
    model,
    tokenizer,
    train_examples: list[dict],
    eval_examples: list[dict] = None,
    output_dir: str = "./qlora_adapter",
    n_epochs: int = 3,
    per_device_batch_size: int = 2,
    gradient_accumulation_steps: int = 8,  # Effective batch = 2 × 8 = 16
    learning_rate: float = 2e-4,
    max_seq_length: int = 2048,
    system_prompt: str = None,
    warmup_ratio: float = 0.03,
) -> None:
    """
    Full QLoRA training loop with production-ready configuration.

    Key decisions explained:
    - paged_adamw_32bit: Like AdamW but stores optimizer state in managed memory
      that can be paged to CPU when GPU memory is tight - critical for large models
    - bf16=True: BF16 for adapter training (wider dynamic range than FP16,
      less likely to cause gradient overflow)
    - gradient_checkpointing=True: 20-30% slower but 3-5× less activation memory
    - cosine scheduler: Better convergence than linear for LoRA (smooth decay)
    """
    train_dataset = create_instruction_dataset(
        train_examples, tokenizer, system_prompt, max_seq_length
    )
    eval_dataset = None
    if eval_examples:
        eval_dataset = create_instruction_dataset(
            eval_examples, tokenizer, system_prompt, max_seq_length
        )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,  # Causal LM, not masked LM
        pad_to_multiple_of=8,  # Efficient padding for Tensor Cores
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=n_epochs,
        per_device_train_batch_size=per_device_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        warmup_ratio=warmup_ratio,
        # Optimizer: paged_adamw_32bit for memory-efficient training
        # Alternative: adamw_bnb_8bit (even less memory, slight accuracy tradeoff)
        optim="paged_adamw_32bit",
        fp16=False,
        bf16=True,  # LoRA adapters compute in BF16
        # Gradient checkpointing: recompute activations instead of storing all
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},  # Modern, faster variant
        # Logging and saving
        logging_steps=20,
        save_strategy="steps",
        save_steps=200,
        save_total_limit=3,
        # Newer transformers releases rename evaluation_strategy to eval_strategy
        evaluation_strategy="steps" if eval_dataset else "no",
        eval_steps=200 if eval_dataset else None,
        load_best_model_at_end=bool(eval_dataset),
        # Performance
        dataloader_num_workers=4,
        dataloader_pin_memory=True,
        # Reporting - disable wandb unless configured
        report_to="none",
        # Max gradient norm: clip to prevent exploding gradients
        max_grad_norm=0.3,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,  # Newer transformers releases use processing_class= instead
    )

    print(f"Training configuration:")
    print(f"  Effective batch size: {per_device_batch_size * gradient_accumulation_steps}")
    print(f"  Train examples: {len(train_examples)}")
    print(f"  Expected steps: {len(train_examples) * n_epochs // (per_device_batch_size * gradient_accumulation_steps)}")

    trainer.train()

    # Save adapter only (not the 4-bit base model)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    print(f"\nLoRA adapter saved to {output_dir}/")
    print(f"To use: load base model, then PeftModel.from_pretrained(model, '{output_dir}')")
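The training code above computes the loss over the whole sequence, prompt included. Restricting the loss to the response is usually done by setting prompt positions in the labels to -100, the value PyTorch's cross-entropy loss ignores by default. The following is a minimal sketch of that step; `mask_prompt_labels` and the toy token ids are hypothetical, and in a real pipeline `prompt_len` would come from tokenizing the prompt portion on its own.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, hiding the first prompt_len positions from the loss."""
    labels = list(input_ids)
    n = min(prompt_len, len(labels))  # guard against truncated sequences
    labels[:n] = [IGNORE_INDEX] * n
    return labels

# Toy example: 4 prompt tokens followed by 3 response tokens
example_ids = [101, 7, 8, 9, 42, 43, 2]
labels = mask_prompt_labels(example_ids, prompt_len=4)
print(labels)
```

Note that `DataCollatorForLanguageModeling` rebuilds labels from `input_ids`, so using precomputed labels like these would require a collator that preserves and pads an existing `labels` column.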

Merging and Serving LoRA Adapters

After training, you have two deployment options. Merging eliminates inference overhead; keeping adapters separate enables multi-task serving.
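Merging is safe because it is algebraically exact: $(W + \frac{\alpha}{r} BA)x = Wx + \frac{\alpha}{r} B(Ax)$. The pure-Python sketch below checks this on toy matrices; the helper names and sizes are assumptions for illustration and are unrelated to the PEFT API.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) as a new matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

random.seed(1)
d, k, r, alpha = 4, 3, 2, 4.0  # toy sizes, hypothetical
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # nonzero, as after training
x = [random.gauss(0, 1) for _ in range(k)]

# Adapter kept separate: two extra small matrix-vector products per call
unmerged = [wx + (alpha / r) * u
            for wx, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Adapter folded in: a single matrix-vector product
merged = matvec(merge_lora(W, A, B, alpha, r), x)

max_diff = max(abs(a - b) for a, b in zip(unmerged, merged))
print(f"max difference: {max_diff:.2e}")
```

In floating point the two paths agree up to rounding error. In half precision the rounding is coarser, so a merged model can differ very slightly from the adapter-separated one.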

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


def merge_lora_for_deployment(
    base_model_name: str,
    adapter_path: str,
    output_path: str,
    save_dtype: torch.dtype = torch.float16,
) -> None:
    """
    Merge LoRA adapter into base model weights for zero-overhead inference.

    W_merged = W + B @ A × (α/r)

    The merged model is a standard Transformer model - no PEFT dependency at inference.
    Size: same as the original base model (adapter adds only ~100-200 MB then disappears).

    When to merge:
    - Single-task deployment (no need to switch adapters)
    - Serving via vLLM, TGI, or other engines that prefer standard models
    - Distributing the fine-tuned model as a standalone artifact
    """
    print(f"Loading base model: {base_model_name}")
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=save_dtype,
        device_map="auto",
        low_cpu_mem_usage=True,
    )

    print(f"Loading adapter: {adapter_path}")
    model = PeftModel.from_pretrained(base_model, adapter_path)

    print("Merging adapter weights...")
    model = model.merge_and_unload()  # Merges BA into W, removes PEFT overhead

    print(f"Saving merged model to {output_path}...")
    model.save_pretrained(
        output_path,
        safe_serialization=True,  # Save as safetensors (safer + faster loading)
        max_shard_size="4GB",     # Split large models into 4GB shards
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    tokenizer.save_pretrained(output_path)

    total_params = sum(p.numel() for p in model.parameters())
    bytes_per_param = torch.finfo(save_dtype).bits // 8  # 2 for FP16/BF16, 4 for FP32
    size_gb = total_params * bytes_per_param / 1024**3
    print(f"Merged model: {total_params/1e9:.1f}B params, ~{size_gb:.1f} GB")


class MultiAdapterServer:
    """
    Serve multiple LoRA adapters on a single base model instance.

    Memory pattern:
    - Load base model once (e.g., 14 GB for 7B in FP16)
    - Each adapter adds ~100-200 MB
    - 10 adapters = ~15-16 GB total vs 140 GB for 10 separate full models

    Ideal for: serving multiple departments/clients with task-specific adapters.
    """

    def __init__(
        self,
        base_model_name: str,
        adapter_paths: dict[str, str],
        default_adapter: str = None,
        load_in_8bit: bool = False,
    ):
        """
        Args:
            adapter_paths: {"adapter_name": "/path/to/adapter", ...}
            default_adapter: Name of adapter to use when not specified
        """
        print(f"Loading base model: {base_model_name}")
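        # Pass 8-bit loading through BitsAndBytesConfig; the bare load_in_8bit
        # keyword is deprecated in recent transformers releases
        from transformers import BitsAndBytesConfig
        quant_config = BitsAndBytesConfig(load_in_8bit=True) if load_in_8bit else None
        self.base_model = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            quantization_config=quant_config,
            torch_dtype=torch.float16,
            device_map="auto",
        )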
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)

        # Wrap with PEFT for multi-adapter support
        if adapter_paths:
            first_name = next(iter(adapter_paths))
            first_path = adapter_paths[first_name]
            self.model = PeftModel.from_pretrained(
                self.base_model,
                first_path,
                adapter_name=first_name,
            )

            # Load remaining adapters
            for name, path in list(adapter_paths.items())[1:]:
                self.model.load_adapter(path, adapter_name=name)
                print(f"  Loaded adapter: {name} from {path}")
        else:
            self.model = self.base_model

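        # PEFT activates the first adapter it loads; track that, then switch
        # explicitly if a different default was requested
        self.current_adapter = next(iter(adapter_paths)) if adapter_paths else None
        if adapter_paths and default_adapter and default_adapter != self.current_adapter:
            self.switch_adapter(default_adapter)
        self.model.eval()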

    def switch_adapter(self, adapter_name: str) -> None:
        """Hot-swap to a different adapter (cheap: weights are already loaded, only the active adapter changes)."""
        self.model.set_adapter(adapter_name)
        self.current_adapter = adapter_name

    def generate(
        self,
        prompt: str,
        adapter_name: str = None,
        max_new_tokens: int = 512,
        temperature: float = 0.7,
        do_sample: bool = True,
    ) -> str:
        """Generate with optional adapter selection."""
        if adapter_name and adapter_name != self.current_adapter:
            self.switch_adapter(adapter_name)

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            output_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=do_sample,
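                # Fall back to EOS when the tokenizer defines no pad token
                pad_token_id=(
                    self.tokenizer.pad_token_id
                    if self.tokenizer.pad_token_id is not None
                    else self.tokenizer.eos_token_id
                ),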
            )

        generated = output_ids[0][inputs.input_ids.shape[1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)

LoRA Variants: When the Original Isn't Enough

| Method | Key Innovation | Trainable Params | Best Use Case |
| --- | --- | --- | --- |
| LoRA | Fixed low-rank ΔW = BA | r × (d_in + d_out) per layer | General fine-tuning, baseline |
| QLoRA | 4-bit base + LoRA | Same as LoRA | Memory-constrained training |
| AdaLoRA | SVD-based adaptive rank per layer | Varies (budget allocated) | Complex tasks needing uneven capacity |
| DoRA | Decompose W into magnitude + direction, adapt both | LoRA params + one magnitude vector per layer | Better performance at same rank |
| rsLoRA | Scale by 1/√r instead of 1/r | Same as LoRA | Stable training at high ranks (r≥64) |
| LoRA+ | Different learning rates for A and B (higher for B) | Same as LoRA | Small improvement on original |
| LoftQ | Initialize LoRA around quantized base | Same as LoRA | Better accuracy when base is quantized |

# AdaLoRA: Dynamically allocate rank budget across layers
from peft import AdaLoraConfig, get_peft_model, TaskType

adalora_config = AdaLoraConfig(
    init_r=12,           # Starting rank before budget allocation
    target_r=8,          # Target average rank after pruning
    beta1=0.85,          # EMA coefficient for importance tracking
    beta2=0.85,
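    tinit=200,           # Initial warmup steps before rank pruning starts
    tfinal=1000,         # Final fine-tuning steps after the rank budget is reached
    deltaT=10,           # Re-allocate the rank budget every N steps
    # Recent PEFT releases also expect total_step (the total number of training steps)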
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    orth_reg_weight=0.5, # Orthogonality regularization (stabilizes SVD pruning)
    task_type=TaskType.CAUSAL_LM,
)

# DoRA: Weight-decomposed LoRA - decomposes W into magnitude × direction
# direction is adapted with LoRA; magnitude is a separate trainable vector
from peft import LoraConfig

dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    use_dora=True,   # Enables DoRA decomposition
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM,
)

# rsLoRA: Scaled by 1/√r for stable training at high ranks
rslora_config = LoraConfig(
    r=64,            # High rank - rsLoRA makes this stable
    lora_alpha=64,
    use_rslora=True, # Changes scaling from α/r to α/√r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type=TaskType.CAUSAL_LM,
)
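The difference between the two scaling rules is easiest to see by holding alpha fixed and sweeping the rank. A short sketch follows; the helper name `lora_scaling` is hypothetical and simply mirrors the α/r and α/√r formulas quoted in the comments above.

```python
import math

def lora_scaling(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to B @ A: alpha / r for LoRA, alpha / sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

ranks = [8, 16, 64, 256]
fixed_alpha = 32.0

standard = {r: lora_scaling(fixed_alpha, r) for r in ranks}
rank_stabilized = {r: lora_scaling(fixed_alpha, r, use_rslora=True) for r in ranks}

for r in ranks:
    print(f"r={r:>3}  alpha/r={standard[r]:.3f}  alpha/sqrt(r)={rank_stabilized[r]:.3f}")
```

With alpha fixed, the standard multiplier shrinks linearly with rank while the rank-stabilized one shrinks only with its square root, which is why the same alpha behaves very differently under `use_rslora=True`.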

Choosing the Right LoRA Hyperparameters

LORA_CONFIGURATION_GUIDE = {
    "rank_selection": {
        "rule_of_thumb": "Start with r=16; increase if val_loss plateaus early",
        "task_to_rank": {
            "text_classification": "r=4 to r=8",
            "named_entity_recognition": "r=8 to r=16",
            "summarization": "r=8 to r=16",
            "instruction_following": "r=16 to r=32",
            "code_generation": "r=32 to r=64",
            "mathematical_reasoning": "r=64 to r=128",
        },
        "diagnostic": (
            "Compute SVD of LoRA adapters after training: "
            "if singular values drop sharply after top 4, rank was too high. "
            "If they're flat across all r values, rank is too low."
        ),
    },
    "alpha_selection": {
        "common_choices": {
            "alpha = rank": "Scaling = 1.0. Conservative. Training may be slow.",
            "alpha = 2 × rank": "Scaling = 2.0. Most common recommendation. Good balance.",
            "alpha = 4 × rank": "Aggressive. Risk of instability if LR is high.",
        },
        "with_rslora": "alpha = rank or alpha = 2×rank (scaling changes to α/√r)",
    },
    "target_modules": {
        "minimal_memory": ["q_proj", "v_proj"],      # Original LoRA paper (2 of 7 modules)
        "standard": ["q_proj", "k_proj", "v_proj", "o_proj"],     # Attention only
        "comprehensive": ["q_proj", "k_proj", "v_proj", "o_proj",  # Attention + FFN
                          "gate_proj", "up_proj", "down_proj"],
        "recommendation": (
            "Use comprehensive for code/reasoning tasks. "
            "Standard for conversational/instruction tasks. "
            "Minimal only when GPU memory is extremely tight."
        ),
    },
    "learning_rate": {
        "qlora_4bit": "1e-4 to 3e-4 (adapters start near zero, need stronger signal)",
        "lora_fp16": "5e-5 to 1e-4",
        "schedule": "cosine decay with 3-5% warmup",
        "warmup_note": "Warmup prevents early gradient explosion when adapters are near zero",
    },
    "common_mistakes": [
        "No gradient checkpointing → OOM during backward pass",
        "Standard AdamW instead of paged_adamw_32bit → OOM on transient memory spikes that paging would absorb",
        "Not masking prompt tokens in loss → model learns to copy prompts, not respond",
        "lora_dropout > 0.1 for small datasets → too much regularization, underfitting",
        "Training for > 5 epochs on small datasets → catastrophic forgetting of base capabilities",
        "Not evaluating on held-out set → overfitting to training format, not task",
    ],
}
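The SVD diagnostic in the rank-selection entry can be sketched with NumPy. This is a minimal illustration rather than a PEFT API: `adapter_singular_values`, `effective_rank`, and the 1% threshold are hypothetical choices made for the example, and `A` and `B` stand in for one layer's trained adapter matrices.

```python
import numpy as np

def adapter_singular_values(B: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Singular values of the LoRA update delta_W = B @ A, largest first."""
    r = A.shape[0]
    # delta_W has rank at most r, so only the first r values can be non-zero.
    return np.linalg.svd(B @ A, compute_uv=False)[:r]

def effective_rank(s: np.ndarray, rel_tol: float = 0.01) -> int:
    """Count singular values above rel_tol times the largest (assumed threshold)."""
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
d, k, r = 64, 64, 16

# Synthetic adapter that only uses 4 of its 16 available directions.
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
A[4:] = 0.0

s = adapter_singular_values(B, A)
print("singular values:", np.round(s, 3))
print("effective rank:", effective_rank(s), "of", r)
```

An effective rank far below `r` suggests the rank was set too high; an effective rank equal to `r` with a flat spectrum suggests trying a larger rank.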

:::danger Gradient Checkpointing is Mandatory for QLoRA Without gradient checkpointing, the backward pass requires storing activations for ALL transformer layers simultaneously. For a 70B model with seq_len=2048, counting only one hidden-state tensor per layer: ~80 layers × 2048 × 8192 × 2 bytes ≈ 2.7 GB per sample, or ~5.4 GB with batch_size=2. Real usage is several times higher, because each layer also stores attention and MLP intermediates, and all of it sits on top of the 35 GB base model. Gradient checkpointing recomputes activations layer-by-layer during backward, keeping only about √(n_layers) checkpoints active at once. Enable with gradient_checkpointing=True in TrainingArguments AND prepare_model_for_kbit_training(model, use_gradient_checkpointing=True). The cost: 20-30% longer training time. Without it, QLoRA on a 70B model at this sequence length typically runs out of memory on a single 80 GB A100. :::

:::warning LoRA Alpha Scaling Is Commonly Misunderstood The LoRA update is multiplied by $\alpha/r$, so only the ratio matters. Setting lora_alpha = rank gives a scaling factor of $\alpha/r = 1.0$: the update is applied without amplification, which many tutorials recommend for "simplicity." The common mistake is to pick alpha once and then change the rank. If alpha stays at 16 while rank goes from r=16 to r=64, the scaling factor drops from 1.0 to 0.25, so the higher-rank adapter's update is damped by 4× and it can end up behaving no better than the low-rank one. Tie alpha to rank instead: lora_alpha = 2 × rank keeps the scaling factor at 2.0 as you change rank. For rsLoRA, set use_rslora=True, which changes the scaling to $\alpha/\sqrt{r}$ and is designed to stay stable across the full rank range. :::
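To make the scaling factors concrete, here is a small sketch that tabulates $\alpha/r$ and $\alpha/\sqrt{r}$ for a few ranks. The helper `lora_scaling` is hypothetical (written for this example, not a PEFT function); only the two formulas come from the text above.

```python
import math

def lora_scaling(alpha: float, r: int, use_rslora: bool = False) -> float:
    """Multiplier applied to B @ A: alpha/r for LoRA, alpha/sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

for r in (8, 16, 64):
    fixed = lora_scaling(alpha=16, r=r)                 # alpha held at 16
    tied = lora_scaling(alpha=2 * r, r=r)               # alpha = 2 x rank
    rs = lora_scaling(alpha=16, r=r, use_rslora=True)   # rsLoRA, alpha held at 16
    print(f"r={r:>2}  alpha=16: {fixed:.2f}  alpha=2r: {tied:.2f}  rsLoRA: {rs:.2f}")
```

With alpha held fixed, the standard factor shrinks linearly in the rank, while the rsLoRA factor shrinks only with its square root.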

:::tip Use LoftQ for Better QLoRA Initialization Standard QLoRA initializes LoRA adapters with random A and zeros for B, starting from a state where ΔW=0. But the 4-bit quantization has already introduced error into W, so the model starts from a degraded state. LoftQ (Li et al., 2023) initializes the LoRA adapters to compensate for the quantization error: B@A is initialized to approximate W - W_quantized. This gives QLoRA a head start and typically improves final accuracy by 1-3% on reasoning tasks, for free. :::
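The idea of fitting the adapters to the quantization error can be sketched in NumPy. Two loud assumptions: `fake_quantize_4bit` is a plain uniform 16-level quantizer used as a stand-in for NF4, and `residual_init` performs a single SVD step, whereas LoftQ alternates between quantization and low-rank fitting. Both function names are hypothetical.

```python
import numpy as np

def fake_quantize_4bit(W: np.ndarray) -> np.ndarray:
    """Uniform 16-level quantizer: a stand-in for NF4, not the real scheme."""
    lo, hi = W.min(), W.max()
    step = (hi - lo) / 15
    return np.round((W - lo) / step) * step + lo

def residual_init(W: np.ndarray, r: int):
    """One LoftQ-style step: fit B @ A to the quantization error W - Q(W)."""
    Q = fake_quantize_4bit(W)
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    root = np.sqrt(S[:r])
    B = U[:, :r] * root            # shape (d, r)
    A = root[:, None] * Vt[:r]     # shape (r, k)
    return Q, B, A

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))
Q, B, A = residual_init(W, r=16)

err_plain = np.linalg.norm(W - Q)             # standard init: B @ A = 0
err_loftq = np.linalg.norm(W - (Q + B @ A))   # residual-aware init
print(f"error with zero init:     {err_plain:.3f}")
print(f"error with residual init: {err_loftq:.3f}")
```

The rank-r truncated SVD is the best rank-r approximation of the residual in Frobenius norm, so the second error is never larger than the first.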

Interview Questions

Q: Why does LoRA work? What is the "low intrinsic rank" hypothesis?

A: The low intrinsic rank hypothesis (Aghajanyan et al., 2020) states that the weight update $\Delta W$ required for task-specific fine-tuning has low effective dimensionality. Empirically: even for complex tasks, the model only needs to change along a small number of directions in parameter space. A 1B-parameter model fine-tuned for a specific task might only need ~1000 "effective dimensions" of adaptation - vastly fewer than 1 billion.

LoRA operationalizes this: $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d,k)$. The rank $r$ controls how many independent directions of adaptation are allowed. For classification or simple instruction following, $r=4$ to $r=8$ is usually sufficient. For complex reasoning or code, $r=32$ to $r=64$ may be needed. The evidence: LoRA with $r=16$ achieves 95%+ of full fine-tuning performance on most standard benchmarks, suggesting that the full $d \times k$ update matrix was mostly redundant. The remaining 5% lives in very small additional dimensions that require full fine-tuning to capture.
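A short sketch of the parameter arithmetic behind $\Delta W = BA$: a full update needs $d \times k$ values, while the factorized update needs $r(d + k)$. The helper names are hypothetical, and the 8192 × 8192 shape is an assumed square projection at the hidden size used elsewhere in this lesson.

```python
def full_update_params(d: int, k: int) -> int:
    """Trainable values in a dense d x k update."""
    return d * k

def lora_update_params(d: int, k: int, r: int) -> int:
    """Trainable values in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 8192  # assumed projection shape
full = full_update_params(d, k)
for r in (4, 16, 64):
    lora = lora_update_params(d, k, r)
    print(f"r={r:>2}: {lora:>9,} LoRA params vs {full:,} full ({full // lora}x fewer)")
```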

Q: How does QLoRA achieve such dramatic memory reduction compared to full fine-tuning?

A: QLoRA stacks three independent memory reductions:

1. NF4 base model (4× reduction): A 70B model in FP16 requires 140 GB. In NF4 (4 bits), it requires ~35 GB. NF4 is specifically designed for normally-distributed weights: it places quantization levels at equal quantile intervals of a standard Gaussian, minimizing quantization error for normally-distributed values.
2. Tiny trainable parameter set (~875× reduction in optimizer cost): LoRA with r=16 on a 70B model trains ~80M parameters instead of 70B. Adam optimizer states are 8 bytes per trainable parameter: 80M × 8 = 640 MB instead of 70B × 8 = 560 GB. Optimizer states are the dominant cost in full fine-tuning.
3. Gradient checkpointing (4-8× reduction in activation cost): Instead of storing all layer activations for backprop, recompute them during backward. Adds 20-30% training time but reduces activation memory from O(n_layers) to O(√n_layers).

Combined: ~500+ GB for full FP16 fine-tuning → ~40-45 GB for QLoRA. This is the difference between needing a $500,000 GPU cluster and a single $15,000 A100.
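The weight and optimizer figures in this answer can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the helper names are hypothetical, 1 GB is taken as 10^9 bytes, and activations are deliberately left out.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text

def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for model weights at a given precision."""
    return n_params * bits_per_param / 8 / GB

def adam_state_gb(n_trainable: float, bytes_per_param: int = 8) -> float:
    """Two FP32 moment estimates (8 bytes) per trainable parameter."""
    return n_trainable * bytes_per_param / GB

base_params = 70e9      # 70B base model
adapter_params = 80e6   # ~80M LoRA parameters at r=16

print(f"FP16 weights:         {weight_gb(base_params, 16):.1f} GB")
print(f"NF4 weights:          {weight_gb(base_params, 4):.1f} GB")
print(f"Adam states, full FT: {adam_state_gb(base_params):.1f} GB")
print(f"Adam states, LoRA:    {adam_state_gb(adapter_params):.2f} GB")
```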

Q: What are the tradeoffs between merging LoRA adapters at deployment vs. keeping them separate?

A: The decision depends on your serving architecture:

Merged adapters ($W' = W + \frac{\alpha}{r}BA$): Zero inference overhead, because the adapter is folded into the base weight and forward passes are identical to the original model. One model artifact per task. Cannot switch adapters without reloading. Best for: single-task production deployments, models served by vLLM or TGI which expect standard model formats, distributing fine-tuned models as standalone artifacts.

Separate adapters (LoRA overhead at inference): ~5-10% slower inference due to the additional $BAx$ computation per adapted layer. One base model + N small adapter files (each ~100-200 MB) instead of N full models (each 14 GB for 7B). Enables hot-swapping between adapters without reloading the base model - critical for serving many task-specific variants. LoRA merge/interpolation is possible (combine multiple adapters for multi-task). Best for: serving multiple clients or departments with task-specific adapters, A/B testing adapters, rapid adapter iteration.
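The equivalence between the two serving modes can be checked numerically. The sketch below uses random NumPy matrices as stand-ins for one adapted layer; the shapes and the alpha and r values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 32, 48, 8, 16
scale = alpha / r

W = rng.normal(size=(d, k))   # frozen base weight
B = rng.normal(size=(d, r))   # trained adapter factors
A = rng.normal(size=(r, k))
x = rng.normal(size=(k,))     # one input vector

# Separate adapter: extra low-rank path on every forward pass.
y_separate = W @ x + scale * (B @ (A @ x))

# Merged adapter: fold the update into the weight once, then one matmul.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_separate, y_merged))
```

The outputs match up to floating-point error, which is why merging adds no inference cost while the separate path pays for two extra small matmuls per adapted layer.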

Q: How does AdaLoRA differ from standard LoRA, and when does the complexity pay off?

A: Standard LoRA assigns the same rank $r$ to every adapted layer. But different transformer layers contribute differently to task adaptation - early layers often learn syntactic/positional patterns (need less task-specific adaptation), while middle/late layers learn semantic content (need more). Assigning equal rank to all layers wastes parameter budget on layers that don't need it.

AdaLoRA starts with a higher rank for all layers, then uses SVD-based importance scoring to progressively prune singular values during training. Layers with redundant adapter dimensions have their rank decreased; layers where all singular values are significant keep higher rank. A global rank budget is maintained across all layers.

When it pays off: (1) your standard LoRA model underperforms despite using r=32 or higher - suggesting uneven capacity needs, (2) you have a strict parameter budget and need to maximize accuracy within it. The downside: AdaLoRA introduces four additional hyperparameters (init_r, target_r, tinit, tfinal) that each require tuning. For most practical fine-tuning tasks, standard LoRA with a well-chosen rank (r=16 to r=32) performs within 1% of AdaLoRA with far less complexity.
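The global rank budget can be illustrated with a toy allocation. This is a simplification under stated assumptions: it ranks directions by singular-value magnitude in a single pass, whereas AdaLoRA uses a sensitivity-based importance score and prunes gradually during training. `allocate_ranks` and the layer names are hypothetical.

```python
def allocate_ranks(singular_values: dict[str, list[float]], budget: int) -> dict[str, int]:
    """Keep the `budget` largest values across all layers; return rank per layer."""
    scored = [(s, name) for name, values in singular_values.items() for s in values]
    scored.sort(reverse=True)
    ranks = {name: 0 for name in singular_values}
    for _, name in scored[:budget]:
        ranks[name] += 1
    return ranks

# Made-up spectra: the early layer has one useful direction, the middle layer four.
layers = {
    "layer_0.q_proj": [0.9, 0.1, 0.05, 0.01],
    "layer_20.q_proj": [1.2, 1.0, 0.8, 0.7],
    "layer_39.q_proj": [1.1, 0.6, 0.3, 0.02],
}
print(allocate_ranks(layers, budget=6))
```

With a budget of 6 out of 12 directions, the middle layer keeps all four of its directions while the other two layers keep one each, which is the uneven allocation that uniform-rank LoRA cannot express.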

Q: What is DoRA (Weight-Decomposed LoRA) and why might it outperform standard LoRA?

A: DoRA decomposes the weight matrix $W$ into two components: a column-wise magnitude vector $m$ and a normalized direction matrix $V$, such that $W = m \cdot V$ where $V$ has unit-norm columns. Standard fine-tuning can freely adjust both magnitude and direction of each column. But LoRA's $\Delta W = BA$ forces magnitude and direction changes to be coupled: the low-rank structure constrains how both can change simultaneously.

DoRA trains $m$ directly (a full-rank but tiny vector - just one scalar per output neuron) and uses LoRA to adapt $V$ (the direction component). This decoupling means: magnitude can be adjusted independently with minimal parameters, while direction adapts via the low-rank LoRA component. Empirically: DoRA achieves better performance than LoRA at the same rank, typically 1-3% improvement on reasoning tasks. The cost: a small number of extra trainable parameters (the magnitude vector, which is negligible next to the LoRA matrices) plus extra training-time compute and memory for the per-column normalization. Use DoRA when: you need maximum accuracy from a given parameter budget, your task involves significant output magnitude shifts (common in code generation where the model must switch between very different output styles).
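A minimal NumPy sketch of the decomposition described above, assuming the common formulation in which the adapted weight is the magnitude times the column-normalized direction, with the direction updated by $BA$. The function `dora_weight` is hypothetical, and the $\alpha/r$ scaling is omitted for brevity.

```python
import numpy as np

def dora_weight(W0: np.ndarray, m: np.ndarray, B: np.ndarray, A: np.ndarray) -> np.ndarray:
    """W' = m * (W0 + B @ A) / ||W0 + B @ A||, normalized per column."""
    V = W0 + B @ A
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)
    return m * (V / col_norm)

rng = np.random.default_rng(0)
d, k, r = 16, 24, 4
W0 = rng.normal(size=(d, k))

# Initialization: m holds the column norms of the pretrained weight, B = 0.
m = np.linalg.norm(W0, axis=0, keepdims=True)
A = rng.normal(size=(r, k))
B = np.zeros((d, r))
print(np.allclose(dora_weight(W0, m, B, A), W0))  # model is unchanged at init

# After B moves away from zero, the direction changes but column norms stay m.
B = rng.normal(size=(d, r))
W_new = dora_weight(W0, m, B, A)
print(np.allclose(np.linalg.norm(W_new, axis=0, keepdims=True), m))
```

The second check shows the decoupling: the low-rank term can only rotate each column, and the column's length changes only when `m` is trained.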

Production LoRA Deployment Checklist

LORA_PRODUCTION_CHECKLIST = {
    "before_training": [
        "Save initial weights if you plan to use iterative pruning later",
        "Set tokenizer.pad_token = tokenizer.eos_token if the tokenizer has no pad token (required for batched training)",
        "Verify GPU memory budget: base_model_gb + lora_gb + optimizer_gb < 90% of VRAM",
        "Use gradient_checkpointing=True - always, even if you think you have enough memory",
        "Prepare calibration/validation set (10-20% of training set) for early stopping",
        "Log training loss and validation loss every 50-100 steps minimum",
    ],
    "training_monitoring": [
        "Training loss decreasing smoothly? Good.",
        "Validation loss plateauing before training loss? Early stopping triggered correctly?",
        "Gradient norm > 1.0 frequently? Reduce learning rate or increase warmup steps",
        "GPU utilization < 80%? Batch size is too small - increase per_device_train_batch_size if memory allows",
        "Loss spike at step N? Check if learning rate scheduler is misconfigured",
    ],
    "after_training_validation": [
        "Load adapter and run inference on 5-10 held-out examples manually",
        "Check that the model doesn't hallucinate your system prompt format",
        "Verify the model responds in the correct language/format for your task",
        "Run automated eval suite (accuracy benchmark on held-out test set)",
        "Compare to baseline model on the same test set",
    ],
    "deployment_decisions": {
        "single_task_high_traffic": "Merge adapter into base model - zero inference overhead",
        "multi_task_moderate_traffic": "Keep adapter separate - hot-swap between tasks",
        "extremely_high_traffic": "Merge AND quantize (AWQ/GPTQ) the merged model",
        "multi_tenant_saas": (
            "Keep adapters separate per tenant - load base model once, "
            "swap adapter per request. Significant memory savings."
        ),
    },
    "common_production_failures": [
        "Adapter trained on padded sequences has bad performance on non-padded inference "
        "- always strip padding at inference time",
        "Tokenizer mismatch: saved adapter with one tokenizer, loading with another. "
        "Always save tokenizer alongside adapter.",
        "PEFT version mismatch between training and serving environments. "
        "Pin PEFT version in requirements.txt.",
        "Missing prepare_model_for_kbit_training() call when doing QLoRA - "
        "causes NaN losses immediately",
        "lora_dropout set to 0.0 at inference but left > 0 at training - "
        "model.eval() sets dropout to 0, so this is actually fine automatically",
    ],
}
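The memory-budget rule in the checklist can be turned into a quick pre-flight check. This is a sketch under stated assumptions: `fits_in_vram` is a hypothetical helper, the example figures are the 70B QLoRA estimates used earlier in this lesson (with ~0.16 GB for 80M adapter weights at 2 bytes each), and activations are not counted, so passing the check is necessary but not sufficient.

```python
def fits_in_vram(
    base_model_gb: float,
    lora_gb: float,
    optimizer_gb: float,
    vram_gb: float,
    headroom: float = 0.90,
) -> bool:
    """True if static memory stays under `headroom` of total VRAM (activations excluded)."""
    return base_model_gb + lora_gb + optimizer_gb < headroom * vram_gb

# 70B base in NF4 on an 80 GB card.
print(fits_in_vram(base_model_gb=35, lora_gb=0.16, optimizer_gb=0.64, vram_gb=80))

# Same adapters on an FP16 base: the weights alone exceed the card.
print(fits_in_vram(base_model_gb=140, lora_gb=0.16, optimizer_gb=0.64, vram_gb=80))
```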

:::info LoRA Adapters are Tiny but the Base Model is Not When distributing a LoRA-trained model, you distribute two artifacts: the adapter (~100-400 MB) and the base model (7-140 GB). If your users need to download both, the base model dominates. Three patterns solve this: (1) Merge and share: share one merged model file (~14 GB for 7B). Users download once, no PEFT dependency. (2) HuggingFace Hub: publish adapter to Hub pointing to the exact base model version (e.g., meta-llama/Llama-2-7b). Users pull base model from Hub and overlay your adapter - separate downloads that can be cached. (3) Serve via API: host the merged or multi-adapter setup on your servers. Users make API calls, never need model weights locally. For B2B scenarios where IP protection matters: option 3 is usually required - sharing the adapter + base model effectively distributes your fine-tuned capability to anyone who can merge them. :::
