
Selecting Target Modules and Rank

The Production Decision That Determines Everything

It is 2:47 AM and your fine-tuned model is disappointing. The loss curves looked great during training. Eval perplexity dropped cleanly. But when you run it on real prompts, the outputs feel off - slightly robotic, occasionally incoherent, missing the domain-specific nuance you were trying to inject. You have been training for three days. Your GPU bill is mounting. And you are staring at two config lines that you copy-pasted from a tutorial without really understanding them:

target_modules=["q_proj", "v_proj"],
r=8,

Those two decisions - which modules to target and what rank to use - are not boilerplate. They are the core architectural choices of any LoRA run. Everything else (learning rate, batch size, epochs) matters less. If you target the wrong layers or pick a rank that is too low for your task complexity, no amount of hyperparameter tuning will save you. If you pick a rank that is too high, you are wasting memory and risking overfitting on a small dataset.

This lesson is about understanding those two decisions from first principles, so you never copy-paste them blindly again. By the end, you will know exactly which modules matter for which tasks, how to pick rank before training starts, and how to run a systematic ablation when you are not sure.

The knowledge here comes from hundreds of community fine-tuning runs, the original LoRA paper (Hu et al., 2021), the QLoRA paper (Dettmers et al., 2023), RSLoRA (Kalajdzievski, 2023), DoRA (Liu et al., 2024), and LoRA+ (Hayou et al., 2024). This is not tutorial knowledge. This is what separates practitioners who get good results from those who wonder why their model did not improve.


Why This Exists - The Problem Before LoRA Targeting

When LoRA was first proposed, the default advice was simple: apply it to the attention weight matrices. That advice came directly from the original paper, which tested on GPT-3 and found that targeting $W_q$ and $W_v$ alone was sufficient for most NLP tasks. The paper showed that even rank $r=1$ achieved reasonable results, and rank $r=4$ matched full fine-tuning on many benchmarks.

But that advice was calibrated for GPT-3-scale models on relatively narrow tasks like translation and summarization. When the open-source community started fine-tuning 7B, 13B, and 70B models for instruction following, code generation, and complex domain adaptation, the "just target q and v" approach started to show cracks.

Models fine-tuned with only attention LoRA would follow instructions reasonably but miss domain-specific vocabulary. Models trained on code data would handle syntax but struggle with multi-step reasoning. The problem was structural: the MLP (feed-forward network) layers in a transformer store factual knowledge and domain-specific associations, while attention layers handle how information is routed and combined. If your task requires new knowledge, targeting only attention is insufficient.

The community ran thousands of ablation experiments throughout 2023 and 2024. Papers like QLoRA, LLaMA-Adapter, and various community blog posts quantified exactly which modules mattered for which tasks. The answer turned out to be nuanced: it depends on what you are trying to change about the model's behavior.


Historical Context - From GPT-3 Experiments to Llama 3

The original LoRA paper (Hu et al., 2021) was calibrated entirely on GPT-3. The authors had a specific constraint: they were working with OpenAI's API and could not modify the model during inference. They needed a way to adapt the model without changing any weights at deploy time. The adapter had to be injected into specific layers and removed cleanly.

In that context, the choice to target only $W_q$ and $W_v$ was pragmatic. These matrices were well-understood from the attention mechanism literature, and the paper showed empirically that adding $W_k$ and $W_o$ provided marginal improvement. For the tasks they tested (GLUE, E2E NLG, WikiSQL), the delta was less than 0.5 points on average.

The "aha moment" for the broader community came in late 2022 and early 2023 when fine-tuning LLaMA and Alpaca models. Researchers noticed that models fine-tuned with broader target sets - including the FFN projections - showed much better capability transfer for instruction following. This was documented publicly by Tim Dettmers in the QLoRA paper (June 2023), which recommended targeting all linear layers in the model.

The QLoRA recommendation ("target everything") became the new community default, but it was an overcorrection. You pay for every additional target module in memory and compute. By 2024, the field had settled into a more nuanced understanding: target based on what your task requires, not on a blanket rule.


Core Concepts - Understanding the Transformer Weight Landscape

The Four Attention Projections

Every transformer attention block contains four linear projections. Understanding what each one does tells you whether you need to adapt it.

Query projection ($W_q$): Maps the input to query vectors. This controls what the model "looks for" when attending. Adapting $W_q$ changes which patterns in the sequence the model pays attention to.

Key projection ($W_k$): Maps inputs to key vectors. This determines what gets "found" when queries look around. Together with $W_q$, it controls the attention pattern.

Value projection ($W_v$): Maps inputs to value vectors. This determines what information flows forward once attention is computed. Adapting $W_v$ changes what content is carried through the network.

Output projection ($W_o$): Combines the attended values and projects back to the residual stream. Adapting this controls how attention outputs mix back into the main representation flow.

The LoRA paper found $W_q$ and $W_v$ to be the highest-impact pair. The intuition: queries and values together determine "what to attend to" and "what to extract." Keys and the output projection are secondary. But this is a soft rule, not an absolute one.

The Three FFN Projections

Modern LLMs use a gated MLP design (SwiGLU or GeGLU) with three weight matrices per FFN block:

Gate projection ($W_{gate}$): Controls which neurons are active via the gating mechanism. The gate decides how much signal passes through.

Up projection ($W_{up}$): Projects the input to the intermediate (expanded) dimension. In a 7B model with hidden size 4096 and intermediate size 14336, this is a 4096 x 14336 matrix.

Down projection ($W_{down}$): Projects back from the intermediate dimension to the hidden size. Adapting the down projection has the most direct effect on what the FFN "writes" to the residual stream.

FFN layers are where factual associations and domain knowledge live. If you are injecting new domain vocabulary or specialized knowledge, you need to touch these layers.

Embeddings and LM Head

Two additional targets matter for specific use cases:

embed_tokens: The input embedding table. Adapting this matters when you are adding new vocabulary (domain-specific tokens, a new language) or when your fine-tuning data uses tokens that appear rarely in pre-training.

lm_head: The output projection from hidden states to logits. In models with tied embeddings, it shares its weights with the embedding table (used in transposed form). Adapting the lm_head matters most when you are changing the output distribution significantly - like adapting a base model to a specific output format.

For most instruction fine-tuning runs, skip both. They are high-risk (easy to damage the output distribution) and low-reward unless you have a specific reason.
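If you do have such a reason - say you extended the tokenizer with domain-specific tokens - one common pattern is to train the embedding table and lm_head fully via PEFT's modules_to_save rather than giving them a low-rank update. The sketch below is illustrative, not a universal recipe; module names assume a Llama-style architecture and the resized-vocabulary scenario is a hypothetical.

from peft import LoraConfig

# Hypothetical setup after resizing embeddings for new domain tokens
config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Train (and save) the full embedding table and output head,
    # instead of applying low-rank adapters to them.
    modules_to_save=["embed_tokens", "lm_head"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)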


The Rank Parameter - Intuition Before Math

What Rank Actually Controls

Recall from the LoRA mathematics lesson that we decompose the weight update as:

$$\Delta W = BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$.

The rank $r$ controls the dimensionality of the "change space." A rank-1 update can only express changes that lie in a single direction in weight space. A rank-8 update can express changes in up to 8 independent directions. A rank-64 update gives you 64 directions.

The intuition: every task you want to teach the model requires some "directions of change" in its representations. Simple tasks - following a fixed output format, responding in a specific tone - need few directions. Complex tasks - adapting to a new medical subdomain with specialized reasoning chains - need more.

If your rank is too low, the model cannot express the required changes. If your rank is too high relative to your dataset size, the model will overfit to training examples rather than learning generalizable patterns.
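A quick toy demonstration of the "directions of change" idea - the matrices here are random and not tied to any real model, but they show that the update $BA$ can never contain more than $r$ independent directions:

import torch

d, k, r = 256, 256, 8
B = torch.randn(d, r)   # stand-in for LoRA's B
A = torch.randn(r, k)   # stand-in for LoRA's A
delta_W = B @ A         # the full-size update the adapter can express

# The rank of the update is capped at r, no matter how large d and k are.
print(torch.linalg.matrix_rank(delta_W))  # 8 (almost surely, for random A and B)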

The Math of Trainable Parameters

For a single weight matrix of shape $d \times k$, applying LoRA at rank $r$ adds:

$$\text{params} = r \times d + r \times k = r(d + k)$$

For the q_proj layer in Llama 3 8B (hidden size 4096):

  • Shape: 4096 x 4096
  • Original params: 16,777,216
  • LoRA at $r=8$: 8 x (4096 + 4096) = 65,536 params
  • Reduction factor: 256x

Now imagine targeting all attention and FFN layers in a 7B model at $r=16$:

| Layer | Shape | LoRA params (r=16) |
|---|---|---|
| q_proj | 4096 x 4096 | 131,072 |
| k_proj | 4096 x 1024 | 81,920 |
| v_proj | 4096 x 1024 | 81,920 |
| o_proj | 4096 x 4096 | 131,072 |
| gate_proj | 4096 x 14336 | 294,912 |
| up_proj | 4096 x 14336 | 294,912 |
| down_proj | 14336 x 4096 | 294,912 |

Per layer (32 layers): approximately 1.3M params. Total: approximately 42M trainable params out of 7B, or about 0.6%.

At $r=64$: approximately 168M trainable params, or about 2.4%.

This is the tradeoff. Higher rank = more expressive = more memory = more risk of overfitting on small datasets.
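These counts are easy to reproduce with the $r(d+k)$ formula. The sketch below assumes the per-layer shapes from the table (hidden size 4096, grouped-query key/value width 1024, intermediate size 14336, 32 layers):

def lora_params(shapes: dict, r: int) -> dict:
    """Trainable parameters added per targeted matrix: r * (d + k)."""
    return {name: r * (d + k) for name, (d, k) in shapes.items()}

# Assumed per-layer projection shapes for a Mistral-7B / Llama-3-8B-style block
SHAPES = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336), "down_proj": (14336, 4096),
}

per_layer = lora_params(SHAPES, r=16)
total = 32 * sum(per_layer.values())
print(per_layer)                                         # matches the table above
print(f"~{total / 1e6:.1f}M trainable params at r=16")   # ~41.9M
print(f"~{4 * total / 1e6:.0f}M at r=64")                # rank scales the count linearly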


Target Module Selection - The Decision Framework

Task Taxonomy

Think about fine-tuning tasks along two axes: behavioral change (how the model responds) vs knowledge injection (what the model knows).

High Knowledge Injection
|
Medical QA | Legal Document Analysis
Code from docs | Scientific paper writing
|
Low Behavioral -----------+----------- High Behavioral
Change | Change
|
Formatting only | Instruction following
Language style | Role-playing, tone
|
Low Knowledge Injection

Quadrant 1 - Low behavioral, high knowledge: New domain facts, terminology, specialized reasoning. Target FFN layers heavily. Add gate_proj, up_proj, down_proj alongside attention.

Quadrant 2 - High behavioral, high knowledge: Complex domain-specific tasks like medical diagnosis formatting, legal brief writing. Target everything: all attention projections plus all FFN projections.

Quadrant 3 - Low behavioral, low knowledge: Formatting changes, output structure, language style. Attention only (q_proj, v_proj) at low rank may be sufficient.

Quadrant 4 - High behavioral, low knowledge: Instruction following, chat format adaptation, tone changes. Attention layers (all four) at moderate rank.
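One way to make the quadrants concrete is a small lookup from quadrant to a starting configuration. The module lists and ranks below are illustrative defaults consistent with the guidance above, not hard rules; module names assume a Llama/Mistral-style architecture.

ATTENTION_MINIMAL = ["q_proj", "v_proj"]
ATTENTION_ALL = ["q_proj", "k_proj", "v_proj", "o_proj"]
ATTENTION_FFN = ATTENTION_ALL + ["gate_proj", "up_proj", "down_proj"]

# (target_modules, starting rank) per quadrant - illustrative defaults only
QUADRANT_DEFAULTS = {
    "q1_knowledge_only":     (ATTENTION_FFN, 16),      # new facts, same behavior
    "q2_knowledge_behavior": (ATTENTION_FFN, 32),      # complex domain-specific tasks
    "q3_light_touch":        (ATTENTION_MINIMAL, 4),   # formatting, style
    "q4_behavior_only":      (ATTENTION_ALL, 8),       # instruction following, tone
}

modules, rank = QUADRANT_DEFAULTS["q4_behavior_only"]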

Empirical Recommendations from the Literature

The QLoRA paper tested multiple target configurations on the MMLU benchmark with 4-bit Llama models. Key findings:

  • Attention-only (q, v): Establishes baseline capability, lowest memory
  • Attention (q, k, v, o): +0.5-1.5 MMLU points vs q, v only in most configs
  • Attention + FFN: +2-4 MMLU points vs attention-only for knowledge-intensive tasks
  • All linear layers: Marginal improvement over attention+FFN, higher cost

The community fine-tuning guide from axolotl and other projects settled on this practical default for LLaMA-family models:

# Most tasks - good starting point
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]

# Light adaptation - formatting, tone, simple instruction following
target_modules = ["q_proj", "v_proj"]

# Maximum adaptation - domain knowledge injection
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj",
                  "embed_tokens", "lm_head"]

Architecture Diagram - LoRA Target Positions in a Transformer Block

Purple nodes are the highest-impact attention targets. Teal nodes are FFN targets that carry domain knowledge. The residual connections (green) help explain why targeting more modules has diminishing returns: because every block writes into the shared residual stream, updates made by the targeted modules already shift the inputs that untargeted (frozen) layers see, so downstream behavior changes even though those layers' weights stay fixed.


Rank Selection - The Empirical Guide

The Rank-Task Complexity Map

| Task Type | Recommended Rank | Why |
|---|---|---|
| Format only (JSON output, specific template) | r=4 | Single behavioral direction needed |
| Simple instruction following | r=8 | A few behavioral directions |
| General chat fine-tuning | r=8 to r=16 | Moderate complexity |
| Domain adaptation (new terminology) | r=16 to r=32 | New knowledge subspace |
| Code fine-tuning | r=16 to r=32 | Reasoning + syntax patterns |
| Complex domain (medical, legal) | r=32 to r=64 | Dense knowledge injection |
| Multi-task continual learning | r=64+ | Multiple independent task directions |

The guiding principle: rank should match the intrinsic dimensionality of the change you want. Most behavioral changes in transformer models are low-dimensional. Hu et al. (2021) showed that random projections of fine-tuning gradients at rank 4-8 capture the majority of variance for NLP tasks. Domain knowledge injection is higher-dimensional because it involves many independent factual associations.

Rank and Dataset Size Interaction

There is an important interaction between rank and dataset size that many practitioners miss:

$$\text{effective rank} \approx \min\left(r, \frac{N}{k}\right)$$

where $N$ is the number of training examples and $k$ is a constant (roughly 100-1000 depending on model size and task). In plain terms: if you only have 1,000 training examples, using $r=64$ does not give you 64 effective directions - the optimization will only find a few meaningful directions and fit noise in the rest.

Rule of thumb:

  • Under 1,000 examples: $r \leq 8$
  • 1,000 - 10,000 examples: $r = 8$ to $r = 16$
  • 10,000 - 100,000 examples: $r = 16$ to $r = 32$
  • Over 100,000 examples: $r = 32$ to $r = 64$ may be warranted

The Alpha Scaling Parameter

The LoRA update is scaled by the factor $\alpha / r$ before being added to the output of the pretrained weights:

$$h = W_0 x + \frac{\alpha}{r} BA x$$

The ratio $\alpha / r$ controls how much the LoRA update is scaled relative to the original weights. Common practice is to set $\alpha = 2r$ (a scaling factor of 2), but many practitioners set $\alpha = r$ (scaling factor of 1.0) or $\alpha = 16$ regardless of $r$.

The key insight: what matters is the ratio $\alpha / r$, not the absolute values. Setting $\alpha = 16, r = 8$ gives the same scaling as $\alpha = 32, r = 16$. If you change $r$, scale $\alpha$ proportionally to maintain the same effective learning rate.
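A two-line check makes the interaction concrete - sweeping rank with a fixed alpha silently changes the scaling, while scaling alpha with rank keeps it constant:

for r in (4, 8, 16, 32, 64):
    fixed_alpha = 16 / r         # alpha fixed at 16: scaling drifts from 4.0 down to 0.25
    scaled_alpha = (2 * r) / r   # alpha = 2r: scaling stays 2.0 at every rank
    print(f"r={r:<3} fixed alpha -> {fixed_alpha:<6} scaled alpha -> {scaled_alpha}")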


RSLoRA - Fixing the Scaling Problem at High Ranks

The Problem with Standard LoRA Scaling

The original LoRA paper scaled the output by $\alpha / r$. This scaling was chosen so that hyperparameters would not need retuning when varying the rank (matrix $A$ is initialized with random normal values, $B$ with zeros, so the update starts at zero either way). But Kalajdzievski (2023) showed that this scaling causes a problem at higher ranks.

As rank increases, the gradient signal through the LoRA matrices changes in a way that makes higher-rank training less stable. Specifically, the stable learning rate decreases as $O(1/r)$, meaning at $r=64$ you need roughly an $8\times$ lower learning rate than at $r=8$ to maintain stability. This largely cancels the benefit of higher rank.

The RSLoRA Fix

RSLoRA (Rank-Stabilized LoRA) changes the scaling factor from $\alpha / r$ to $\alpha / \sqrt{r}$:

$$h = W_0 x + \frac{\alpha}{\sqrt{r}} BA x$$

This single change makes the optimal learning rate independent of rank, allowing you to train at r=64 with the same learning rate as r=8. The result: higher ranks become genuinely useful rather than just nominally larger.

In PEFT, enable RSLoRA with one flag:

from peft import LoraConfig

config = LoraConfig(
    r=32,
    lora_alpha=32,
    use_rslora=True,  # enables alpha / sqrt(r) scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

RSLoRA is almost always worth enabling when $r \geq 16$. At $r=8$ the difference is negligible. At $r=32$ or higher, it can meaningfully improve convergence.


LoRA+ - Different Learning Rates for A and B

The Asymmetry Problem

The two LoRA matrices have structurally different roles. Matrix $A$ is initialized with random Gaussian values, so it starts with non-trivial structure. Matrix $B$ is initialized to zero - it starts completely flat and must learn its structure from scratch during training.

Using the same learning rate for both is suboptimal. Hayou et al. (2024) showed that matrix $B$ benefits from a higher learning rate than matrix $A$. The optimal ratio is approximately $\eta_B / \eta_A = \lambda$ where $\lambda \in [4, 16]$, with $\lambda = 16$ performing best in most experiments.

The intuition: matrix $B$ needs to "wake up" quickly from its zero initialization, while matrix $A$ already has a reasonable starting point and benefits from more conservative updates.

Implementing LoRA+

from peft import LoraConfig

# Standard LoRA config - learning rates handled separately
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    use_rslora=True,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# For LoRA+: use the loraplus_lr_ratio in TRL's SFTTrainer
# or implement via a custom optimizer with parameter groups

def get_loraplus_param_groups(model, lr_A=1e-4, lr_ratio=16.0):
    """Split LoRA parameters into A and B groups with different learning rates."""
    lora_A_params = []
    lora_B_params = []
    other_params = []

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_A" in name:
            lora_A_params.append(param)
        elif "lora_B" in name:
            lora_B_params.append(param)
        else:
            other_params.append(param)

    return [
        {"params": lora_A_params, "lr": lr_A},
        {"params": lora_B_params, "lr": lr_A * lr_ratio},  # B learns faster
        {"params": other_params, "lr": lr_A},
    ]

# Example: optimizer = torch.optim.AdamW(get_loraplus_param_groups(model))

TRL's SFTTrainer supports LoRA+ natively via the loraplus_lr_ratio parameter - the simplest way to enable it.


DoRA - Decomposing Weights into Magnitude and Direction

The Core Idea

DoRA (Weight-Decomposed Low-Rank Adaptation, Liu et al., 2024) builds on an observation about how full fine-tuning changes weights: it tends to make small changes to the direction of weight vectors but larger changes to their magnitude.

Standard LoRA couples magnitude and direction changes together in the product $BA$, making it hard to express pure directional changes efficiently. DoRA decomposes the pretrained weight $W_0$ as:

$$W_0 = m \cdot \frac{V}{||V||_c}$$

where $m$ is a learnable magnitude vector and $V / ||V||_c$ is the normalized column-direction matrix. LoRA is then applied only to the directional component $V$:

$$W = (m + \Delta m) \cdot \frac{V + \Delta V}{||V + \Delta V||_c}$$

where $\Delta V = BA$ is the standard LoRA update.

Why DoRA Helps

The benefit of DoRA shows up most clearly in tasks that require significant behavioral shift from the base model. Standard LoRA sometimes struggles because the coupled magnitude-direction update requires high rank to express what would be a simple directional change. DoRA decouples these, allowing directional adaptation to happen efficiently even at low rank.

In practice, DoRA at $r=8$ has been shown to match or exceed standard LoRA at $r=16$ on several benchmarks, effectively halving the parameter count for equivalent quality.

from peft import LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    use_dora=True,   # enables DoRA decomposition
    use_rslora=True,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

Note: DoRA adds a small amount of overhead during training (the normalization computation). For QLoRA setups, the overhead is typically under 5% compared to standard LoRA.


The Rank-Module Decision Flowchart


Code - Full Ablation Study Script

The only reliable way to verify your module and rank choices is an ablation study. This script runs a grid of configurations and reports results, so you can pick the best config before committing to a full training run.

"""
LoRA ablation study: systematically evaluate target modules and rank choices.
Run on a small subset of your data (500-1000 examples) to find the optimal config.
"""

import json
import time
from dataclasses import dataclass
from typing import List, Dict, Any

import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, TaskType
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForSeq2Seq,
)


@dataclass
class AblationConfig:
name: str
target_modules: List[str]
r: int
lora_alpha: int
use_rslora: bool = False
use_dora: bool = False


# Define ablation grid
ABLATION_CONFIGS = [
AblationConfig(
name="baseline_qv_r8",
target_modules=["q_proj", "v_proj"],
r=8,
lora_alpha=16,
),
AblationConfig(
name="attention_all_r8",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
r=8,
lora_alpha=16,
),
AblationConfig(
name="attention_all_r16",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
r=16,
lora_alpha=16,
use_rslora=True,
),
AblationConfig(
name="full_r16",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
r=16,
lora_alpha=16,
use_rslora=True,
),
AblationConfig(
name="full_r32",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
r=32,
lora_alpha=32,
use_rslora=True,
),
AblationConfig(
name="full_r16_dora",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
r=16,
lora_alpha=16,
use_rslora=True,
use_dora=True,
),
]


def count_trainable_params(model) -> Dict[str, int]:
"""Count total and trainable parameters."""
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
return {"total": total, "trainable": trainable, "pct": 100 * trainable / total}


def run_ablation(
model_id: str,
train_dataset: Dataset,
eval_dataset: Dataset,
output_dir: str = "./ablation_results",
max_steps: int = 100, # short runs for ablation
) -> List[Dict[str, Any]]:
"""Run ablation study across all configs."""

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

results = []

for cfg in ABLATION_CONFIGS:
print(f"\n{'='*60}")
print(f"Running: {cfg.name}")
print(f"Modules: {cfg.target_modules}")
print(f"Rank: {cfg.r}, Alpha: {cfg.lora_alpha}")
print(f"RSLoRA: {cfg.use_rslora}, DoRA: {cfg.use_dora}")

# Load base model fresh for each run
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)

# Apply LoRA config
lora_config = LoraConfig(
r=cfg.r,
lora_alpha=cfg.lora_alpha,
target_modules=cfg.target_modules,
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
use_rslora=cfg.use_rslora,
use_dora=cfg.use_dora,
)

model = get_peft_model(model, lora_config)
param_info = count_trainable_params(model)

print(f"Trainable params: {param_info['trainable']:,} ({param_info['pct']:.2f}%)")

# Training args - short run for ablation
training_args = TrainingArguments(
output_dir=f"{output_dir}/{cfg.name}",
max_steps=max_steps,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_steps=10,
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_strategy="no",
bf16=True,
dataloader_num_workers=0,
report_to="none",
)

data_collator = DataCollatorForSeq2Seq(
tokenizer=tokenizer,
model=model,
padding=True,
)

trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
data_collator=data_collator,
)

start_time = time.time()
train_result = trainer.train()
elapsed = time.time() - start_time

eval_result = trainer.evaluate()

result = {
"config": cfg.name,
"target_modules": cfg.target_modules,
"rank": cfg.r,
"lora_alpha": cfg.lora_alpha,
"use_rslora": cfg.use_rslora,
"use_dora": cfg.use_dora,
"trainable_params": param_info["trainable"],
"trainable_pct": param_info["pct"],
"train_loss": train_result.training_loss,
"eval_loss": eval_result.get("eval_loss"),
"train_time_seconds": elapsed,
}

results.append(result)

# Clean up memory
del model
torch.cuda.empty_cache()

print(f"Train loss: {result['train_loss']:.4f}")
print(f"Eval loss: {result['eval_loss']:.4f}")

# Sort by eval loss and print summary
results.sort(key=lambda x: x["eval_loss"] or float("inf"))

print(f"\n{'='*60}")
print("ABLATION SUMMARY (sorted by eval loss)")
print(f"{'='*60}")
print(f"{'Config':<25} {'Rank':<6} {'Params':<12} {'Train Loss':<12} {'Eval Loss':<12}")
print("-" * 70)
for r in results:
print(f"{r['config']:<25} {r['rank']:<6} "
f"{r['trainable_params']:>10,} "
f"{r['train_loss']:<12.4f} "
f"{r['eval_loss']:<12.4f}")

# Save results
with open(f"{output_dir}/ablation_results.json", "w") as f:
json.dump(results, f, indent=2)

print(f"\nResults saved to {output_dir}/ablation_results.json")
print(f"Recommended config: {results[0]['config']}")

return results


if __name__ == "__main__":
# Example usage with a toy dataset
# Replace with your actual tokenized dataset
print("Ablation study framework loaded.")
print("Call run_ablation(model_id, train_dataset, eval_dataset) to start.")

Quick-Reference Config Builder

from peft import LoraConfig, TaskType


def build_lora_config(
    task_type: str = "instruction_following",
    dataset_size: int = 5000,
    domain_complexity: str = "medium",
) -> LoraConfig:
    """
    Build a LoRA config based on task type and dataset size.

    task_type: "format_only" | "instruction_following" | "domain_adaptation" | "code"
    dataset_size: number of training examples
    domain_complexity: "low" | "medium" | "high"
    """
    # Base module sets
    ATTENTION_MINIMAL = ["q_proj", "v_proj"]
    ATTENTION_ALL = ["q_proj", "k_proj", "v_proj", "o_proj"]
    ATTENTION_FFN = ["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"]

    # Select modules based on task
    if task_type == "format_only":
        modules = ATTENTION_MINIMAL
        base_rank = 4
    elif task_type == "instruction_following":
        modules = ATTENTION_ALL
        base_rank = 8
    elif task_type in ("domain_adaptation", "code"):
        modules = ATTENTION_FFN
        base_rank = 16 if domain_complexity == "low" else 32
    else:
        modules = ATTENTION_FFN
        base_rank = 16

    # Scale rank with dataset size
    if dataset_size < 1000:
        rank = min(base_rank, 8)
    elif dataset_size < 10000:
        rank = base_rank
    else:
        rank = min(base_rank * 2, 64)

    # Enable RSLoRA for high ranks
    use_rslora = rank >= 16

    print("Recommended config:")
    print(f"  target_modules = {modules}")
    print(f"  r = {rank}")
    print(f"  lora_alpha = {rank}")
    print(f"  use_rslora = {use_rslora}")

    return LoraConfig(
        r=rank,
        lora_alpha=rank,
        target_modules=modules,
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        use_rslora=use_rslora,
    )


# Usage examples
config_simple = build_lora_config("format_only", dataset_size=500)
config_standard = build_lora_config("instruction_following", dataset_size=5000)
config_domain = build_lora_config("domain_adaptation", dataset_size=50000, domain_complexity="high")

The Rank Selection Visualization


Production Engineering Notes

Memory Impact of Module Selection

Adding FFN modules to your target set has an outsized memory impact compared to attention modules. In a 7B model:

  • Attention projections (q, k, v, o): roughly 0.43M LoRA parameters per layer at r=16
  • FFN projections (gate, up, down): roughly 0.88M LoRA parameters per layer at r=16

The FFN projections are large because the intermediate size (14336 in Mistral 7B and Llama 3 8B) is 3.5x the hidden size. Targeting FFN layers therefore roughly triples the LoRA parameter count - and the associated gradient and optimizer-state memory - compared to attention-only.

For QLoRA runs on 24GB GPUs, targeting all 7 modules at r=32 can push you to the memory limit. Monitor with nvidia-smi or torch.cuda.memory_summary() before committing to a full training run.
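For a rough sanity check before a run, you can turn the parameter counts above into a memory estimate. The byte multiplier below is an assumption (bf16 adapter weights and gradients plus fp32 Adam moments, roughly 12 bytes per trainable parameter) and it ignores activations and the base model, which usually dominate:

def estimate_lora_state_mb(num_layers=32, r=16, hidden=4096, kv=1024,
                           intermediate=14336, include_ffn=True,
                           bytes_per_param=12):
    """Rough adapter + gradient + optimizer-state memory in MB (assumption-laden)."""
    shapes = [(hidden, hidden), (hidden, kv), (hidden, kv), (hidden, hidden)]  # q, k, v, o
    if include_ffn:
        shapes += [(hidden, intermediate), (hidden, intermediate), (intermediate, hidden)]
    params = num_layers * sum(r * (d + k) for d, k in shapes)
    return params * bytes_per_param / 1024**2

print(f"attention-only: {estimate_lora_state_mb(include_ffn=False):.0f} MB")
print(f"attention+FFN:  {estimate_lora_state_mb(include_ffn=True):.0f} MB")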

Gradient Checkpointing Interaction

When you enable gradient checkpointing (which you should for large models), it recomputes activations during the backward pass rather than storing them. This saves memory at the cost of extra compute. The interaction with LoRA targets matters: layers that are NOT targeted still participate in the backward pass for gradient checkpointing. Adding more target modules increases the gradient checkpointing compute slightly, but the effect is small compared to the memory saving.
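One practical detail when combining gradient checkpointing with LoRA: the frozen base model's inputs must require gradients, otherwise the checkpointed segments produce no gradient for the adapter. A minimal sketch using standard transformers/PEFT calls (model_id and lora_config stand in for values defined earlier in this lesson):

import torch
from transformers import AutoModelForCausalLM
from peft import get_peft_model

model_id = "meta-llama/Llama-3.1-8B"  # illustrative
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.gradient_checkpointing_enable()   # recompute activations during the backward pass
model.enable_input_require_grads()      # lets gradients reach the LoRA layers under checkpointing
model = get_peft_model(model, lora_config)  # lora_config as defined earlier

For QLoRA setups, PEFT's prepare_model_for_kbit_training wraps these steps (plus some dtype handling) for you.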

Saving and Loading LoRA Adapters

PEFT saves only the LoRA adapter weights, not the full model. The saved files include:

  • adapter_config.json - stores target_modules, r, lora_alpha, and all other hyperparameters
  • adapter_model.safetensors - the actual LoRA weight tensors

When loading, the base model must have identical architecture to what was used during training. If you change target_modules between runs, the adapter is not compatible with the base model. Always keep your config in version control alongside your adapter checkpoints.
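A minimal sketch of the save side (paths are illustrative); the config file is also a convenient place to double-check what you actually trained:

import json

# After training, persist only the adapter
model.save_pretrained("./lora_adapter")  # writes adapter_config.json + adapter_model.safetensors

# adapter_config.json records the choices that determine load-time compatibility
with open("./lora_adapter/adapter_config.json") as f:
    cfg = json.load(f)
print(cfg["target_modules"], cfg["r"], cfg["lora_alpha"])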

Merging Adapters

After training, you can merge the LoRA adapter into the base model weights for zero-overhead inference:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",  # merge on CPU to avoid GPU memory pressure
)

# Load adapter
model = PeftModel.from_pretrained(base_model, "./lora_adapter")

# Merge and unload
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")

After merging, there is no LoRA overhead at inference time. The tradeoff: you can no longer swap adapters or adjust the LoRA weights.


Common Mistakes

:::danger Target Module Mismatch

The most common error: using target module names from one model family on a different family.

# Llama 3 uses these names
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]

# Mistral uses the same names - fine
# Phi-3 uses: "qkv_proj" (combined QKV matrix!) - will fail silently or error

# GPT-NeoX / Pythia uses:
# "query_key_value" (fused), "dense", "dense_h_to_4h", "dense_4h_to_h"

# Falcon uses:
# "query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"

Always inspect your model's named modules before setting target_modules:

model = AutoModelForCausalLM.from_pretrained(model_id)
for name, module in model.named_modules():
    if hasattr(module, "weight") and module.weight is not None:
        print(name, type(module).__name__, module.weight.shape)

Look for Linear layers in the output. Those are your candidates.

:::

:::danger Rank Too High for Dataset Size

Training at r=64 on 500 examples will overfit severely. The LoRA matrices have enough capacity to memorize the training set completely, and you will see near-zero train loss with terrible generalization.

Signs of overfitting:

  • Train loss converges to near 0 while eval loss increases
  • The model generates near-verbatim copies of training examples
  • Performance on held-out prompts is worse than the base model

Fix: reduce rank or increase your dataset. For 500 examples, r=4 or r=8 is appropriate.

:::

:::warning Alpha/Rank Ratio Confusion

Setting lora_alpha=16 and then changing r without adjusting alpha changes your effective learning rate. Many practitioners fix lora_alpha=16 as a "safe default" and then sweep ranks, unknowingly changing the effective learning rate for every rank value.

Either:

  1. Always set lora_alpha = r (scaling factor of 1.0)
  2. Always set lora_alpha = 2*r (scaling factor of 2.0, a common community default)
  3. Enable RSLoRA, which stabilizes scaling across ranks

Never leave lora_alpha as a fixed constant while sweeping r.

:::

:::warning Forgetting to Set requires_grad for Non-LoRA Parameters

When using PEFT with get_peft_model(), all base model parameters are frozen automatically. But if you manually apply LoRA or use custom model modifications, verify that only LoRA parameters have requires_grad=True:

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Should show: trainable params: X || all params: Y || trainable%: Z

# Verify manually
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
assert all("lora_" in name for name in trainable), "Non-LoRA params are trainable!"

If non-LoRA parameters are trainable, you are not doing PEFT - you are doing full fine-tuning with a huge memory footprint and no adapter portability.

:::


Interview Q&A

Q1: Why does the original LoRA paper recommend targeting only q_proj and v_proj, and when should you deviate from this?

A: The original LoRA paper (Hu et al., 2021) was evaluated on GPT-3 for tasks like NLG and translation. In that narrow context, the query and value projections captured the most relevant adaptation directions. The paper also showed diminishing returns from adding k_proj and o_proj.

You should deviate from this default when your task requires knowledge injection rather than just behavioral adaptation. FFN layers (gate, up, down projections) store factual associations and domain vocabulary. If you are fine-tuning for medical QA, legal document analysis, or a specialized technical domain, the model needs to update its factual knowledge store, not just its attention patterns. In practice: always add o_proj to the attention set (it carries the combined attention output back to the residual stream), and add FFN layers for any knowledge-intensive task. The cost in memory is real but usually worth it.

Q2: What is RSLoRA and why does it matter for high-rank training?

A: RSLoRA (Rank-Stabilized LoRA, Kalajdzievski 2023) changes the LoRA output scaling from $\alpha / r$ to $\alpha / \sqrt{r}$. The original scaling causes the optimal learning rate to decrease as $O(1/r)$, meaning training at r=64 requires an 8x lower learning rate than training at r=8 to stay stable. This largely negates the benefit of higher rank. RSLoRA's $1/\sqrt{r}$ scaling makes the optimal learning rate approximately rank-independent, so you can train at r=32 or r=64 with the same learning rate as r=8 while actually benefiting from the higher expressivity. Always enable RSLoRA when using r=16 or higher. In PEFT: use_rslora=True.

Q3: What is DoRA and how does it differ from standard LoRA?

A: DoRA (Weight-Decomposed Low-Rank Adaptation, Liu et al. 2024) decomposes each pretrained weight matrix into a magnitude component and a directional component, then applies standard LoRA only to the directional component. The motivation: full fine-tuning tends to make large magnitude changes and small directional changes. Standard LoRA couples both in the product $BA$, which is inefficient for tasks requiring mostly directional adaptation. DoRA allows the magnitude to be learned directly (cheap - one scalar per column) while applying LoRA to the direction efficiently. Empirically, DoRA at r=8 often matches standard LoRA at r=16, halving the parameter count. Enable with use_dora=True in PEFT's LoraConfig.

Q4: How does rank interact with dataset size, and what are the practical implications?

A: Rank controls the capacity of the LoRA adapter - how many independent directions of change it can express. But effective capacity is bounded by the amount of training signal available. With N training examples, the optimizer can reliably identify at most approximately $N / 100$ independent directions (a rough empirical rule). Using a rank higher than this effective ceiling does not provide more capacity - the extra dimensions fit noise from the training set.

Practical implications: at 1,000 examples, r=8 to r=16 is the safe zone. Going to r=64 on 1,000 examples will overfit. At 100,000 examples, r=32 to r=64 becomes warranted for complex tasks. The symptom of rank-overfitting is training loss near zero with degraded generalization on held-out prompts.

Q5: How do you decide between the "target all linear layers" approach vs targeted module selection?

A: "Target all linear layers" is the safe default when you do not know your task well. It avoids the risk of under-targeting (missing a layer that mattered) at the cost of more parameters and memory. The selective approach is better when you have constraints (limited GPU memory, need minimal adapter size for serving) or when you have run ablations to confirm which modules matter.

The principled approach: run a short ablation (100-200 training steps on a subset) with three configurations - attention-only, attention+FFN, and all linear. If attention+FFN matches all-linear on eval loss, drop the extra modules. If attention-only is close to attention+FFN, you may not need FFN. The 1-2 hours spent on this ablation pays off in the full training run.

Q6: What is LoRA+ and when should you use it?

A: LoRA+ (Hayou et al., 2024) sets different learning rates for the A and B matrices in each LoRA layer. Matrix B is initialized to zero and benefits from a higher learning rate (the recommended ratio is $\eta_B = 16 \times \eta_A$). Matrix A has random initialization and benefits from a more conservative learning rate. Standard LoRA uses the same learning rate for both, which is suboptimal for B's "cold start" from zero. LoRA+ consistently improves convergence speed and final quality, especially on tasks that require large behavioral changes from the base model. In TRL's SFTTrainer, enable it with loraplus_lr_ratio=16. The cost is essentially zero - just a different parameter group in the optimizer.


Summary

Target module and rank selection are the most impactful decisions in any LoRA fine-tuning run - more impactful than learning rate or batch size in most cases.

The core decision tree:

  1. Start with the task quadrant: behavioral change vs knowledge injection
  2. For knowledge injection, add FFN modules; for behavioral change, attention is sufficient
  3. Set rank based on dataset size: small data = low rank, large data = higher rank
  4. Enable RSLoRA when r=16 or higher for stable training
  5. Consider DoRA for parameter efficiency at equivalent quality
  6. Run a short ablation to validate before committing to full training

The cost of getting this wrong is significant: wasted GPU hours, underfitting that no hyperparameter tuning can fix, or overfitting that produces a model worse than the base. The cost of getting it right is a model that generalizes well from minimal data.

© 2026 EngineersOfAI. All rights reserved.