Full Fine-Tuning vs PEFT: Decision Framework

The Decision Nobody Talks About Explicitly

A production ML team gets a new requirement: adapt a 13B model to the company's internal document style. The team lead opens up a planning doc and immediately faces a decision that will determine the next two weeks of engineering work, the GPU budget, and the final model quality: do we full fine-tune, or do we use LoRA?

Everyone on the team has an opinion. The researcher says full fine-tune - always better quality. The infrastructure engineer says LoRA - GPU costs are out of control. The product manager says "can we just use prompt engineering?" The team lead goes in circles.

This lesson is the framework that makes this decision principled instead of emotional.

The right answer depends on four things: (1) how much GPU memory do you have, (2) how many parameters does your model have, (3) how much distribution shift does your task require, and (4) how important is inference efficiency. Get these four numbers, apply the framework, and the decision makes itself.

The Memory Wall: A Concrete Comparison

Before any quality discussion, you need to know what is even possible given your hardware.

Memory Requirements by Method (approximate, FP16/BF16 base, Adam optimizer)

Model	Full FT (BF16 + Adam)	LoRA r=16	QLoRA (NF4 + LoRA)
7B	~60GB	~18GB	~12GB
13B	~104GB	~30GB	~18GB
33B	~280GB	~70GB	~24GB
65B	~520GB	~140GB	~40GB
70B	~560GB	~150GB	~42GB

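Notes: In full FP32, full fine-tuning needs 4 bytes/param (weights) + 4 bytes/param (gradients) + 8 bytes/param (two Adam states) = 16 bytes/param. Keeping weights, gradients, and Adam states in BF16 halves this to about 8 bytes/param, which is roughly what the Full FT column assumes (13B x 8 bytes = 104GB). Mixed-precision setups that keep FP32 master weights and FP32 Adam states land back near 16 bytes/param. LoRA keeps the base model in BF16 (2 bytes/param) and stores gradients and optimizer states only for the LoRA parameters (~0.5% of params for r=16). QLoRA stores the base in NF4 (about 0.5 bytes/param) and trains BF16 LoRA weights on top. Activation memory, which grows with batch size and sequence length, comes on top of these per-parameter costs.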
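
The per-parameter accounting in the notes can be turned into a small calculator. This is a minimal sketch, assuming BF16 weights, gradients, and Adam states by default; the function name and its parameters are illustrative, and activations, buffers, and fragmentation are deliberately left out, which is one reason its results come out lower than the LoRA and QLoRA columns in the table.

```python
def training_memory_gb(
    n_params: float,
    weight_bytes: float = 2.0,        # BF16 weights (use 0.5 for an NF4 base)
    trainable_fraction: float = 1.0,  # 1.0 = full fine-tuning, ~0.005 = LoRA r=16
    grad_bytes: float = 2.0,          # BF16 gradients
    optim_bytes: float = 4.0,         # two BF16 Adam moments (assumption)
) -> float:
    """Weights + gradients + optimizer states in GB (1 GB = 1e9 bytes).

    Activations and runtime overhead are NOT included.
    """
    weights = n_params * weight_bytes
    trainable = n_params * trainable_fraction
    return (weights + trainable * (grad_bytes + optim_bytes)) / 1e9


full_ft_7b = training_memory_gb(7e9)
lora_7b = training_memory_gb(7e9, trainable_fraction=0.005)
qlora_7b = training_memory_gb(7e9, weight_bytes=0.5, trainable_fraction=0.005)

print(f"7B full FT: {full_ft_7b:.2f} GB")
print(f"7B LoRA:    {lora_7b:.2f} GB")
print(f"7B QLoRA:   {qlora_7b:.2f} GB")
```

Treat the output as a lower bound: whatever the calculator reports, leave headroom for activations before deciding that a run fits on a given GPU.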

Quality Comparison: Does LoRA Actually Match Full Fine-Tuning?

The core question: does the lower-rank constraint of LoRA hurt quality?

The empirical answer: for most tasks, LoRA at r=16 or higher is within 1-2% of full fine-tuning quality. For some tasks requiring large distribution shift, full fine-tuning is meaningfully better.

Evidence:

Hu et al. (2021): LoRA on GPT-3 175B performed on par with or better than full fine-tuning on WikiSQL, MultiNLI, and SAMSum while training only a small fraction of the parameters
Dettmers et al. (2023): QLoRA at r=64 on MMLU scored within 1 point of full fine-tuning baselines
Chen et al. (2022): LoRA vs full fine-tuning on code generation - full fine-tuning wins by ~3% on complex coding benchmarks
Liu et al. (2022): For domain pretraining (medical, legal) - full fine-tuning produces significantly better representations than LoRA

Rule of thumb:

Style/format/instruction following: LoRA ≈ Full FT
Domain adaptation (same task, different domain): LoRA slightly worse but acceptable
New task types with significant knowledge requirement: Full FT wins
Continual pretraining (learning new facts): Full FT significantly better

The Full Spectrum of PEFT Methods

Beyond LoRA, there is a rich ecosystem of parameter-efficient methods:

Prefix Tuning (Li and Liang, 2021)

Instead of modifying the model weights, prefix tuning prepends trainable "virtual tokens" to the input at each transformer layer. These prefix vectors are learned during fine-tuning; the model weights are frozen.

Trainable parameters: prefix_length * num_layers * hidden_dim * 2 (for K and V)
Typical prefix length: 10-50 tokens
Very parameter-efficient but harder to train (requires larger learning rates)
Best for seq2seq tasks (T5-style) - less effective for causal LMs

Prompt Tuning (Lester et al., 2021)

A simpler variant: prepend trainable soft tokens only at the input embedding layer (not at every layer). Extremely few parameters: prefix_length * embedding_dim. Works surprisingly well at large model scales (11B+) but poorly at smaller scales (less than 1B).

IA3 (Liu et al., 2022)

Instead of adding matrices like LoRA, IA3 rescales existing attention and MLP outputs by learning small vectors. Even fewer parameters than LoRA (typically 0.01% of total parameters). Works well for few-shot scenarios with very limited data.

DoRA (Liu et al., 2024)

Weight-decomposed low-rank adaptation: splits each pretrained weight matrix into a magnitude component and a direction component, trains the magnitude directly, and applies a LoRA-style low-rank update to the direction. Often matches or exceeds LoRA quality at a nearly identical parameter count, especially for tasks requiring large updates.

Adapter Layers (Houlsby et al., 2019)

Insert small bottleneck layers between transformer layers. Parameters: 2 * adapter_size * hidden_dim * num_layers. Works well but adds inference latency (unlike LoRA, which can be merged). Largely superseded by LoRA in practice.
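
The parameter-count formulas quoted in this section are easy to compare side by side. The sketch below uses the prefix tuning, prompt tuning, and adapter formulas as stated above; the LoRA count (r * (d_in + d_out) per adapted matrix) is the standard one but is not spelled out in this section, and the model shape (32 layers, hidden size 4096) is a hypothetical example rather than a specific checkpoint.

```python
def prefix_tuning_params(prefix_length: int, num_layers: int, hidden_dim: int) -> int:
    # Trainable K and V prefixes at every layer
    return prefix_length * num_layers * hidden_dim * 2


def prompt_tuning_params(prompt_length: int, embedding_dim: int) -> int:
    # Soft tokens at the input embedding layer only
    return prompt_length * embedding_dim


def adapter_params(adapter_size: int, hidden_dim: int, num_layers: int) -> int:
    # Down-projection + up-projection per layer (biases ignored)
    return 2 * adapter_size * hidden_dim * num_layers


def lora_params(r: int, d_in: int, d_out: int, num_matrices: int) -> int:
    # A is (r x d_in), B is (d_out x r) for each adapted weight matrix
    return r * (d_in + d_out) * num_matrices


# Hypothetical model shape: 32 layers, hidden size 4096
layers, hidden = 32, 4096
counts = {
    "prefix tuning (length 20)": prefix_tuning_params(20, layers, hidden),
    "prompt tuning (length 20)": prompt_tuning_params(20, hidden),
    "adapters (size 64)": adapter_params(64, hidden, layers),
    "LoRA r=16 on q_proj + v_proj": lora_params(16, hidden, hidden, 2 * layers),
}
for name, n in counts.items():
    print(f"{name:<30} {n:>12,}")
```

Prompt tuning is smaller than the other methods by roughly two orders of magnitude here, which fits the observation above that it needs a large base model to work well.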

When Full Fine-Tuning is Worth the Cost

Full fine-tuning should be your choice when:

1. Continual pretraining / domain pretraining

You have a large corpus of domain-specific text (medical journals, legal documents, code repositories) and you want the model to deeply internalize domain knowledge - not just learn a new response style. Full fine-tuning allows all weights to shift, letting domain-specific patterns propagate throughout the model. LoRA's frozen base weights preserve the original distribution, limiting how much the model can adapt to a very different domain.

Example: Bloomberg trained BloombergGPT (Wu et al., 2023) from scratch on a mix of financial and general-purpose data. By the reasoning above, a fully fine-tuned model on financial text would be the next-best option, with a LoRA fine-tuned model third.

2. Very large, high-quality datasets (1M+ examples)

With millions of high-quality examples, full fine-tuning can leverage the entire model capacity to fit the distribution. At this data scale, the risk of catastrophic forgetting is lower (the data is large enough to reinforce general capabilities alongside task-specific ones). LoRA's parameter budget (0.1-0.5% of total) may become the bottleneck.

3. Dramatic distribution shift from base model

If your target domain uses very different vocabulary, syntax, or reasoning patterns than the base model's training data, full fine-tuning allows deeper adaptation. Medical/legal/scientific language often falls in this category.

4. When inference efficiency is the priority

A full fine-tuned model has no inference overhead. LoRA without merging adds ~5% latency. While this is small, at very high request volumes (millions of requests/day), 5% latency adds up. And merging requires an extra step. Full fine-tuning produces a clean, standalone model.

When LoRA is the Right Choice

The 80% case: LoRA is the right choice in most practical fine-tuning scenarios.

Limited GPU budget: LoRA makes 7B-70B fine-tuning accessible on single GPUs
Frequent task switching: maintain one base model with many adapters
Small-to-medium datasets (1K-100K examples): LoRA's regularization helps avoid overfitting; full fine-tuning on small data can overfit
Style and format adaptation: teaching the model a specific response style - full fine-tuning provides no benefit
Multi-tenant serving: one base model, many customer-specific adapters
Rapid iteration: save and load adapters quickly, experiment with different rank values

The Decision Flowchart
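
The decision logic described in this lesson can be approximated in a few lines. The sketch below is one possible reading of the guidance here, not a definitive flowchart: the function name, its parameters, and the way the 1M-example threshold is wired in are illustrative assumptions, and the three memory estimates are meant to be read off the memory table for your model size.

```python
def recommend_method(
    gpu_gb: float,
    full_ft_gb: float,
    lora_gb: float,
    qlora_gb: float,
    num_examples: int,
    large_distribution_shift: bool,
) -> str:
    """Return a starting recommendation given hardware and task constraints."""
    # Full fine-tuning is only worth considering for domain pretraining,
    # dramatic distribution shift, or very large datasets (1M+ examples).
    wants_full_ft = large_distribution_shift or num_examples >= 1_000_000
    if wants_full_ft and gpu_gb >= full_ft_gb:
        return "full fine-tuning"
    # Otherwise default to LoRA, falling back to QLoRA when memory is tight.
    if gpu_gb >= lora_gb:
        return "LoRA"
    if gpu_gb >= qlora_gb:
        return "QLoRA"
    return "smaller base model or more GPU memory"


# 13B model on a single 24GB GPU, 50K examples, style adaptation
print(recommend_method(24, 104, 30, 18, 50_000, False))
# 13B model on 8x80GB, 2M examples of domain text
print(recommend_method(640, 104, 30, 18, 2_000_000, True))
```

When full fine-tuning would be preferred but does not fit, the sketch falls through to LoRA, mirroring the "default to LoRA, escalate if needed" advice.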

Multi-Task Fine-Tuning

Multi-task fine-tuning (MTF) trains a single model on multiple tasks simultaneously. This approach has several advantages over separate single-task models:

Shared representations: tasks that share useful features (e.g., sentiment analysis and tone classification) benefit from shared learning
Regularization: training on multiple tasks prevents overfitting to any single task
Inference efficiency: one model for many tasks instead of many models

Challenge: task balancing. If task A has 1M examples and task B has 1K examples, the model will optimize primarily for task A. Solutions:

Temperature sampling: oversample smaller tasks by drawing from each task proportionally to $N_i^{0.7}$ (temperature scaling) rather than $N_i$
Caps: cap the number of examples from any single task at a maximum value
Upsampling: duplicate small task examples
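
To make the temperature sampling option concrete, the sketch below computes per-task sampling probabilities for the 1M vs 1K example above. The function name is illustrative, and the 0.7 exponent is the value quoted in this section; other exponents are just as valid.

```python
def temperature_sampling_probs(
    task_sizes: dict[str, int], exponent: float = 0.7
) -> dict[str, float]:
    """Sample each task proportionally to N_i ** exponent instead of N_i."""
    scaled = {task: n ** exponent for task, n in task_sizes.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}


sizes = {"task_a": 1_000_000, "task_b": 1_000}
proportional = {t: n / sum(sizes.values()) for t, n in sizes.items()}
tempered = temperature_sampling_probs(sizes)

for task in sizes:
    print(f"{task}: proportional={proportional[task]:.4%}  tempered={tempered[task]:.4%}")
```

With these sizes the small task goes from about 0.1% of draws under proportional sampling to about 0.8% under the 0.7 exponent: a meaningful boost, while the large task still dominates the mix.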

When MTF beats single-task fine-tuning: when tasks are related and you can benefit from knowledge sharing. Fine-tuning on sentiment, emotion classification, and toxicity detection together typically produces better representations for all three tasks than fine-tuning each separately.

Practical Recommendations by Use Case

Use Case	Recommended Method	Rank / Notes
Chat/instruction following (7B)	LoRA	r=16, all linear layers
Domain Q&A, small dataset	LoRA	r=16, watch for overfitting
Domain Q&A, large dataset	Full FT	If GPU budget allows
Code generation	LoRA r=32+	Code requires more capacity
Math/reasoning	LoRA r=64 or Full FT	Reasoning benefits from full capacity
Medical domain pretraining	Full FT	Significant distribution shift
Per-user personalization	LoRA r=4	Small, fast adapter per user
Multi-tenant serving	LoRA r=8-16	One base, many adapters
Very limited budget (laptop)	QLoRA	4-bit + LoRA
Style/tone adaptation	Prompt tuning or LoRA r=4	Minimal changes needed

Code: Comparing Methods

"""
Side-by-side comparison of full fine-tuning vs LoRA memory usage.
Shows how to set up each and compare memory footprints.
"""

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType


def get_model_memory_gb(model) -> float:
    """Estimate model memory in GB."""
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / 1e9


def setup_full_finetuning(model_name: str):
    """Setup for full fine-tuning - all parameters trainable."""
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    )
    # All parameters are trainable by default
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Full FT: {trainable:,} trainable / {total:,} total ({100 * trainable / total:.1f}%)")
    return model


def setup_lora(model_name: str, r: int = 16, alpha: int = 32):
    """Setup LoRA - only low-rank adapter parameters trainable."""
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, use_cache=False
    )
    lora_config = LoraConfig(
        r=r,
        lora_alpha=alpha,
        target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                         "gate_proj", "up_proj", "down_proj"],
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    return model


def compare_training_memory(model_name: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Compare approximate memory requirements for different methods."""
    print(f"\n{'='*60}")
    print(f"Memory comparison for {model_name}")
    print(f"{'='*60}")

    # Full FT
    model_full = setup_full_finetuning(model_name)
    model_params_gb = get_model_memory_gb(model_full)
    print(f"\nFull fine-tuning:")
    print(f"  Model weights (BF16):      {model_params_gb:.2f} GB")
    print(f"  Gradients:                 {model_params_gb:.2f} GB")
    print(f"  Adam states (2x):          {model_params_gb * 2:.2f} GB")
    print(f"  Total estimate:            {model_params_gb * 4:.2f} GB")

    del model_full
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    # LoRA r=16
    model_lora = setup_lora(model_name, r=16)
    lora_trainable_gb = sum(
        p.numel() * p.element_size()
        for p in model_lora.parameters()
        if p.requires_grad
    ) / 1e9
    print(f"\nLoRA r=16:")
    print(f"  Model weights (BF16):      {model_params_gb:.2f} GB")
    print(f"  LoRA gradients:            {lora_trainable_gb:.4f} GB")
    print(f"  LoRA Adam states (2x):     {lora_trainable_gb * 2:.4f} GB")
    print(f"  Total estimate:            {model_params_gb + lora_trainable_gb * 3:.2f} GB")

    del model_lora
    print(f"\nLoRA r=16 memory savings vs full FT: "
          f"{model_params_gb * 4 / (model_params_gb + lora_trainable_gb * 3):.1f}x")


# Adapter-specific comparison
def lora_rank_comparison(model_name: str):
    """Show how the trainable parameter count grows with LoRA rank (quality is not measured here)."""
    print("\nLoRA Rank Comparison:")
    print(f"{'Rank':>6} | {'Trainable Params':>18} | {'% of Total':>12}")
    print("-" * 45)

    for r in [4, 8, 16, 32, 64, 128]:
        model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.bfloat16
        )
        lora_config = LoraConfig(
            r=r, lora_alpha=r,
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
            task_type=TaskType.CAUSAL_LM,
        )
        model = get_peft_model(model, lora_config)
        trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in model.parameters())
        print(f"{r:>6} | {trainable:>18,} | {100*trainable/total:>11.3f}%")
        del model

Production Engineering Notes

Choosing Between LoRA and Full FT for a Real Project

In practice, the decision is usually driven by constraints rather than pure quality optimization:

Start with LoRA - it is 10-20x faster to iterate on (smaller model footprint, faster experiment cycles)
Establish a quality baseline with LoRA r=16 or r=32
Only move to full fine-tuning if LoRA's quality is provably insufficient for your requirements, AND you have the budget

Many teams discover that LoRA r=32 is "good enough" and never need full fine-tuning. The teams that need full fine-tuning usually have one of: domain pretraining need, 1M+ examples, or extreme quality requirements.

The Cost of Being Wrong

If you choose full fine-tuning when LoRA would suffice:

Wasted GPU budget
Longer iteration cycles
Cannot easily maintain multiple adapters

If you choose LoRA when full fine-tuning is needed:

Slightly lower quality (typically 1-5% on benchmarks)
May need to retrain if quality bar is not met

The cost of under-investing in method choice (LoRA → Full FT) is usually smaller than the cost of over-investing (Full FT when LoRA suffices). Default to LoRA, escalate if needed.

Tip: The "LoRA r=64 first" heuristic

When in doubt, start with LoRA r=64 applied to all linear layers. This gives near-full-fine-tuning quality for most tasks while remaining far more memory-efficient. If r=64 quality is insufficient, that is strong evidence that you need full fine-tuning - and you have data to support the decision.

Common Mistakes

Danger: Confusing parameter count with quality

More parameters do not always mean better results. LoRA at r=16 often outperforms full fine-tuning on small datasets because the low-rank constraint acts as a regularizer - it prevents overfitting. On a 500-example dataset, full fine-tuning can memorize the training set; LoRA is forced to find a more generalizable solution. Always evaluate both methods on a held-out test set before concluding which is better.

Danger: Using prompt tuning for tasks that require significant adaptation

Prompt tuning prepends a few soft tokens to the input - the model weights are completely frozen. This means the model can only adapt by "priming" its context, not by actually changing its internal representations. Prompt tuning works only for large models (11B+) and simple format changes. For anything requiring meaningful domain adaptation or complex task learning, use LoRA at a minimum.

Warning: Not merging LoRA adapters before production deployment

Running a model with an unmerged LoRA adapter adds ~5% inference latency from the extra matrix multiplications. At 1 million requests per day with 200ms average latency, this is 10ms per request, or 10,000 seconds (about 2.78 GPU-hours) of extra compute per day. For high-throughput production systems, always merge the LoRA adapter before deployment. The merged model has the same architecture as the base model, so there is no inference overhead.
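
The overhead arithmetic above is worth re-running with your own traffic numbers. This is a back-of-the-envelope sketch: the function name is illustrative, and it assumes requests are served one at a time on a single GPU, so batching or multi-GPU serving would change the result.

```python
def unmerged_lora_overhead_gpu_hours(
    requests_per_day: int,
    avg_latency_s: float,
    overhead_fraction: float = 0.05,  # ~5% extra latency from unmerged adapters
) -> float:
    """Extra GPU-hours per day spent on the unmerged adapter's matmuls."""
    extra_seconds = requests_per_day * avg_latency_s * overhead_fraction
    return extra_seconds / 3600


wasted = unmerged_lora_overhead_gpu_hours(1_000_000, 0.200)
print(f"Extra GPU-hours per day: {wasted:.2f}")
```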

Warning: Using a single learning rate for both base model and LoRA

When doing full fine-tuning or LoRA with unfrozen embedding layers, different parts of the model benefit from different learning rates. Use layer-wise learning rate decay: later layers (closer to the output) get the full learning rate, earlier layers get a lower rate (multiplied by a decay factor, typically 0.9 per layer from the output). This preserves general representations in early layers while allowing task-specific adaptation in later layers.
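
Layer-wise learning rate decay reduces to a one-line formula. The sketch below computes the per-layer rates only; the function name and the default base rate are illustrative assumptions, and wiring the result into an optimizer (for example as one parameter group per layer) depends on how your model names its layers.

```python
def layerwise_learning_rates(
    num_layers: int, base_lr: float = 2e-5, decay: float = 0.9
) -> list[float]:
    """Learning rate per layer, index 0 = closest to the input.

    The last layer gets base_lr; each layer further from the output is
    multiplied by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


lrs = layerwise_learning_rates(num_layers=4, base_lr=1e-4, decay=0.9)
for i, lr in enumerate(lrs):
    print(f"layer {i}: lr={lr:.2e}")
```

With 0.9 decay over 32 layers the first layer ends up at roughly 4% of the base rate, so it is worth checking that the earliest layers still receive a usable learning rate on deep models.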

Interview Q&A

Q1: A company wants to fine-tune LLaMA-2-70B for their customer service chatbot. They have a 24GB GPU and 50,000 training examples. What would you recommend?

With a 24GB GPU and a 70B model, full fine-tuning is not possible (roughly 560GB at 8 bytes/param). Regular LoRA in BF16 needs about 140GB for the base weights alone - also not possible. QLoRA (NF4 base + BF16 LoRA) needs approximately 42GB, and the 4-bit weights alone take about 35GB, so it does not fit on a single 24GB card even with small batch sizes, gradient checkpointing, and short sequences. In practice, I would consider: (1) using a smaller base model (LLaMA-2-13B works well for customer service and fits QLoRA in about 18GB); (2) if 70B is required, renting an A100 80GB for the training run ($3-8 per hour) and then deploying the fine-tuned 70B, or merging and re-quantizing to a smaller format for serving. 50,000 examples is a solid dataset for LoRA r=16 to r=32.

Q2: What is the quality trade-off between LoRA and full fine-tuning on a reasoning task?

Reasoning tasks (math, logic, multi-step QA) generally show a larger quality gap between LoRA and full fine-tuning than simpler tasks. Studies on math benchmarks show full fine-tuning outperforming LoRA r=16 by 3-7%. The reason: reasoning requires the model to update how it "chains" representations across layers, not just change the final layer's output format. LoRA's low-rank constraint limits how much the model's internal computation paths can change. Mitigation: use higher rank (LoRA r=64 or r=128) for reasoning tasks, apply LoRA to all layers including MLP, and include chain-of-thought examples in the training data.

Q3: What is multi-task fine-tuning and when does it help?

Multi-task fine-tuning trains a model simultaneously on multiple tasks, mixing examples from all tasks in each training batch. It helps when: (1) tasks share useful representations (NER and relation extraction both benefit from entity-aware representations); (2) any individual task has too few examples to train a good model alone but tasks combined have sufficient data; (3) you need a single model that serves multiple use cases. The key challenge is task balancing - prevent high-data tasks from dominating. Use temperature-based sampling: sample task $i$ proportionally to $N_i^T$ where $T = 0.7$ (squashes the ratio between large and small tasks). Multi-task fine-tuning often slightly underperforms single-task fine-tuning for each individual task - the trade-off is generality vs peak single-task performance.

Q4: Explain the difference between adapter layers (Houlsby et al.) and LoRA.

Adapter layers (Houlsby et al., 2019) insert small bottleneck networks between transformer layers: a down-projection from hidden_dim to adapter_size, a nonlinearity, and an up-projection back to hidden_dim. These adapters are trainable; the base model is frozen. LoRA is different: instead of adding new layers, it decomposes the weight update of existing layers into low-rank matrices. The key practical difference: adapters add sequential computation (you must run the adapter network after each transformer layer), which adds inference latency and cannot be eliminated without architecture changes. LoRA's weight matrices can be merged into the base model weights before inference ( $W = W_0 + BA$ ), resulting in zero inference overhead. This is why LoRA has largely replaced adapter methods in practice.
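
The merge identity can be checked numerically on a toy example. The sketch below uses plain Python lists and made-up 2x3 weights so it runs without any dependencies; it omits the alpha/r scaling factor that LoRA implementations usually apply to BA, which would simply multiply the low-rank term by a constant.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy values (hypothetical): d_out=2, d_in=3, rank r=1
W0 = [[1.0, 2.0, 0.0],
      [0.0, 1.0, 3.0]]
B = [[0.5],
     [-1.0]]               # d_out x r
A = [[1.0, 0.0, 2.0]]      # r x d_in
x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Unmerged: base path plus a separate low-rank path (two extra matmuls)
unmerged = add(matmul(W0, x), matmul(B, matmul(A, x)))

# Merged: fold BA into the weight once, then a single matmul at inference
W_merged = add(W0, matmul(B, A))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths produce the same output, which is why merging removes the inference overhead without changing the model's behavior (up to floating-point rounding in real models).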

Q5: When would you use prompt tuning over LoRA?

Almost never in modern practice, but prompt tuning has specific advantages: (1) zero inference latency (prepended tokens are part of the input, not separate modules); (2) modular (different soft prompts for different tasks, all using the same frozen base model); (3) extremely few parameters (virtually no storage cost). Prompt tuning works reasonably well for models above 10B parameters and for simple style/format tasks. However, it consistently underperforms LoRA on almost every benchmark. The main use case is when you have a very large, very capable base model (GPT-4 scale), you want to adapt it for multiple tasks with essentially zero compute overhead, and your tasks are simple enough that a context "primer" is sufficient. For most engineering applications, LoRA is strictly better.

Real-World Cost Analysis

Fine-tuning costs on AWS (us-east-1, on-demand, 2025 approximate pricing):

Model	Method	GPU Required	Hours	Approx Cost
7B	Full FT (1K examples)	1x A100 80GB	2h	$25
7B	LoRA r=16 (10K examples)	1x A100 80GB	3h	$38
7B	QLoRA r=16 (10K examples)	1x A10G 24GB	5h	$22
13B	Full FT (10K examples)	2x A100 80GB	4h	$100
13B	LoRA r=16 (50K examples)	1x A100 80GB	8h	$100
70B	QLoRA r=64 (50K examples)	2x A100 80GB	24h	$600

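Note: these are rough estimates. Actual costs depend on sequence length, batch size, and gradient accumulation settings. The A100 rows work out to roughly $12.50 per GPU-hour, well above the $3-5/hour range quoted for A100s below, so recompute the dollar figures with your provider's current rate.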

Cost optimization strategies:

Use spot instances (60-90% cheaper, but can be interrupted - use frequent checkpointing)
Use smaller consumer-grade GPUs with QLoRA (RTX 4090 at $0.50-1.00/hour vs A100 at $3-5/hour)
Preprocess and cache all data before training (avoid CPU bottleneck during training)
Use sequence packing (no padding waste) - can reduce training time by 30-50%
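
A training-cost estimate is just GPUs x hours x hourly rate, optionally reduced by a spot discount. The sketch below is a minimal calculator under those assumptions; the function name is illustrative, and the $4.00/hour rate is a placeholder inside the range quoted above, not a price from any provider.

```python
def training_cost_usd(
    num_gpus: int,
    hours: float,
    hourly_rate_per_gpu: float,
    spot_discount: float = 0.0,  # e.g. 0.7 for a 70% spot discount
) -> float:
    """Rough training cost; ignores storage, data transfer, and restarts."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return num_gpus * hours * hourly_rate_per_gpu * (1.0 - spot_discount)


on_demand = training_cost_usd(num_gpus=2, hours=24, hourly_rate_per_gpu=4.00)
spot = training_cost_usd(num_gpus=2, hours=24, hourly_rate_per_gpu=4.00, spot_discount=0.7)
print(f"On-demand: ${on_demand:.2f}  Spot: ${spot:.2f}")
```

Spot runs can be interrupted, so budget some extra hours for restarts from the last checkpoint when using the discounted figure.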

Choosing the Right Evaluation Strategy

Different fine-tuning methods require different evaluation approaches:

"""
Comprehensive evaluation suite for fine-tuned models.
Tests both task-specific quality and general capability preservation.
"""

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import List, Dict


class FineTuningEvaluator:
    """
    Evaluates a fine-tuned model on:
    1. Task-specific performance
    2. General capability preservation (regression testing)
    3. Instruction following quality
    """

    def __init__(self, model_path: str, base_model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        self.ft_model = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        self.base_model = AutoModelForCausalLM.from_pretrained(
            base_model_path, torch_dtype=torch.bfloat16
        )

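    def compute_perplexity(self, model, texts: List[str]) -> float:
        """Compute token-weighted average perplexity on a list of texts."""
        import math
        total_loss = 0.0
        total_tokens = 0

        for text in texts:
            inputs = self.tokenizer(
                text, return_tensors="pt", truncation=True, max_length=512
            ).to(model.device)
            # The loss is averaged over the shifted targets, i.e. seq_len - 1 tokens
            num_predicted = inputs["input_ids"].shape[1] - 1
            if num_predicted < 1:
                continue  # a single-token text has nothing to predict
            with torch.no_grad():
                outputs = model(**inputs, labels=inputs["input_ids"])
            total_loss += outputs.loss.item() * num_predicted
            total_tokens += num_predicted

        if total_tokens == 0:
            raise ValueError("No text contained at least two tokens")
        avg_loss = total_loss / total_tokens
        return math.exp(avg_loss)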

    def evaluate_instruction_following(
        self,
        test_examples: List[Dict],
        judge_fn=None,
    ) -> Dict:
        """
        Evaluate instruction following quality.
        judge_fn: optional function to score responses (e.g., LLM-as-judge)
        """
        results = {
            "format_adherence": [],
            "completeness": [],
            "ft_vs_base_comparison": [],
        }

        for example in test_examples:
            prompt = example["prompt"]
            expected_format = example.get("expected_format", None)

            # Generate from fine-tuned model
            ft_response = self._generate(self.ft_model, prompt)
            base_response = self._generate(self.base_model, prompt)

            # Format check (if specified)
            if expected_format:
                format_ok = expected_format.lower() in ft_response.lower()
                results["format_adherence"].append(float(format_ok))

            # Use judge if available
            if judge_fn:
                ft_score = judge_fn(prompt, ft_response)
                base_score = judge_fn(prompt, base_response)
                results["ft_vs_base_comparison"].append(ft_score > base_score)

        return {
            "format_adherence_rate": (
                sum(results["format_adherence"]) / len(results["format_adherence"])
                if results["format_adherence"] else None
            ),
            "ft_beats_base_rate": (
                sum(results["ft_vs_base_comparison"]) / len(results["ft_vs_base_comparison"])
                if results["ft_vs_base_comparison"] else None
            ),
        }

    def check_catastrophic_forgetting(
        self,
        general_test_texts: List[str],
        threshold_ppl_increase: float = 0.10,
    ) -> bool:
        """
        Check if fine-tuning caused catastrophic forgetting.
        Returns True if forgetting detected (perplexity increased by more than threshold).
        """
        base_ppl = self.compute_perplexity(self.base_model, general_test_texts)
        ft_ppl = self.compute_perplexity(self.ft_model, general_test_texts)

        ppl_change = (ft_ppl - base_ppl) / base_ppl

        print(f"Base model perplexity: {base_ppl:.2f}")
        print(f"Fine-tuned perplexity: {ft_ppl:.2f}")
        print(f"Change: {ppl_change:+.1%}")

        if ppl_change > threshold_ppl_increase:
            print(f"WARNING: Catastrophic forgetting detected ({ppl_change:.1%} degradation)")
            return True

        print("No significant forgetting detected")
        return False

    def _generate(self, model, prompt: str, max_new_tokens: int = 256) -> str:
        # Move inputs to the model's device so this also works when the model is on GPU
        inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                top_p=0.9,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )

Note: The 2025 consensus on PEFT vs full fine-tuning

By 2025, LoRA and QLoRA have effectively won the practical fine-tuning landscape. The vast majority of fine-tuned open-source models use PEFT methods. Full fine-tuning is reserved for foundational use cases: domain pretraining, large-scale alignment training at frontier labs, and specialized models where every fraction of a percent of quality matters. For any application-layer fine-tuning - RAG-grounded models, chatbots, code assistants, domain Q&A - LoRA at r=16 to r=64 is the standard starting point.

Multi-Task Fine-Tuning

One significant advantage of full fine-tuning over PEFT is the ability to train a single model on multiple tasks simultaneously. With LoRA, you can train separate adapters per task, but a single merged model that excels at everything requires full fine-tuning.

from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader, ConcatDataset, WeightedRandomSampler
import torch

def create_multi_task_dataset(
    task_datasets: dict[str, list[dict]],   # {"task_name": [{instruction, response}]}
    task_weights: dict[str, float] | None = None,
) -> tuple:
    """
    Create a weighted multi-task dataset.
    Upsamples smaller datasets to prevent domination by large ones.
    """
    if task_weights is None:
        # Default: temperature-scaled sizes (N_i ** 0.7), so large tasks are damped
        # without letting tiny tasks dominate the mix
        sizes = {t: len(d) for t, d in task_datasets.items()}
        task_weights = {t: s ** 0.7 for t, s in sizes.items()}

    # Normalize weights
    weight_sum = sum(task_weights.values())
    task_weights = {t: w / weight_sum for t, w in task_weights.items()}

    all_examples = []
    sample_weights = []

    for task_name, dataset in task_datasets.items():
        for example in dataset:
            all_examples.append({**example, "task": task_name})
            sample_weights.append(task_weights[task_name] / len(dataset))

    # WeightedRandomSampler ensures each epoch sees roughly the desired task mix
    sampler = WeightedRandomSampler(
        weights=sample_weights,
        num_samples=max(len(d) for d in task_datasets.values()) * len(task_datasets),
        replacement=True,
    )

    print("Task dataset sizes:")
    for task, data in task_datasets.items():
        print(f"  {task}: {len(data)} examples (weight: {task_weights[task]:.3f})")

    return all_examples, sampler


def multi_task_fine_tune(
    model_name: str,
    task_datasets: dict[str, list[dict]],
    output_dir: str,
    num_epochs: int = 3,
):
    """Full fine-tuning across multiple tasks with balanced sampling."""
    from datasets import Dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # requires the flash-attn package
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    all_examples, sampler = create_multi_task_dataset(task_datasets)

    # Format all examples uniformly
    def format_example(example: dict) -> str:
        task = example.get("task", "general")
        return (
            f"<|system|>\nYou are a helpful assistant specialized in {task}.\n"
            f"<|user|>\n{example['instruction']}\n"
            f"<|assistant|>\n{example['response']}"
        )

    # Draw indices from the sampler so the balanced task mix actually reaches
    # the trainer; the mix is fixed once here rather than re-drawn each epoch.
    sampled_indices = list(sampler)
    formatted = Dataset.from_list(
        [{"text": format_example(all_examples[i])} for i in sampled_indices]
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        learning_rate=1e-5,     # Lower LR for multi-task to avoid overwriting
        bf16=True,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        logging_steps=50,
        save_strategy="epoch",
    )

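    # Note: passing tokenizer, max_seq_length, and dataset_text_field directly
    # matches older TRL releases; newer releases expect them via SFTConfig.
    trainer = SFTTrainer(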
        model=model,
        tokenizer=tokenizer,
        train_dataset=formatted,
        args=training_args,
        max_seq_length=2048,
        dataset_text_field="text",
    )
    trainer.train()
    trainer.save_model(output_dir)
    print(f"Multi-task model saved to {output_dir}")

Merging LoRA Adapters for Deployment

One underappreciated advantage of LoRA: you can train multiple specialized adapters and merge them with linear interpolation (model merging). This lets you combine the skills from different fine-tuning runs without retraining.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

def merge_lora_adapters(
    base_model_name: str,
    adapter_paths: list[str],
    adapter_weights: list[float] | None = None,
    output_dir: str = "./merged_model",
) -> str:
    """
    Merge multiple LoRA adapters into a single model using TIES-merging.
    Useful for combining specialized adapters (e.g., coding + instruction following).
    """
    if adapter_weights is None:
        adapter_weights = [1.0 / len(adapter_paths)] * len(adapter_paths)

    assert len(adapter_paths) == len(adapter_weights)
    assert abs(sum(adapter_weights) - 1.0) < 1e-6, "Weights must sum to 1.0"

    print(f"Merging {len(adapter_paths)} adapters...")

    # Load the base model once and snapshot its weights. merge_and_unload()
    # folds the adapter into the wrapped model in place, so without a snapshot
    # there is no clean copy left to diff against.
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        torch_dtype=torch.bfloat16,
    )
    base_state = {
        name: param.detach().clone()
        for name, param in base_model.named_parameters()
    }

    # Accumulate the weighted delta contributed by each adapter
    total_delta = {name: torch.zeros_like(p) for name, p in base_state.items()}
    for adapter_path, weight in zip(adapter_paths, adapter_weights):
        peft_model = PeftModel.from_pretrained(base_model, adapter_path)
        base_model = peft_model.merge_and_unload()

        with torch.no_grad():
            for name, param in base_model.named_parameters():
                # Delta from base = this adapter's contribution only
                total_delta[name] += (param - base_state[name]) * weight
                # Restore the clean base weights before the next adapter
                param.copy_(base_state[name])

    # Add the summed deltas to the base weights
    with torch.no_grad():
        for name, param in base_model.named_parameters():
            param.add_(total_delta[name])

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    base_model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Merged model saved to {output_dir}")
    return output_dir


# Example: merge coding adapter + instruction following adapter
result = merge_lora_adapters(
    base_model_name="meta-llama/Meta-Llama-3-8B",
    adapter_paths=["./lora-coding", "./lora-instruction"],
    adapter_weights=[0.6, 0.4],     # Weight coding skills more heavily
    output_dir="./merged-coder-assistant",
)

tip

Model merging without retraining (2024 trend)

Weight-space merging (linear interpolation, SLERP, TIES, DARE) has become a popular alternative to multi-task fine-tuning. Instead of training a single model on many tasks, you:

1. Fine-tune separate specialist models (or use community LoRA adapters from the Hugging Face Hub)
2. Merge their weights with one of the methods above

Results are often competitive with multi-task training, especially for tasks that don't conflict (coding + multilingual work well together; coding + creative writing conflict more). Tools: mergekit (open source, supports TIES, DARE, SLERP, Task Arithmetic).
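
To show how TIES differs from a plain weighted sum, here is a toy sketch of its three steps (trim, elect sign, disjoint merge) on flat Python lists. It is a simplified illustration, not mergekit's implementation: `ties_merge`, the `density` parameter name, and the example vectors are all made up for this sketch. Each "task vector" stands for fine-tuned weights minus base weights.

```python
def ties_merge(task_vectors: list[list[float]], density: float = 0.5) -> list[float]:
    """Toy TIES-merging over flat task vectors (fine-tuned minus base weights)."""
    n_params = len(task_vectors[0])
    keep = max(1, round(density * n_params))

    # 1) Trim: keep only the largest-magnitude entries of each task vector
    #    (entries tied with the threshold are all kept)
    trimmed = []
    for tv in task_vectors:
        threshold = sorted((abs(x) for x in tv), reverse=True)[keep - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in tv])

    merged = []
    for i in range(n_params):
        column = [tv[i] for tv in trimmed]
        # 2) Elect sign: the direction with the larger total magnitude wins
        total = sum(column)
        if total == 0:
            merged.append(0.0)
            continue
        sign = 1.0 if total > 0 else -1.0
        # 3) Disjoint merge: average only entries that agree with that sign
        agreeing = [x for x in column if x * sign > 0]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


# Two made-up task vectors that disagree on the first parameter
coding = [0.9, -0.1, 0.4, 0.05]
writing = [-0.6, 0.8, 0.5, -0.02]

linear = [(a + b) / 2 for a, b in zip(coding, writing)]
ties = ties_merge([coding, writing], density=0.5)
print("linear:", linear)
print("ties:  ", ties)
```

On the first parameter the linear average lets the two models partly cancel each other, while TIES keeps the update from the side with more total magnitude. The merged task vector is then added back to the base weights, usually with a scaling factor.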

Choosing Your Fine-Tuning Budget

Matching your compute budget to the right approach:

def recommend_fine_tuning_approach(
    model_size_b: float,       # Model size in billions of parameters
    dataset_size_k: int,       # Dataset size in thousands of examples
    available_gpus: int,       # Number of A100 80GB GPUs available
    quality_requirement: str,  # "research_best", "production", "prototype"
    latency_sensitive: bool,
) -> dict:
    """
    Recommend fine-tuning approach based on constraints.
    """
    vram_available = available_gpus * 80  # GB (A100 80GB)

    # Memory requirements (rough estimates)
    full_ft_vram = model_size_b * 16     # Full FT: ~16 bytes/param with optimizer
    lora_vram = model_size_b * 2.5      # LoRA: weights in BF16 + small optimizer state
    qlora_vram = model_size_b * 0.6     # QLoRA: 4-bit base + LoRA adapters

    # Check what fits, ordered from highest to lowest expected quality
    fit_methods = []
    if full_ft_vram <= vram_available:
        fit_methods.append("full_fine_tuning")
    if lora_vram <= vram_available:
        fit_methods.append("lora")
    if qlora_vram <= vram_available:
        fit_methods.append("qlora")
    if not fit_methods:
        raise ValueError(
            f"Even QLoRA needs ~{qlora_vram:.0f} GB; only {vram_available} GB available"
        )

    # Apply quality filter (QLoRA is always in fit_methods at this point)
    if quality_requirement == "research_best":
        primary = fit_methods[0]
        if primary == "full_fine_tuning":
            reason = "Maximum quality - all parameters updated"
        else:
            reason = "Full FT does not fit - using the best method that does"
    elif quality_requirement == "production":
        if dataset_size_k >= 100 and "full_fine_tuning" in fit_methods:
            primary = "full_fine_tuning"
            reason = "Large dataset + production quality = full FT worth the cost"
        elif "lora" in fit_methods:
            primary = "lora"
            reason = "LoRA r=32-64 gives 95-98% of full FT quality"
        else:
            primary = "qlora"
            reason = "Only QLoRA fits in the available VRAM"
    else:  # prototype
        primary = "qlora"
        reason = "Minimum VRAM, fastest iteration"

    # Latency override
    if latency_sensitive and primary == "qlora":
        print("WARNING: QLoRA inference requires dequantization - adds ~10-15% latency")
        print("Consider training with QLoRA then exporting merged BF16 weights")

    return {
        "primary_recommendation": primary,
        "reason": reason,
        "fits_in_vram": fit_methods,
        "vram_estimate": {
            "full_ft": f"{full_ft_vram:.0f} GB",
            "lora": f"{lora_vram:.0f} GB",
            "qlora": f"{qlora_vram:.0f} GB",
        },
        "alternative": fit_methods[0] if primary != fit_methods[0] else (
            fit_methods[1] if len(fit_methods) > 1 else None
        ),
    }


# Examples:
print(recommend_fine_tuning_approach(
    model_size_b=7,
    dataset_size_k=50,
    available_gpus=1,          # Single A100
    quality_requirement="production",
    latency_sensitive=True,
))
# → lora, r=32-64, single GPU

print(recommend_fine_tuning_approach(
    model_size_b=70,
    dataset_size_k=500,
    available_gpus=8,          # 8x A100 cluster
    quality_requirement="research_best",
    latency_sensitive=False,
))
# → lora: full FT needs ~1120 GB (14 GPUs by this estimate, with ZeRO-3),
#   more than the 640 GB available on 8 GPUs
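
The bytes-per-parameter multipliers used above can be derived from what each method keeps in GPU memory. The sketch below assumes mixed-precision training with Adam and ignores activations, so it is a lower bound. `estimate_vram_gb` and the default `trainable_fraction` of 0.5% are illustrative assumptions, not measured values.

```python
# Per-parameter memory for full fine-tuning with mixed-precision Adam
FULL_FT_BYTES = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}

def estimate_vram_gb(
    model_size_b: float,
    method: str,
    trainable_fraction: float = 0.005,  # assumed share of params in LoRA adapters
) -> float:
    """Rough VRAM for weights, gradients, and optimizer state (no activations)."""
    full = sum(FULL_FT_BYTES.values())  # 16 bytes per trained parameter
    if method == "full_fine_tuning":
        bytes_per_param = full
    elif method == "lora":
        # Frozen BF16 base + full training state for the adapters only
        bytes_per_param = 2 + trainable_fraction * full
    elif method == "qlora":
        # 4-bit base is ~0.5 bytes per parameter, plus the adapter state
        bytes_per_param = 0.5 + trainable_fraction * full
    else:
        raise ValueError(f"Unknown method: {method}")
    # Billions of params times bytes per param gives gigabytes
    return model_size_b * bytes_per_param

for method in ("full_fine_tuning", "lora", "qlora"):
    print(method, f"{estimate_vram_gb(7, method):.1f} GB")
```

Under these assumptions LoRA comes out near 2.1 bytes per parameter and QLoRA near 0.6, so the 2.5 multiplier in the function above leaves some headroom for LoRA. Activation memory grows with batch size and sequence length and has to be budgeted on top.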

Interview Q&A (Extended)

Q6: When does LoRA fail to match full fine-tuning, and what can you do about it?

LoRA can underperform full fine-tuning in three scenarios: (1) When the task requires the model to learn fundamentally new knowledge rather than adapting existing knowledge - e.g., training a general English model on specialized chemistry notation. LoRA's low-rank constraint limits how much new information can be encoded. Fix: increase rank (try r=128 or r=256) or use full fine-tuning. (2) When you need to modify the embeddings or the MLP blocks of the early layers - LoRA applied only to the attention projections adapts attention in every layer but leaves embeddings and MLP weights frozen. Fix: apply LoRA to embed_tokens and MLP layers too, not just attention. (3) When the rank is too low for the task's intrinsic dimensionality. Fix: use AdaLoRA (Lesson 07 extension), which allocates rank dynamically based on importance scores.
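
To make the rank and target-module fixes concrete, the sketch below counts LoRA trainable parameters as r · (d_in + d_out) per adapted matrix. The layer shapes are assumptions chosen to resemble an 8B Llama-style model (hidden size 4096, grouped-query attention, MLP width 14336, 32 layers). Check your model's config before relying on the numbers.

```python
# Assumed (d_in, d_out) shapes for the linear layers in one transformer block
ATTENTION = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
MLP = {
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

def lora_trainable_params(rank: int, modules: dict, num_layers: int = NUM_LAYERS) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in modules.values())
    return per_layer * num_layers

attn_only = lora_trainable_params(16, ATTENTION)
attn_mlp = lora_trainable_params(16, {**ATTENTION, **MLP})
high_rank = lora_trainable_params(128, {**ATTENTION, **MLP})

print(f"r=16, attention only:   {attn_only:,}")
print(f"r=16, attention + MLP:  {attn_mlp:,}")
print(f"r=128, attention + MLP: {high_rank:,}")
```

Adding the MLP projections roughly triples adapter capacity at the same rank, and parameter count scales linearly with rank, so both fixes trade memory for expressiveness.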

Q7: How do you evaluate whether fine-tuning actually helped versus just overfitting?

Three-part evaluation: (1) Hold-out eval on a task-specific benchmark - does the model score higher on the target task on data it never saw during training? (2) General capability retention - run MMLU, HellaSwag, or TruthfulQA before and after fine-tuning. A drop of more than 2–3 points is a strong sign of catastrophic forgetting. (3) Behavioral evaluation - sample 50–100 prompts from both the target domain and general domains; compare outputs side by side. Quantitative benchmarks can miss behavioral degradation that human evaluation catches. For production, always do all three. LoRA tends to forget less than full fine-tuning because the base weights stay frozen and the update is low-rank, but the effective weights still change, so run the retention check for both; full fine-tuning needs the closest monitoring.
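
The retention check in part (2) is easy to automate once you have before and after scores. The sketch below is a minimal helper, assuming you already ran the benchmarks with your own harness. `forgetting_report`, the 2-point threshold, and the scores are all made up for illustration.

```python
def forgetting_report(
    before: dict[str, float],
    after: dict[str, float],
    max_drop: float = 2.0,   # points of regression tolerated per benchmark
) -> dict[str, dict]:
    """Flag general-capability benchmarks that regressed after fine-tuning."""
    report = {}
    for bench, base_score in before.items():
        drop = base_score - after[bench]
        report[bench] = {
            "drop": round(drop, 2),
            "regressed": drop > max_drop,
        }
    return report


# Hypothetical scores, not real measurements
before = {"mmlu": 65.0, "hellaswag": 80.0}
after = {"mmlu": 64.2, "hellaswag": 76.5}

report = forgetting_report(before, after)
for bench, result in report.items():
    status = "REGRESSED" if result["regressed"] else "ok"
    print(f"{bench}: drop {result['drop']} points ({status})")
```

A flagged benchmark is a prompt to investigate, not proof of forgetting: rerun with a fixed seed and compare sample outputs before changing the training recipe.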

Key Takeaways

The full fine-tuning vs PEFT decision comes down to a simple question: how much of the model's behavior do you need to change, and what compute can you afford?

Full fine-tuning updates every parameter and can achieve the maximum possible adaptation - it is the right choice when training at frontier scale, when you are doing continued pretraining on a new domain corpus, or when the quality gap between LoRA and full FT is material for your application. It requires significantly more compute and careful management of catastrophic forgetting.

PEFT methods - especially LoRA and QLoRA - have won the practical fine-tuning landscape by demonstrating that you can get 95–98% of full fine-tuning quality at 10–50% of the cost. For the vast majority of applied fine-tuning tasks (domain Q&A, chatbots, code assistants, RAG-grounded models), LoRA r=16 to r=64 is the correct starting point.

The most important insight from the last three years of PEFT research: the gap between PEFT and full fine-tuning is closing, not growing. As rank selection, target module selection, and adapter architectures improve, the practical argument for full fine-tuning in applied settings becomes harder to make. Start with LoRA, measure the quality gap, and only escalate to full fine-tuning if the gap is real and material.

