PEFT — LoRA, QLoRA & Adapters

Training 0.1% of parameters — efficient fine-tuning

Advanced · 30 min

Prerequisites

  • pretraining-finetuning

What if I told you that you could customize a 70-billion-parameter model, making it an expert in your specific domain, using a single GPU? And that you'd only need to train about 0.1% of the parameters? A few years ago this sounded impossible. Then LoRA came along and changed the economics of AI completely.

Full fine-tuning means updating every single parameter in the model. For a 70B model, that means 70 billion numbers, each needing a gradient stored in memory, each being nudged by the optimizer. The hardware requirements are staggering: multiple high-end GPUs, weeks of compute, and a bill that can easily exceed $50,000. Most teams simply cannot afford it.

Parameter-efficient fine-tuning (PEFT) methods solve this by asking a simple but powerful question: do we really need to change all 70 billion parameters? The answer turns out to be no — not even close. Techniques like LoRA, QLoRA, and adapters let you customize massive models by training less than 1% of the parameters, achieving quality that is often indistinguishable from full fine-tuning. This is the lesson where fine-tuning goes from “only Big Tech can do this” to “I can do this on my gaming PC.”

The fine-tuning cost problem

To understand why PEFT methods matter, you need to feel the pain of full fine-tuning. When you fine-tune a model, training has to keep not just the model weights in memory, but also the gradients and the optimizer states (Adam's momentum and variance) for every parameter. For a 70B model in FP16, the weights alone take ~140 GB. Gradients add another ~140 GB, and the Adam states come on top of that, so the total is well over 300 GB. With the FP32 optimizer states and master weights that mixed-precision training normally keeps (about 16 bytes per parameter in all), it passes 1 TB. That is at least four A100 80GB GPUs just to hold everything in memory, before you even start training.
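The arithmetic behind these figures can be sketched in a few lines. The bytes-per-parameter values below are assumptions about a typical setup (FP16 weights and gradients, FP32 master weights and Adam states), not measurements; 8-bit optimizers, sharding, or offloading would change them, and activations are ignored entirely.

```python
def training_memory_gb(n_params, weight_bytes, grad_bytes, optimizer_bytes):
    """Estimate training memory in GB (1 GB = 1e9 bytes).

    Counts weights, gradients, and optimizer state only. Activations and
    framework overhead are ignored, so real usage is higher.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / 1e9


N_PARAMS = 70e9  # 70B-parameter model

# (weight bytes, gradient bytes, optimizer bytes) per parameter: assumed values
scenarios = {
    "FP16 weights only": (2, 0, 0),
    "+ FP16 gradients": (2, 2, 0),
    # FP32 master copy + Adam momentum + Adam variance = 4 + 4 + 4 bytes
    "+ FP32 master weights and Adam states": (2, 2, 12),
}

estimates = {
    name: training_memory_gb(N_PARAMS, *per_param)
    for name, per_param in scenarios.items()
}

for name, gb in estimates.items():
    print(f"{name:<40} {gb:>7,.0f} GB")
```

Whatever accounting you choose, the total scales linearly with the number of trainable parameters, which is exactly the term PEFT methods shrink.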

But here is the key insight that makes PEFT possible: researchers found that the weight updates during fine-tuning tend to have low intrinsic rank. In plain English, this means the changes you make to a model when you fine-tune it live in a much smaller space than the full weight matrix. You don't need to update all 70 billion parameters because most of the “directions” in weight space are irrelevant to your specific task. The model already knows most of what it needs; you just need to steer it slightly.

Full fine-tuning vs PEFT

Full Model (70B), all parameters → Full fine-tune, 70B params updated → Expensive: 4x A100, weeks, $50K+

The PEFT alternative

Full Model (70B), frozen weights → LoRA / QLoRA, 0.1% trained → Small adapter, ~50-200 MB → Affordable: 1x GPU, hours, ~$100

The economic difference is not incremental; it is orders of magnitude. PEFT does not just make fine-tuning cheaper. It makes it accessible. A graduate student with a single GPU can now do what previously required a well-funded lab. A startup can customize a model for its domain without raising a funding round to pay the GPU bill. This democratization is one of the most important developments in applied AI.

Three ways to fine-tune efficiently

PEFT is a family of techniques, not a single method. They all share the same goal — customize a model without touching most of the parameters — but they approach it differently. LoRA is by far the most popular, QLoRA pushes the memory savings even further, and adapter methods offer an alternative architecture.

The Core Insight: Low-Rank Updates

When you fine-tune a large model on a specific task, the weight changes are low-rank. Think of it this way: a 70B model has learned an enormous amount of general knowledge. Fine-tuning for, say, medical question answering does not require rewriting all that knowledge. It requires a relatively small adjustment — a nudge in a specific direction in the vast space of possible weight configurations.

// The math behind the insight:
W_finetuned = W_pretrained + delta_W
// delta_W has low intrinsic rank!
delta_W ≈ A × B   // where A is d×r, B is r×d, and r << d

This idea was supported empirically in the 2021 paper “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning” by Aghajanyan et al. They found that a pretrained model can reach 90% of its full fine-tuning performance while being optimized in a subspace that is orders of magnitude smaller than the full parameter space. LoRA exploits this directly by constraining the weight update to a low-rank form.
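To make the decomposition concrete, here is a minimal pure-Python sketch using the same shapes as the formula above (A is d×r, B is r×d). It checks that applying the update through the two thin matrices gives the same output as building the full d×d update first. The toy sizes and random values are illustrative assumptions, and details of real LoRA implementations, such as initializing one factor to zero and scaling the update, are left out.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, r = 6, 2  # toy sizes; real models have d in the thousands and r around 8-16

W = rand_matrix(d, d, rng)  # frozen pretrained weights, d x d
A = rand_matrix(d, r, rng)  # trainable factor, d x r
B = rand_matrix(r, d, rng)  # trainable factor, r x d
x = rand_matrix(1, d, rng)  # a single input row vector

# Path 1: materialize delta_W = A x B and add it to W.
y_full = matmul(x, add(W, matmul(A, B)))

# Path 2: never build the d x d update; route x through the thin matrices.
y_lora = add(matmul(x, W), matmul(matmul(x, A), B))

max_diff = max(abs(a - b) for a, b in zip(y_full[0], y_lora[0]))
print(f"max difference between the two paths: {max_diff:.2e}")
print(f"trainable numbers: {d * r + r * d} (A and B) vs {d * d} (full delta_W)")
```

The second path is what makes the method cheap: only A and B receive gradients, and W is only ever read.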

Compare PEFT methods interactively

Use this interactive visualizer to compare full fine-tuning, LoRA, and QLoRA side by side. Toggle between the three methods to see how many parameters each one trains. Adjust the LoRA rank slider to explore the tradeoff between efficiency and expressiveness. Pay attention to the dramatic difference in GPU memory requirements.

LoRA: freeze the base model and add small trainable low-rank matrices beside each weight matrix.

Visualizer readout for LoRA at rank r = 8 (70B model, single task):

  • Model parameters: 12×12 = 144 cells shown, frozen weights vs LoRA adapters (~0.1%)
  • Trainable params: 0.10% (70M of 70B)
  • GPU memory: roughly 150 GB, since the frozen FP16 weights alone take ~140 GB (2x A100 80GB)
  • Training time: ~8 hours
  • Estimated cost: $200-500 (cloud GPU pricing)
  • Rank r = 8: moderate number of trainable parameters, good quality for most tasks, fast training

Most practitioners use rank 8 or 16. Rank 4 works well for simple domain adaptation.
LoRA Weight Decomposition

W (frozen, d × d) + A (trained, d × r) × B (trained, r × d) = W + AB (adapted, d × d)

With r = 8: instead of training d × d = d² parameters, you train 2 × d × 8 = 16d parameters.
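The parameter count for a single weight matrix can be checked with a short calculation. The hidden size d = 8192 is an assumption chosen to be in the range of a 70B-class model, and the function name is hypothetical. A real model applies this to many matrices of different shapes, so the overall trainable fraction will differ from the per-matrix figure.

```python
def lora_param_counts(d, r):
    """Return (full, lora) trainable-parameter counts for one d x d weight matrix."""
    full = d * d      # training the whole update matrix directly
    lora = 2 * d * r  # A is d x r, B is r x d
    return full, lora


d = 8192  # assumed hidden size for illustration
for r in (4, 8, 16, 64):
    full, lora = lora_param_counts(d, r)
    print(f"r={r:>2}: {lora:>9,} trainable vs {full:,} full ({lora / full:.3%})")
```

Doubling the rank doubles the trainable parameters, so the rank setting trades adapter capacity against memory and adapter file size in a linear, predictable way.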
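GPU Memory Required (70B Model)

  • Full fine-tuning: 300+ GB
  • LoRA: roughly 150 GB (about 140 GB of frozen FP16 weights, plus adapters, their optimizer states, and activations)
  • QLoRA: roughly 40-48 GB (about 35 GB of 4-bit weights, plus the same overhead)

QLoRA needs roughly 3-4x less memory than LoRA, and several times less again than full fine-tuning.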

When to Use Each Method

Full Fine-tuning
  • Maximum quality needed
  • Unlimited GPU budget
  • Major domain shift required
  • Training from a small base model
LoRA
  • Best quality-to-cost ratio
  • Access to datacenter GPUs
  • Need to swap adapters at inference
  • Most production use cases
QLoRA
  • Consumer GPU (24-48 GB)
  • Tight budget or experimentation
  • Fine-tuning very large models
  • Research and prototyping

AI connection: LoRA and QLoRA are how most open-source model customizations happen today. Many of the community fine-tunes of open base models such as Llama that you find on model hubs are built by training LoRA adapters on top of the base model. A single base model can serve hundreds of different tasks by hot-swapping small adapter files, typically tens to a few hundred megabytes each, at inference time.
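Hot-swapping can be pictured as one shared base weight plus a dictionary of named (A, B) pairs. The class below is a hypothetical toy in pure Python, not the API of any real library, and the task names and tiny integer matrices are made up so the results can be checked by hand.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


class LoRALinear:
    """Toy linear layer: one frozen base weight, many swappable adapters."""

    def __init__(self, W):
        self.W = W          # frozen d x d base weight, shared by every task
        self.adapters = {}  # task name -> (A, B)
        self.active = None  # name of the adapter currently in use, if any

    def add_adapter(self, name, A, B):
        self.adapters[name] = (A, B)

    def set_adapter(self, name):
        self.active = name  # swapping is a lookup; W is never modified

    def forward(self, x):
        y = matmul(x, self.W)
        if self.active is not None:
            A, B = self.adapters[self.active]
            y = add(y, matmul(matmul(x, A), B))
        return y

    def merged_weight(self):
        """Fold the active adapter into a copy of W (W + AB)."""
        A, B = self.adapters[self.active]
        return add(self.W, matmul(A, B))


layer = LoRALinear(W=[[1, 0], [0, 1]])
layer.add_adapter("legal", A=[[1], [0]], B=[[0, 2]])
layer.add_adapter("medical", A=[[0], [1]], B=[[3, 0]])

x = [[1, 1]]
base_out = layer.forward(x)      # [[1, 1]]

layer.set_adapter("legal")
legal_out = layer.forward(x)     # [[1, 3]]

layer.set_adapter("medical")
medical_out = layer.forward(x)   # [[4, 1]]

# Folding the adapter into the weight gives the same output with a single matmul.
merged_out = matmul(x, layer.merged_weight())
print(base_out, legal_out, medical_out, merged_out)
```

Because the base weight is shared, adding a task costs only the storage for its A and B matrices.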

Key Takeaways

  • Full fine-tuning updates all parameters and requires enormous GPU resources. For a 70B model, expect 300+ GB of memory and costs exceeding $50,000. Most teams cannot afford this.
  • LoRA freezes the base model and adds small trainable low-rank matrices (A and B) beside each weight matrix. This reduces trainable parameters to ~0.1% while maintaining quality comparable to full fine-tuning.
  • QLoRA goes further by quantizing the base model to 4-bit before adding LoRA adapters, reducing memory requirements by another 3-4x. A model in the 65-70B range becomes trainable on a single 48 GB GPU.
  • The key insight enabling all PEFT methods is that weight updates during fine-tuning are low-rank — the changes needed to specialize a model live in a much smaller space than the full parameter count suggests.
  • LoRA adapters can be merged back into the base model for zero-overhead inference, or kept separate and hot-swapped to serve multiple tasks from a single base model. This makes PEFT practical for production deployment.
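The 4-bit quantization step from the QLoRA takeaway can be illustrated with a simplified absmax scheme: each block of weights is stored as small integers plus one scale. This is only a sketch of the general idea under stated assumptions. QLoRA itself uses a 4-bit NormalFloat data type with additional refinements, and the function names and example values here are hypothetical.

```python
def quantize_block(values, levels=7):
    """Absmax-quantize floats to integer codes in [-levels, levels].

    With levels=7 every code fits in a signed 4-bit integer.
    """
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate float weights for use in the forward pass."""
    return [c * scale for c in codes]


weights = [0.7, -0.32, 0.11, 0.0]  # made-up block of frozen base weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - q) for w, q in zip(weights, restored))

print("codes:", codes)
print("restored:", [round(q, 3) for q in restored])
print(f"max rounding error: {max_error:.3f} (bounded by scale / 2 = {scale / 2:.3f})")
print(f"storage per weight: 4 bits instead of 16, a {16 // 4}x reduction before scale overhead")
```

The quantized weights stay frozen, so the rounding error is fixed. Only the LoRA matrices, kept in higher precision, are trained on top of them.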

Common Misconceptions

  • "LoRA makes models dumber." It does not. Multiple studies have shown that LoRA quality is often comparable to full fine-tuning, especially at rank 16 or higher. The weight updates during fine-tuning tend to be low-rank, so LoRA is not throwing away important information; it is exploiting structure that was already there.

Quick check

Why does LoRA work so well despite training less than 1% of parameters?