## At a Glance
- **Definition:** A parameter-efficient fine-tuning method that freezes a pre-trained model's weights and injects trainable low-rank matrix pairs ($B$ and $A$) into each target layer.
- **Formula:** $h = W_0 x + B A x$
- **Range / output:** Returns a modified layer activation $h \in \mathbb{R}^d$; at deployment $BA$ is merged into $W_0$, adding zero inference latency.
- **Assumptions:** The weight update $\Delta W$ needed for a downstream task has low intrinsic rank.
- **Extra:** A member of the parameter-efficient fine-tuning (PEFT) family alongside adapters and prefix tuning; implemented in the Hugging Face `peft` library.
## Why It Exists
Full fine-tuning retrains every parameter in the model, which means storing the full optimizer state for every parameter. For GPT-3 at 175 billion parameters, that is roughly 1.2 TB of VRAM during training. The obvious fix, freezing the model and inserting small adapter modules between layers, avoids the storage problem but adds sequential computation that can inflate inference latency by 20–30% at the small batch sizes typical in production.
LoRA's insight is that you probably don't need to update a layer's full weight matrix $W$ at all.[^1] Empirically, the change $\Delta W$ that a downstream task requires typically has very low *intrinsic rank* (it lives in a tiny subspace of the full weight space). A low-rank matrix $\Delta W$ can be expressed as the product of two thin matrices $BA$, where the inner dimension shared by $B$ and $A$ is the rank $r$. So we can train only $B$ and $A$ and merge them back into $W_0$ at deploy time. The added inference latency is exactly zero, and the number of trainable parameters can drop by several orders of magnitude when $r$ is much smaller than either dimension ($d$ or $k$) of $W$.
## Demo!
<div class="ml-widget" data-algo="lora"></div>
**Tips:**
- Notice that for $\Delta W$ with a decay $\leq$ 0.3 one singular value captures over 90% of the norm!
- Try increasing the decay. As the singular value spectrum gets flatter you need a higher LoRA rank to capture the same % of the norm.
- Notice that full rank LoRA doubles the number of parameters needed (two weight matrices used rather than one).
## Formalization
The demo's captured-energy percentage comes from the singular value spectrum of $\Delta W$. Here is the full picture.
For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains its update to a low-rank factorisation:
$h = W_0 x + \Delta W\, x = W_0 x + B A x$
- **$W_0$** `[d × k]`: Frozen pre-trained weights. Receives no gradient during fine-tuning.
- **$A$** `[r × k]`: Trained from a random Gaussian initialisation $\mathcal{N}(0, \sigma^2)$.
- **$B$** `[d × r]`: Initialised to zero, so $\Delta W = BA = 0$ at the start of training — the model's outputs are unchanged on day one.
- **$r$**: The rank, typically 1–8 even when $d = 12{,}288$. Satisfies $r \ll \min(d, k)$.
- **Scaling:** $\Delta W x$ is multiplied by $\alpha / r$ before addition; fixing $\alpha$ to the first rank tried and leaving it untuned removes it as a free hyperparameter.
**Parameter count.** A single adapted layer has $r(d + k)$ trainable parameters instead of $dk$. At $r = 4$, $d = k = 4096$ that is $4 \times 8192 = 32{,}768$ versus $4096^2 \approx 16.8\text{M}$ — a compression of roughly $500\times$. For GPT-3 175B with $r = 4$ applied only to $W_q$ and $W_v$, the checkpoint shrinks from 350 GB to 35 MB.
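The single-layer arithmetic above can be checked in a few lines. This is a minimal sketch; the helper names `full_params` and `lora_params` are illustrative and do not come from any library.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for B (d x r) plus A (r x k)."""
    return r * (d + k)


d = k = 4096
r = 4
full = full_params(d, k)
lora = lora_params(d, k, r)
compression = full / lora

print(full)         # 16777216
print(lora)         # 32768
print(compression)  # 512.0
```

The exact ratio is $512\times$, which the text rounds to roughly $500\times$.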
**Why low rank works.** The update $\Delta W$ can be written via its singular value decomposition as $\Delta W = U \Sigma V^\top = \sum_{s=1}^{\min(d,k)} \sigma_s u_s v_s^\top$. The rank-$r$ truncation $\Delta W_r = \sum_{s=1}^r \sigma_s u_s v_s^\top$ minimises $\lVert \Delta W - \Delta W_r \rVert_F$ over all rank-$r$ matrices (Eckart-Young theorem). When the spectrum is spiky (a few large $\sigma_s$ and the rest near zero), $\Delta W_r$ captures nearly all of $\lVert \Delta W \rVert_F^2$ at small $r$. The original LoRA paper finds that the top singular directions learned for GPT-3 share more than 0.5 normalised subspace similarity between rank-8 and rank-64 runs, which helps explain why $r = 1$ suffices for many tasks.[^1]
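The truncation argument can be tried out numerically. The sketch below builds a synthetic $\Delta W$ and measures how much of $\lVert \Delta W \rVert_F^2$ a rank-$r$ truncation keeps. It assumes a geometric spectrum $\sigma_s = \text{decay}^{\,s-1}$, which is a guess at what the demo's decay slider means rather than its actual implementation; `truncate` and `captured_energy` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, decay = 8, 8, 0.3

# Synthetic update with an assumed geometrically decaying spectrum.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((k, k)))
sigma = decay ** np.arange(min(d, k))  # 1, 0.3, 0.09, ...
delta_W = U @ np.diag(sigma) @ V.T


def truncate(M: np.ndarray, r: int) -> np.ndarray:
    """Best rank-r approximation of M in Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(M, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r, :]


def captured_energy(M: np.ndarray, r: int) -> float:
    """Fraction of the squared Frobenius norm kept by the rank-r truncation."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s[:r] ** 2) / np.sum(s ** 2))


for r in (1, 2, 4):
    residual = np.linalg.norm(delta_W - truncate(delta_W, r))
    print(r, round(captured_energy(delta_W, r), 4), round(float(residual), 4))
```

The Frobenius norm of the residual equals the root of the sum of the discarded $\sigma_s^2$, so a flatter spectrum (decay closer to 1) leaves more energy behind at the same rank.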
## Worked Example
Take $d = k = 3$, $r = 1$. The full weight matrix has $3 \times 3 = 9$ parameters. The LoRA factorisation uses $B \in \mathbb{R}^{3 \times 1}$ and $A \in \mathbb{R}^{1 \times 3}$: just $3 + 3 = 6$ parameters total, and $BA$ is still a $3 \times 3$ matrix.
Let $B = [2,\ 0,\ 1]^\top$ and $A = [1,\ 3,\ 0]$. Then:
$BA = \begin{bmatrix}2\\0\\1\end{bmatrix}\begin{bmatrix}1 & 3 & 0\end{bmatrix} = \begin{bmatrix}2 & 6 & 0\\0 & 0 & 0\\1 & 3 & 0\end{bmatrix}$
This matrix has rank 1. Every row is a scalar multiple of $[1,\ 3,\ 0]$. It can express updates that push the model's response in *one* direction in output space ($b = [2,0,1]^\top$) for inputs aligned with $[1,3,0]$, and does nothing to inputs orthogonal to that direction.
For an input $x = [1,\ 0,\ 0]^\top$ and frozen output $W_0 x = [0.5,\ -1,\ 0.2]^\top$:
$h = W_0 x + BAx = \begin{bmatrix}0.5\\-1\\0.2\end{bmatrix} + \begin{bmatrix}2\\0\\1\end{bmatrix}\underbrace{[1,3,0]\begin{bmatrix}1\\0\\0\end{bmatrix}}_{=\,1} = \begin{bmatrix}2.5\\-1\\1.2\end{bmatrix}$
The update adds $b \cdot (a^\top x) = b \cdot 1$ to the output: it shifts the response along the single direction $b$ by an amount equal to the inner product of $x$ with $a = [1,3,0]^\top$. With $r = 2$ you could express updates along *two* independent directions simultaneously; with $r = d$ you recover full fine-tuning expressiveness.
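The numbers in this example can be reproduced with plain Python lists. This is a small sketch; `W0_x` is taken as given because the example never specifies $W_0$ itself.

```python
B = [2.0, 0.0, 1.0]      # column vector b, shape 3 x 1
A = [1.0, 3.0, 0.0]      # row vector a, shape 1 x 3
x = [1.0, 0.0, 0.0]
W0_x = [0.5, -1.0, 0.2]  # frozen output W0 @ x, taken as given

# Outer product: BA[i][j] = B[i] * A[j].
BA = [[b * a for a in A] for b in B]

# Cheap route: compute the scalar A @ x first, then scale B by it.
a_dot_x = sum(a * xi for a, xi in zip(A, x))
h = [w + b * a_dot_x for w, b in zip(W0_x, B)]

print(BA)  # [[2.0, 6.0, 0.0], [0.0, 0.0, 0.0], [1.0, 3.0, 0.0]]
print(h)   # [2.5, -1.0, 1.2]
```

Computing $A x$ before multiplying by $B$ never materialises the $3 \times 3$ matrix, which is how the low-rank branch is usually evaluated during training.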
## Building the Intuition (Deep Dive)
**Why does fine-tuning have low intrinsic rank?** Aghajanyan et al. showed that pre-trained language models occupy a surprisingly small region of their parameter space.[^2] The "intrinsic dimensionality" of the fine-tuning objective is far smaller than the number of parameters. A more recent and more general answer comes from the Universal Weight Subspace Hypothesis, which provides large-scale empirical evidence that trained neural networks tend to converge to shared low-dimensional subspaces in their weight space, with the geometry of that subspace determined primarily by architecture rather than data.[^3] If this holds, LoRA's rank-$r$ constraint is not just a compression trick. It is constraining $\Delta W$ to stay within the architecture's intrinsic low-dimensional subspace.
Assuming the base model has been trained long enough to have found weights near the architectural subspace, the directions in which $W_0$ needs to move for fine-tuning correspond to directions the pre-trained model already has latent sensitivity to but doesn't emphasise. This is stressed in the LoRA paper: $\Delta W$ amplifies directions that are *underrepresented* in the top singular directions of $W_0$. The update fills in gaps rather than reinforcing what the pre-trained model already does well.
**Initialisation matters.** $B$ is initialised to zero so $\Delta W = 0$ at the start of training. This means the fine-tuning trajectory starts from exactly the pre-trained model's behaviour, not a random perturbation of it. $A$ is initialised randomly (Gaussian) so the two matrices break symmetry from the start. If both were zero, no gradient would flow into either.
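The gradient claim can be checked by hand for a single layer. The sketch below assumes a squared-error loss $\tfrac{1}{2}\lVert W_0 x + BAx - y \rVert^2$ purely for illustration; the loss, the dimensions, and the helper name `lora_grads` are not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 4, 5, 2
W0 = rng.standard_normal((d, k))
x = rng.standard_normal(k)
y = rng.standard_normal(d)


def lora_grads(B, A):
    """Gradients of 0.5 * ||W0 x + B A x - y||^2 with respect to B and A."""
    err = W0 @ x + B @ (A @ x) - y
    grad_B = np.outer(err, A @ x)    # shape d x r
    grad_A = np.outer(B.T @ err, x)  # shape r x k
    return grad_B, grad_A


A_rand = rng.standard_normal((r, k))
B_zero = np.zeros((d, r))
A_zero = np.zeros((r, k))

# LoRA init (B = 0, A random): B receives a gradient on the first step,
# while A's gradient is zero until B has moved away from zero.
gB, gA = lora_grads(B_zero, A_rand)
print(np.abs(gB).max() > 0, np.all(gA == 0))

# Both zero: neither matrix receives any gradient, so training cannot start.
gB0, gA0 = lora_grads(B_zero, A_zero)
print(np.all(gB0 == 0), np.all(gA0 == 0))
```

Because the gradient for $B$ is proportional to $Ax$ and the gradient for $A$ is proportional to $B^\top$ times the error, at least one of the two factors has to start nonzero.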
**Where to apply LoRA.** In a Transformer, there are typically four attention weight matrices per layer ($W_q, W_k, W_v, W_o$) and two in the MLP. The paper finds that adapting $W_q$ and $W_v$ together consistently outperforms adapting only one type, and that spreading a fixed parameter budget across more matrices at lower rank beats concentrating it in one matrix at higher rank.[^1] Adapting all four attention matrices at $r = 2$ matches or beats adapting only $W_q$ at $r = 8$ with the same parameter count.
**No inference latency by construction.** Because $BA$ is the same shape as $W_0$, you can compute $W = W_0 + BA$ once before deployment and discard $B$ and $A$. The deployed model is identical in structure to the original; only the weights differ. This is what separates LoRA from adapter layers, which must be computed at every forward pass.
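A quick numerical check of the merge, as a sketch with random stand-ins for trained matrices and an assumed scale of $\alpha / r = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2
alpha = r  # assumed scaling, so alpha / r = 1

W0 = rng.standard_normal((d, k))
B = rng.standard_normal((d, r))  # stand-ins for trained LoRA matrices
A = rng.standard_normal((r, k))
x = rng.standard_normal(k)

# Training-time path: frozen branch plus low-rank branch.
h_unmerged = W0 @ x + (alpha / r) * (B @ (A @ x))

# Deployment path: fold the update into the weights once, then one matmul.
W_merged = W0 + (alpha / r) * (B @ A)
h_merged = W_merged @ x

print(W_merged.shape == W0.shape)         # True
print(np.allclose(h_unmerged, h_merged))  # True
```

The merged layer performs a single matrix-vector product of the same shape as the original, so the forward pass costs the same as the un-adapted model.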
**Switching tasks.** If you need to serve multiple fine-tuned variants, you can keep one copy of $W_0$ in VRAM and swap only the compact $BA$ pairs (typically 35 MB per task for GPT-3 versus 350 GB for a full checkpoint). This makes task-switching a lightweight operation rather than a model reload.
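One way to picture task switching is a single shared $W_0$ plus a lookup of per-task $(B, A)$ pairs. This is a toy sketch: the task names, the `adapters` dictionary, and the `forward` helper are hypothetical, and the matrices are random stand-ins rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 2
W0 = rng.standard_normal((d, k))  # single shared copy of the frozen weights

# Hypothetical per-task adapters; the task names are made up for illustration.
adapters = {
    "summarise": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
    "translate": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
}


def forward(x, task=None):
    """Apply the shared base layer, plus the named task's update if given."""
    h = W0 @ x
    if task is not None:
        B, A = adapters[task]
        h = h + B @ (A @ x)
    return h


x = rng.standard_normal(k)
h_sum = forward(x, "summarise")
h_tr = forward(x, "translate")

shared = W0.size
per_task = sum(m.size for m in adapters["summarise"])
print(shared, per_task)  # 4096 256
```

Keeping $B$ and $A$ unmerged like this costs an extra low-rank product per forward pass, but switching tasks only changes which small pair is looked up.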
## Failure Modes & Gotchas
- **Rank too low for the task:** When the downstream task genuinely requires updating many independent directions (e.g. adapting to a new language where the model has essentially no prior representation), a small $r$ underfits. The paper notes this is rare for standard NLP tasks, but the right $r$ is task-dependent.[^1]
- **Batching different tasks:** If you absorb $BA$ into $W_0$ for deployment, you cannot trivially batch inputs destined for different fine-tuned variants in a single forward pass. You either keep $B$ and $A$ separate (adding a small per-token overhead) or route tasks to separate model shards.
- **Which matrices to adapt:** Applying LoRA only to $W_q$ and missing $W_v$ (or vice versa) measurably hurts performance. The paper recommends adapting both, and the gain from also including $W_k$ and $W_o$ is smaller but nonzero.[^1] The paper itself freezes the MLP modules and adapts only attention weights, so it reports no results for MLP-only adaptation.
- **$\alpha$ and learning rate interaction:** The scaling factor $\alpha / r$ effectively rescales the learning rate for the LoRA matrices. Setting $\alpha = r$ (the recommended default) makes this scale-neutral relative to a standard learning rate, but changing $r$ without adjusting $\alpha$ silently changes the effective learning rate.
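The last gotcha is easy to see numerically. A minimal sketch, assuming $\alpha$ was fixed when the first rank tried was $r = 8$; `lora_scale` is an illustrative helper, not a library function.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to B A x before it is added to W0 x."""
    return alpha / r


alpha = 8  # assumed: fixed at the first rank tried, r = 8
for r in (4, 8, 16, 32):
    print(r, lora_scale(alpha, r))
# 4 2.0
# 8 1.0
# 16 0.5
# 32 0.25
```

Quadrupling the rank from 8 to 32 with $\alpha$ untouched shrinks the update's multiplier to a quarter, so a rank sweep with fixed $\alpha$ is also a sweep over the effective step size.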
## Implementation Details & Code
- **Framework equivalents:** Hugging Face `peft` library: `get_peft_model(model, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj","v_proj"]))`. In pure PyTorch, replace the target linear layer's `forward` with `W0(x) + B(A(x))` where `A` and `B` are `nn.Linear` layers with `bias=False`.
- **Complexity:** Training: $O(r(d + k))$ additional parameters per layer vs $O(dk)$ for full fine-tuning; forward pass is unchanged after merging. Merging $BA$ into $W_0$ costs one $O(drk)$ matrix multiply per adapted layer, done once before deployment.
- **Practical heuristics:** $r \in \{4, 8, 16\}$ covers the vast majority of use cases; start with $r = 8$, $\alpha = 16$. Apply to $W_q$ and $W_v$ at minimum; add $W_k$ and $W_o$ if compute allows. Learning rates in the range $1\text{e-}4$ to $3\text{e-}4$ work well with AdamW. Dropout of 0.05–0.1 on the LoRA matrices helps regularise on small datasets.
```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Drop-in replacement for nn.Linear with a frozen base weight + LoRA update."""

    def __init__(self, in_features, out_features, r=4, alpha=None, bias=True):
        super().__init__()
        self.r = r
        # alpha defaults to r, so the scale alpha / r defaults to 1.
        self.scale = (alpha if alpha is not None else r) / r
        # Frozen pre-trained weights. Random values stand in here; in practice
        # they are copied from the pre-trained layer.
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features), requires_grad=False
        )
        nn.init.kaiming_uniform_(self.weight, a=5**0.5)  # mimic Linear default
        # Bias stays trainable in this sketch; pass requires_grad=False to freeze it.
        self.bias_param = (
            nn.Parameter(torch.zeros(out_features)) if bias else None
        )
        # Trainable LoRA matrices: A ~ N(0,1), B = 0 so ΔW=0 at init.
        self.A = nn.Parameter(torch.randn(r, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x):
        base = nn.functional.linear(x, self.weight, self.bias_param)
        # Low-rank branch: project down with A, then back up with B.
        lora = nn.functional.linear(nn.functional.linear(x, self.A), self.B)
        return base + self.scale * lora

    @torch.no_grad()
    def merge(self):
        """Absorb BA into W0 for zero-latency deployment. Returns a plain Linear."""
        merged = nn.Linear(
            self.weight.shape[1], self.weight.shape[0],
            bias=self.bias_param is not None,
        )
        # Copy in place under no_grad so no autograd history leaks into the result.
        merged.weight.copy_(self.weight + self.scale * (self.B @ self.A))
        if self.bias_param is not None:
            merged.bias.copy_(self.bias_param)
        return merged
```
---
## References
[^1]: Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021) 'LoRA: Low-Rank Adaptation of Large Language Models', *arXiv preprint*. Available at: https://arxiv.org/abs/2106.09685
[^2]: Aghajanyan, A., Zettlemoyer, L., and Gupta, S. (2020) 'Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning', *arXiv preprint*. Available at: https://arxiv.org/abs/2012.13255
[^3]: Kaushik, P., Chaudhari, S., Vaidya, A., Chellappa, R., and Yuille, A. (2025) 'The Universal Weight Subspace Hypothesis', *arXiv preprint*. Available at: https://arxiv.org/abs/2512.05117
<!-- ==========================================================================
WIDGET TECHNICAL SPECIFICATION (hidden from published note)
============================================================================
### SPECIFICATION: lora
1. STATE SCHEMA:
{ seed: 0x4C52, DIM: 8,
U: Array[8][8] orthonormal (fixed from seed via gramSchmidt),
V: Array[8][8] orthonormal (fixed from seed via gramSchmidt),
r: 1 (int, 1–8), decay: 0.30 (float, 0.10–1.00),
rReadout: label ref, decayReadout: label ref }
2. MATH ENGINE ADDITIONS (publish.js Layer 2):
NEW: MLMath.gramSchmidt(n, rng) — modified Gram-Schmidt on Gaussian draws.
REUSE: makeRng, gaussian, dot.
3. INTERACTION MAPPING:
No canvas pointer interaction. Two sliders only.
Slider 1: LoRA rank (r), min=1 max=8 step=1 value=1
Slider 2: Spectrum decay, min=0.10 max=1.00 step=0.05 value=0.30
4. RENDER LOOP:
onFrame: omitted (static widget).
draw(ctx, W, H):
Top 58%: three 8×8 heatmaps — ΔW target | B·A rank-r | Residual.
Bottom 32%: 8 singular-value bars, first r accent-coloured.
Status: LoRA param count vs full, compression ratio.
=========================================================================== -->