LoRA -- Low Rank Adaptation

## At a Glance

- **Definition:** A parameter-efficient fine-tuning method that freezes a pre-trained model's weights and injects trainable low-rank matrix pairs ($B$ and $A$) into each target layer.
- **Formula:** $h = W_0 x + B A x$
- **Range / output:** Returns a modified layer activation $h \in \mathbb{R}^d$; at deployment $BA$ is merged into $W_0$, adding zero inference latency.
- **Assumptions:** The weight update $\Delta W$ needed for a downstream task has low intrinsic rank.
- **Extra:** A member of the parameter-efficient fine-tuning (PEFT) family alongside adapters and prefix tuning, and available in the Hugging Face `peft` library.

## Why It Exists

Full fine-tuning retrains every parameter in the model, which means storing optimizer state for every parameter. For GPT-3 at 175 billion parameters that's roughly 1.2 TB of VRAM. The obvious fix, freezing the model and inserting small adapter modules between layers, avoids the storage problem but adds sequential computation that can inflate inference latency by 20–30% at the small batch sizes typical in production.

LoRA's insight is that you probably don't need to update a layer's full weight matrix $W$ at all.[^1] Empirically, the change $\Delta W$ that a downstream task requires typically has very low *intrinsic rank* (it lives in a tiny subspace of the full weight space). A low-rank matrix $\Delta W$ can be expressed as the product of two thin matrices $BA$, where the inner dimension shared by $B$ and $A$ is the rank $r$. So we can train only $B$ and $A$, and merge them back into $W_0$ at deploy time. The latency penalty is exactly zero, and the number of trainable parameters can drop by several orders of magnitude if $r$ is much smaller than either dimension ($d$ or $k$) of $W$.

## Demo!

<div class="ml-widget" data-algo="lora"></div>

**Tips:**

- Notice that for $\Delta W$ with a decay $\leq$ 0.3 one singular value captures over 90% of the norm!
- Try increasing the decay.
As the singular value spectrum gets flatter you need a higher LoRA rank to capture the same % of the norm.
- Notice that full-rank LoRA doubles the number of parameters needed (two full-size matrices rather than one when $d = k$).

## Formalization

The demo's captured-energy percentage comes from the singular value spectrum of $\Delta W$. Here is the full picture.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains its update to a low-rank factorisation:

$h = W_0 x + \Delta W\, x = W_0 x + B A x$

- **$W_0$** `[d × k]`: Frozen pre-trained weights. Receives no gradient during fine-tuning.
- **$A$** `[r × k]`: Trained from a random Gaussian initialisation $\mathcal{N}(0, \sigma^2)$.
- **$B$** `[d × r]`: Initialised to zero, so $\Delta W = BA = 0$ at the start of training.
- **$r$**: The rank, typically 1–8 even when $d = 12{,}288$. Satisfies $r \ll \min(d, k)$.
- **Scaling:** $\Delta W x$ is multiplied by $\alpha / r$ before addition; setting $\alpha$ to the first $r$ tried and then leaving it fixed removes it as a free hyperparameter.

**Parameter count.** A single adapted matrix has $r(d + k)$ trainable parameters instead of $dk$. At $r = 4$, $d = k = 4096$ that is $4 \times 8192 = 32{,}768$ versus $4096^2 \approx 16.8\text{M}$ (a compression of roughly $500\times$). For GPT-3 175B with $r = 4$ applied only to $W_q$ and $W_v$, the checkpoint shrinks from 350 GB to 35 MB.

**Why low rank works.** The update $\Delta W$ can be written via its singular value decomposition as $\Delta W = U \Sigma V^\top = \sum_{s=1}^{\min(d,k)} \sigma_s u_s v_s^\top$. The rank-$r$ truncation $\Delta W_r = \sum_{s=1}^r \sigma_s u_s v_s^\top$ minimises $\lVert \Delta W - \Delta W_r \rVert_F$ over all rank-$r$ matrices (Eckart-Young theorem). When the spectrum is spiky (a few large $\sigma_s$ and the rest near zero), $\Delta W_r$ captures nearly all of $\lVert \Delta W \rVert_F^2$ at small $r$.
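The Eckart-Young statement above can be checked numerically. The sketch below is illustrative only: it builds a synthetic $\Delta W$ whose singular values decay geometrically, then measures how much of $\lVert \Delta W \rVert_F^2$ the best rank-$r$ approximation keeps. The helper names `captured_energy` and `synthetic_update`, and the `decay` parameter, are assumptions made for this example; they are not taken from the demo widget or from any library.

```python
import numpy as np


def captured_energy(delta_w: np.ndarray, r: int) -> float:
    """Fraction of the squared Frobenius norm kept by the best rank-r approximation."""
    sigma = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    return float(np.sum(sigma[:r] ** 2) / np.sum(sigma ** 2))


def synthetic_update(d: int, k: int, decay: float, seed: int = 0) -> np.ndarray:
    """Hypothetical delta-W whose s-th singular value is decay**s (s = 0, 1, ...)."""
    rng = np.random.default_rng(seed)
    n = min(d, k)
    # Orthonormal bases from reduced QR factorisations of Gaussian matrices.
    u, _ = np.linalg.qr(rng.standard_normal((d, n)))
    v, _ = np.linalg.qr(rng.standard_normal((k, n)))
    sigma = decay ** np.arange(n)
    return (u * sigma) @ v.T  # U diag(sigma) V^T


if __name__ == "__main__":
    for decay in (0.3, 0.9):
        dw = synthetic_update(64, 48, decay)
        kept = [round(captured_energy(dw, r), 4) for r in (1, 2, 4, 8)]
        print(f"decay={decay}: energy kept at r=1,2,4,8 -> {kept}")
```

For a long geometric spectrum the rank-1 share of the squared norm is approximately $1 - \text{decay}^2$, so a decay of 0.3 keeps about 91% at $r = 1$, while a flatter spectrum (decay 0.9) needs a much larger $r$ to reach the same share.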
The original LoRA paper finds that the top singular directions learned for GPT-3 share more than 0.5 normalised subspace similarity between rank-8 and rank-64 runs, which it offers as an explanation for why $r = 1$ already performs well on many tasks.

## Worked Example

Take $d = k = 3$, $r = 1$. The full weight matrix has $3 \times 3 = 9$ parameters. The LoRA factorisation uses $B \in \mathbb{R}^{3 \times 1}$ and $A \in \mathbb{R}^{1 \times 3}$: just $3 + 3 = 6$ parameters total, and $BA$ is still a $3 \times 3$ matrix.

Let $B = [2,\ 0,\ 1]^\top$ and $A = [1,\ 3,\ 0]$. Then:

$BA = \begin{bmatrix}2\\0\\1\end{bmatrix}\begin{bmatrix}1 & 3 & 0\end{bmatrix} = \begin{bmatrix}2 & 6 & 0\\0 & 0 & 0\\1 & 3 & 0\end{bmatrix}$

This matrix has rank 1: every row is a scalar multiple of $[1,\ 3,\ 0]$. Write $b = [2,0,1]^\top$ for the single column of $B$ and $a = [1,3,0]^\top$ for the single row of $A$ written as a column vector. The update can push the model's response in *one* direction in output space ($b$) for inputs that have a component along $a$, and does nothing to inputs orthogonal to $a$.

For an input $x = [1,\ 0,\ 0]^\top$ and frozen output $W_0 x = [0.5,\ -1,\ 0.2]^\top$:

$h = W_0 x + BAx = \begin{bmatrix}0.5\\-1\\0.2\end{bmatrix} + \begin{bmatrix}2\\0\\1\end{bmatrix}\underbrace{[1,3,0]\begin{bmatrix}1\\0\\0\end{bmatrix}}_{=\,1} = \begin{bmatrix}2.5\\-1\\1.2\end{bmatrix}$

The update adds $b \cdot 1$ to the output. It shifts the response along the single direction $b$ by an amount equal to the inner product $a^\top x$. With $r = 2$ you could express updates along *two* independent directions simultaneously; with $r = d$ you recover full fine-tuning expressiveness.

## Building the Intuition (Deep Dive)

**Why does fine-tuning have low intrinsic rank?** Aghajanyan et al. showed that pre-trained language models can be fine-tuned effectively within a surprisingly low-dimensional subspace of their parameter space.[^2] The "intrinsic dimensionality" of the fine-tuning objective is far smaller than the number of parameters.
A more recent and more general answer comes from the Universal Weight Subspace Hypothesis, which reports large-scale empirical evidence that trained neural networks tend to converge to shared low-dimensional subspaces in their weight space, with the geometry of that subspace determined primarily by architecture rather than data.[^3] If this holds, LoRA's rank-$r$ constraint is not just a compression trick: it constrains $\Delta W$ to stay within the architecture's intrinsic low-dimensional subspace.

Assuming the base model has been trained long enough to have found weights near the architectural subspace, the directions in which $W_0$ needs to move for fine-tuning correspond to directions the pre-trained model already has latent sensitivity to but doesn't emphasise. This is stressed in the LoRA paper: $\Delta W$ amplifies directions that are *underrepresented* in the top singular directions of $W_0$. The update fills in gaps rather than reinforcing what the pre-trained model already does well.

**Initialisation matters.** $B$ is initialised to zero so $\Delta W = 0$ at the start of training. This means the fine-tuning trajectory starts from exactly the pre-trained model's behaviour, not a random perturbation of it. $A$ is initialised randomly (Gaussian) so that $B$ receives a nonzero gradient on the first step. If both were zero, no gradient would flow into either, because each matrix's gradient is proportional to the other.

**Where to apply LoRA.** In a Transformer, there are typically four attention weight matrices per layer ($W_q, W_k, W_v, W_o$) and two in the MLP. The paper finds that adapting $W_q$ and $W_v$ together consistently outperforms adapting only one type, and that spreading a fixed parameter budget across more matrices at lower rank beats concentrating it in one matrix at higher rank.[^1] Adapting all four attention matrices at $r = 2$ matches or beats adapting only $W_q$ at $r = 8$ with the same parameter count.
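The "same parameter count" comparison above is plain arithmetic on $r(d + k)$, and can be sketched in a few lines. The model shape used here (`D_MODEL = 12_288`, `N_LAYERS = 96`) is an assumption based on the published GPT-3 175B configuration, the 2 bytes per parameter assumes FP16 storage, and the function names are hypothetical helpers for this example.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted d x k matrix: B is d x r, A is r x k."""
    return r * (d + k)


def budget(d_model: int, n_layers: int, r: int, n_matrices: int) -> int:
    """Total trainable parameters when n_matrices square (d_model x d_model)
    attention projections per layer are adapted at rank r."""
    return n_layers * n_matrices * lora_params(d_model, d_model, r)


if __name__ == "__main__":
    D_MODEL, N_LAYERS = 12_288, 96  # assumed GPT-3 175B shape
    configs = {
        "W_q only, r=8": budget(D_MODEL, N_LAYERS, r=8, n_matrices=1),
        "W_q + W_v, r=4": budget(D_MODEL, N_LAYERS, r=4, n_matrices=2),
        "W_q, W_k, W_v, W_o, r=2": budget(D_MODEL, N_LAYERS, r=2, n_matrices=4),
    }
    for name, n in configs.items():
        # 2 bytes per parameter assumes FP16 storage.
        print(f"{name}: {n:,} params, ~{n * 2 / 1e6:.1f} MB in FP16")
```

Under these assumptions all three configurations come to 18,874,368 trainable parameters (about 37.7 MB in FP16), which is the same order of magnitude as the 35 MB checkpoint figure quoted from the paper. The budget is identical; only how it is spread across matrices differs.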
**No inference latency by construction.** Because $BA$ is the same shape as $W_0$, you can compute $W = W_0 + BA$ once before deployment and discard $B$ and $A$. The deployed model is identical in structure to the original; only the weights differ. This is what separates LoRA from adapter layers, which must be computed at every forward pass.

**Switching tasks.** If you need to serve multiple fine-tuned variants, you can keep one copy of $W_0$ in VRAM and swap only the compact $B$, $A$ pairs (roughly 35 MB per task for GPT-3 versus 350 GB for a full checkpoint). This makes task-switching a lightweight operation rather than a model reload.

## Failure Modes & Gotchas

- **Rank too low for the task:** When the downstream task genuinely requires updating many independent directions (e.g., adapting to a new language where the model has essentially no prior representation), a small $r$ underfits. The paper notes this is rare for standard NLP tasks, but the right $r$ is task-dependent.[^1]
- **Batching different tasks:** If you absorb $BA$ into $W_0$ for deployment, you cannot trivially batch inputs destined for different fine-tuned variants in a single forward pass. You either keep $B$ and $A$ separate (adding a small per-token overhead) or route tasks to separate model shards.
- **Which matrices to adapt:** Applying LoRA only to $W_q$ and missing $W_v$ (or vice versa) measurably hurts performance. The paper recommends adapting both, and the gain from also including $W_k$ and $W_o$ is smaller but nonzero.[^1] The paper itself freezes the MLP modules and leaves adapting them to future work, so it provides no evidence on MLP-only adaptation.
- **$\alpha$ and learning rate interaction:** The scaling factor $\alpha / r$ effectively rescales the learning rate for the LoRA matrices. Setting $\alpha = r$ (the paper's suggested starting point) gives a scale of exactly 1, but changing $r$ afterwards without adjusting $\alpha$ silently changes the scale of the update, and with it the effective learning rate.
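The batching gotcha above has a simple unmerged workaround: share one $W_0$ product across the batch and gather a different $(B, A)$ pair for each row. The NumPy sketch below is one way to do this under stated assumptions; `batched_lora_forward` and its argument layout are hypothetical and are not the API of any serving framework.

```python
import numpy as np


def batched_lora_forward(x, w0, a_stack, b_stack, task_ids, scale=1.0):
    """Apply a different (B, A) pair to each row of x while sharing one W0.

    x:        (n, k) inputs, one row per request
    w0:       (d, k) frozen base weight
    a_stack:  (T, r, k) A matrices for T tasks
    b_stack:  (T, d, r) B matrices for T tasks
    task_ids: (n,) integer task index for each row
    """
    base = x @ w0.T                      # (n, d), shared by every task
    a = a_stack[task_ids]                # (n, r, k), gathered per row
    b = b_stack[task_ids]                # (n, d, r)
    ax = np.einsum("nrk,nk->nr", a, x)   # project each row down to rank r
    bax = np.einsum("ndr,nr->nd", b, ax) # project back up to d
    return base + scale * bax


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k, r, n_tasks = 5, 6, 4, 2, 3
    x = rng.standard_normal((n, k))
    w0 = rng.standard_normal((d, k))
    a_stack = rng.standard_normal((n_tasks, r, k))
    b_stack = rng.standard_normal((n_tasks, d, r))
    task_ids = np.array([0, 2, 1, 2, 0])

    mixed = batched_lora_forward(x, w0, a_stack, b_stack, task_ids)

    # Reference: merge each task's BA into W0 and process the rows one at a time.
    for i, t in enumerate(task_ids):
        merged = w0 + b_stack[t] @ a_stack[t]
        assert np.allclose(mixed[i], merged @ x[i])
    print("mixed-task batch matches per-task merged weights")
```

The extra work per row is on the order of $r(d + k)$ multiply-adds, which is small next to the $dk$ of the base product when $r \ll \min(d, k)$, but it is no longer zero as it is after merging.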
## Implementation Details & Code

- **Framework equivalents:** Hugging Face `peft` library: `get_peft_model(model, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj","v_proj"]))`. In pure PyTorch, replace the target linear layer's `forward` with `W0(x) + B(A(x))` where `A` and `B` are `nn.Linear` layers with `bias=False`.
- **Complexity:** Training: $O(r(d + k))$ additional parameters per adapted matrix vs $O(dk)$ for full fine-tuning; forward pass is unchanged after merging. Merging $BA$ into $W_0$ costs one $O(drk)$ matrix multiply per adapted matrix, done once before deployment.
- **Practical heuristics:** $r \in \{4, 8, 16\}$ covers the vast majority of use cases; start with $r = 8$, $\alpha = 16$ (a scale of 2). Apply to $W_q$ and $W_v$ at minimum; add $W_k$ and $W_o$ if compute allows. Learning rates in the range $1\text{e-}4$ to $3\text{e-}4$ work well with AdamW. Dropout of 0.05–0.1 on the input to the LoRA path helps regularise on small datasets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Drop-in replacement for nn.Linear with a frozen base weight + LoRA update."""

    def __init__(self, in_features, out_features, r=4, alpha=None, bias=True):
        super().__init__()
        self.r = r
        # Scale is alpha / r; alpha defaults to r, giving a scale of 1.
        self.scale = (alpha if alpha is not None else r) / r

        # Frozen pre-trained weights (load the real checkpoint values here).
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features), requires_grad=False
        )
        nn.init.kaiming_uniform_(self.weight, a=5**0.5)  # mimic Linear default
        # The base bias is frozen too; only A and B receive gradients.
        self.bias_param = (
            nn.Parameter(torch.zeros(out_features), requires_grad=False)
            if bias
            else None
        )

        # Trainable LoRA matrices: A ~ N(0, 1), B = 0 so ΔW = 0 at init.
        self.A = nn.Parameter(torch.randn(r, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x):
        base = F.linear(x, self.weight, self.bias_param)
        lora = F.linear(F.linear(x, self.A), self.B)  # row-wise B A x
        return base + self.scale * lora

    @torch.no_grad()
    def merge(self):
        """Absorb BA into W0 for zero-latency deployment. Returns a plain Linear."""
        out_features, in_features = self.weight.shape
        merged = nn.Linear(
            in_features,
            out_features,
            bias=self.bias_param is not None,
            device=self.weight.device,
            dtype=self.weight.dtype,
        )
        # Copy values so the merged layer shares no storage with this module.
        merged.weight.copy_(self.weight + self.scale * (self.B @ self.A))
        if self.bias_param is not None:
            merged.bias.copy_(self.bias_param)
        return merged
```

---

## References

[^1]: Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021) 'LoRA: Low-Rank Adaptation of Large Language Models', *arXiv preprint*. Available at: https://arxiv.org/abs/2106.09685

[^2]: Aghajanyan, A., Zettlemoyer, L., and Gupta, S. (2020) 'Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning', *arXiv preprint*. Available at: https://arxiv.org/abs/2012.13255

[^3]: Kaushik, P., Chaudhari, S., Vaidya, A., Chellappa, R., and Yuille, A. (2025) 'The Universal Weight Subspace Hypothesis', *arXiv preprint*. Available at: https://arxiv.org/abs/2512.05117