Before reading this section, ensure you are familiar with the Linear Algebra Annex. After reading this section, consider doing this assignment.
Large linear operators appear throughout modern machine learning systems. Yet empirical evidence shows that meaningful changes to such systems often occupy only a small number of directions. This tutorial develops that idea from first principles, using Gaussian distributions to isolate the geometry of low-rank change before connecting it to modern architectures like LoRA (Low-Rank Adaptation).

Part I: Covariance and Geometry

Covariance as shape

Let $x \in \mathbb{R}^d$ be a zero-mean random vector with covariance $\Sigma = \mathbb{E}[x x^\top]$. Eigenvectors of $\Sigma$ define orthogonal directions in space, and eigenvalues measure the variance along those directions. Geometrically, $\Sigma$ describes the shape of a probability ellipsoid.
Internalize that eigenvalues are not abstract quantities: they directly correspond to observable spread in the data.
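As a concrete check, here is a minimal numpy sketch (the 2-D covariance values are illustrative, not from the text): the variance of the data projected onto each eigenvector matches the corresponding eigenvalue.

import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])                    # illustrative 2-D covariance
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)

eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))    # empirical covariance of the samples
for lam, v in zip(eigvals, eigvecs.T):
    spread = np.var(X @ v)                        # observed spread along this eigenvector
    print(f"eigenvalue {lam:.3f}  vs  projected variance {spread:.3f}")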

Latent variable construction

Instead of specifying $\Sigma$ directly, we construct it from a lower-dimensional latent variable: $z \sim \mathcal{N}(0, I_k)$, $x = W z + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_d)$. This implies $\Sigma = W W^\top + \sigma^2 I_d$. The matrix $W$ determines the signal subspace; the noise term fills the remaining directions isotropically.
This is the same construction used in classical factor analysis and probabilistic PCA.
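A small sketch of this construction, with illustrative sizes $d = 10$, $k = 3$ and noise level $\sigma = 0.5$ (these values are assumptions, not from the text):

import numpy as np

rng = np.random.default_rng(1)
d, k, sigma = 10, 3, 0.5                          # illustrative sizes, not from the text

W = rng.standard_normal((d, k))
Sigma = W @ W.T + sigma**2 * np.eye(d)            # implied covariance

# Sample x = W z + eps and compare the empirical covariance to Sigma
n = 200_000
Z = rng.standard_normal((n, k))
X = Z @ W.T + sigma * rng.standard_normal((n, d))
print(np.abs(X.T @ X / n - Sigma).max())          # small sampling error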

Rank as a geometric constraint

Because $\mathrm{rank}(W W^\top) \le k$, only $k$ directions can carry variance above the noise level. This explains why many high-dimensional datasets exhibit the following (illustrated in the sketch after this list):
  • Rapidly decaying eigenvalue spectra
  • Effective low dimensionality
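The spectrum of $\Sigma = W W^\top + \sigma^2 I_d$ shows this directly; a sketch, reusing the assumed values $d = 10$, $k = 3$, $\sigma = 0.5$:

import numpy as np

rng = np.random.default_rng(2)
d, k, sigma = 10, 3, 0.5                          # assumed illustrative values

W = rng.standard_normal((d, k))
Sigma = W @ W.T + sigma**2 * np.eye(d)

eigvals = np.linalg.eigvalsh(Sigma)[::-1]         # descending order
print(eigvals)
# Exactly k eigenvalues exceed sigma^2; the remaining d - k sit at the noise floor sigma^2
print((eigvals > sigma**2 + 1e-9).sum())          # -> 3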

Part II: Structured Coefficient Changes

A reference distribution

Fix a reference matrix $W_0$ with covariance $\Sigma_0 = W_0 W_0^\top + \sigma^2 I_d$. This defines a baseline Gaussian distribution. All subsequent changes are measured relative to this geometry.

Low-rank coefficient modifications

We now restrict changes to the form $\Delta W = B A$, where:
  • $B$ is a $d \times r$ matrix
  • $A$ is an $r \times k$ matrix
  • $r \ll \min(d,k)$
Only $r$ new directions can be introduced.
Low-rank structure does not limit how much the matrix changes, but where it can change.
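A minimal sketch of such a factorized change (the dimensions are assumed for illustration):

import numpy as np

rng = np.random.default_rng(3)
d, k, r = 10, 8, 2                                # assumed illustrative sizes, r << min(d, k)

B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
Delta_W = B @ A                                   # a full d × k update, but of rank at most r

print(np.linalg.matrix_rank(Delta_W))             # -> 2
# The column space of Delta_W lies inside the r-dimensional span of B's columns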

Diffuse coefficient changes

For comparison, consider dense changes with no preferred directions, scaled to match the Frobenius norm of the low-rank case. This contrast isolates the role of structure from magnitude.
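One way to build such a matched diffuse change (a sketch, with the same assumed sizes as before; the scaling simply equates Frobenius norms):

import numpy as np

rng = np.random.default_rng(4)
d, k, r = 10, 8, 2                                # same assumed sizes as before

Delta_lowrank = rng.standard_normal((d, r)) @ rng.standard_normal((r, k))
Delta_dense = rng.standard_normal((d, k))         # no preferred directions

# Rescale the dense change to match the Frobenius norm of the low-rank change
Delta_dense *= np.linalg.norm(Delta_lowrank) / np.linalg.norm(Delta_dense)
print(np.linalg.norm(Delta_lowrank), np.linalg.norm(Delta_dense))                # equal magnitudes
print(np.linalg.matrix_rank(Delta_lowrank), np.linalg.matrix_rank(Delta_dense))  # 2 vs min(d, k)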

Observable consequences

Empirically, one finds:
  • Low-rank changes alter a small number of eigenvalues dramatically
  • Diffuse changes alter many eigenvalues modestly
This establishes the central principle:
Rank limits the number of variance directions that can be modified.
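A sketch of this comparison (all sizes and the noise level are assumptions; the point is the pattern of eigenvalue shifts, not the exact numbers):

import numpy as np

rng = np.random.default_rng(5)
d, k, r, sigma = 10, 8, 2, 0.5                    # assumed illustrative values

W0 = rng.standard_normal((d, k))

def spectrum(W):
    # eigenvalues of W W^T + sigma^2 I, largest first
    return np.sort(np.linalg.eigvalsh(W @ W.T + sigma**2 * np.eye(d)))[::-1]

Delta_lowrank = rng.standard_normal((d, r)) @ rng.standard_normal((r, k))
Delta_dense = rng.standard_normal((d, k))
Delta_dense *= np.linalg.norm(Delta_lowrank) / np.linalg.norm(Delta_dense)

base = spectrum(W0)
print(np.round(spectrum(W0 + Delta_lowrank) - base, 2))  # shifts concentrated in a few directions
print(np.round(spectrum(W0 + Delta_dense) - base, 2))    # shifts spread across many directions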

Part III: Likelihood and Statistical Geometry

Empirical covariance

Given samples $\{x_i\}_{i=1}^n$, the empirical covariance $S = \frac{1}{n} \sum_{i=1}^n x_i x_i^\top$ is a sufficient statistic for zero-mean Gaussian models.
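In code, with samples stacked as rows of an $n \times d$ array, this is a one-liner (a trivial sketch):

import numpy as np

def empirical_covariance(X):
    # X: n × d array of zero-mean samples, one per row
    n = X.shape[0]
    return X.T @ X / n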

Gaussian likelihood geometry

The Gaussian negative log-likelihood, up to an additive constant, is $\mathcal{L}(\Sigma) = \frac{1}{2} \left[ \log \det \Sigma + \mathrm{tr}\!\left(\Sigma^{-1} S\right) \right]$. Geometric interpretation:
  • $\log \det \Sigma$ penalizes volume mismatch
  • $\mathrm{tr}(\Sigma^{-1} S)$ penalizes directional mismatch
You may recognize this as a Riemannian geometry on the cone of positive definite matrices.
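A direct transcription of this objective (a sketch; the Cholesky factorization for the log-determinant is a standard numerical choice, not something prescribed by the text):

import numpy as np

def gaussian_nll(Sigma, S):
    # 0.5 * [log det Sigma + tr(Sigma^{-1} S)], dropping the additive constant
    L = np.linalg.cholesky(Sigma)
    logdet = 2.0 * np.log(np.diag(L)).sum()
    trace_term = np.trace(np.linalg.solve(Sigma, S))
    return 0.5 * (logdet + trace_term)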

Rank-constrained covariance matching

Expanding $\Sigma - \Sigma_0 = W_0 \Delta W^\top + \Delta W W_0^\top + \Delta W \Delta W^\top = W_0 \Delta W^\top + \Delta W (W_0 + \Delta W)^\top$ exhibits the difference as a sum of two matrices of rank at most $r$, so $\mathrm{rank}(\Sigma - \Sigma_0) \le 2r$. Thus a rank-$r$ coefficient change can only modify a limited number of eigen-directions, regardless of dimensionality. This fact explains the likelihood saturation observed in experiments.
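A numerical check of this bound (with the same assumed sizes as the earlier sketches):

import numpy as np

rng = np.random.default_rng(6)
d, k, r, sigma = 10, 8, 2, 0.5                    # assumed illustrative values

W0 = rng.standard_normal((d, k))
Delta_W = rng.standard_normal((d, r)) @ rng.standard_normal((r, k))

Sigma0 = W0 @ W0.T + sigma**2 * np.eye(d)
Sigma = (W0 + Delta_W) @ (W0 + Delta_W).T + sigma**2 * np.eye(d)

print(np.linalg.matrix_rank(Sigma - Sigma0))      # at most 2r = 4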

Part IV: Connection to LoRA in Transformers

LoRA (Low-Rank Adaptation) applies the same structural assumption to large linear operators:
  • A reference matrix is fixed (the pretrained weights)
  • Adaptation is constrained to a low-rank subspace
  • Rank controls expressive capacity
Transformers obscure this geometry with nonlinearities and attention mechanisms. The Gaussian setting exposes it directly.
import numpy as np

# LoRA structure in practice
# Original: y = Wx
# LoRA:     y = Wx + BAx  where B ∈ R^{d×r}, A ∈ R^{r×k}

class LoRALayer:
    def __init__(self, W, rank):
        self.W = W                                   # frozen pretrained weights, d × k
        self.B = np.zeros((W.shape[0], rank))        # zero-initialized so BA = 0 at the start
        self.A = np.random.randn(rank, W.shape[1])   # random low-rank input projection

    def forward(self, x):
        # base output plus the low-rank correction; only A and B are trained
        return self.W @ x + self.B @ (self.A @ x)
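A quick usage check, continuing from the numpy-based sketch above (the sizes are arbitrary):

d, k, r = 8, 6, 2
W = np.random.randn(d, k)                        # stand-in for pretrained weights
layer = LoRALayer(W, rank=r)
x = np.random.randn(k)
print(np.allclose(layer.forward(x), W @ x))      # True: BA = 0 at initialization, so outputs match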

Final Takeaway

Low-rank adaptation is not an optimization trick. It is a geometric assumption about how complex systems change.
  • When that assumption holds, low-rank methods are statistically efficient
  • When it does not, no algorithm can avoid higher-dimensional modification
Key references: (McInnes et al., 2018; Neumann et al., 2017; Dauphin et al., 2014; Pascal et al., 2013; Sun et al., 2016)

References

  • Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., et al. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
  • McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.
  • Neumann, D., Wiese, T., Utschick, W. (2017). Learning the MMSE Channel Estimator.
  • Pascal, F., Bombrun, L., Tourneret, J., Berthoumieu, Y. (2013). Parameter Estimation For Multivariate Generalized Gaussian Distributions.
  • Sun, B., Feng, J., Saenko, K. (2016). Correlation Alignment for Unsupervised Domain Adaptation.