Linear Regression - lecture notes and code examples.
Now that we have introduced somewhat more formally the learning problem and its notation lets us study a simple but instructive regression problem that is known in the statistics literature as shrinkage.Suppose that we are given the training set X={x(1),…,x(m)} together with their labels {y(1),…,y(m)}. We need to construct a model such that a suitably chosen loss function is minimized for a different set of input data, the so-called test set. The ability to correctly predict when observing the test set, is called generalization.Since the output y is a continuous variable, this supervised problem is a regression problem (otherwise it is a classification problem). To make it concrete, read the input x as normalized time over a single tidal cycle (x=0 at one high tide, x=1 at the next) and the output y as the water level relative to mean sea level, in meters. A tide gauge records the level at a sequence of times, and those readings are your training points. The water climbs to a high, falls to a low, and rises again: one smooth cycle. We draw the readings from sin(2πx)+ϵ, with measurement noise ϵ∼N(μ=0,σ=0.25). That underlying tidal curve is completely unknown to you; you see only the noisy gauge readings, and your hypothesis has to recover the smooth tide beneath them.
Let us now pick the hypothesis set that corresponds in general to the following sum,g(x;θ)=θ0+∑j=1M−1θjϕj(x)=θ⊤ϕ(x)where ϕj(x) are known as basis functions.Evaluating the basis at every training input and stacking the results row by row gives the design (featurization) matrixΦ, whose i-th row is ϕ(x(i))⊤. For the polynomial basis ϕj(x)=xj it is the Vandermonde matrixΦ=ϕ(x(1))⊤ϕ(x(2))⊤⋮ϕ(x(m))⊤=11⋮1x(1)x(2)⋮x(m)(x(1))2(x(2))2⋮(x(m))2⋯⋯⋱⋯(x(1))M(x(2))M⋮(x(m))M∈Rm×(M+1),so that all m predictions collapse into a single matrix product y^=Φθ. With this notation the ridge objective is ∥Φθ−y∥2+λ∥θ∥2, whose minimizer solves the normal equations (Φ⊤Φ+λI)θ=Φ⊤y. The fitting code below builds exactly this Φ with np.vander, then standardizes its columns for numerical stability.A set of Polynomial, Gaussian and Sigmoidal basis functions are plotted below.
x = np.linspace(-1, 1, 100)X_polynomial = PolynomialFeature(11).transform(x[:, None])X_gaussian = GaussianFeature(np.linspace(-1, 1, 11), 0.1).transform(x)X_sigmoidal = SigmoidalFeature(np.linspace(-1, 1, 11), 10).transform(x)plt.figure(figsize=(20, 5))for i, X in enumerate([X_polynomial, X_gaussian, X_sigmoidal]): plt.subplot(1, 3, i + 1) for j in range(12): plt.plot(x, X[:, j])txt = "Polynomial, Gaussian and Sigmoidal basis functions"plt.figtext(0.5, 0.01, txt, wrap=True, horizontalalignment="center", fontsize=12)
Text(0.5, 0.01, 'Polynomial, Gaussian and Sigmoidal basis functions')
Using the polynomial basis, we fit candidate hypotheses g(x;θ) of increasing order M to the training data. The figure below overlays the fits for M=0,1,2,3,9 on a single plot. Low-order models underfit, while M=9 wiggles through every training point.
Our job is to find θ such that the polynomial expansion above fits the data we are given - as we will see there are multiple hypothesis that can satisfy this requirement. Consistent with the block diagram we need to define a metric, an figure of merit or loss function in fact, that is also a common metric in regression problems of this nature. This is the Mean Squared Error (MSE) function, which here plays the role of the empirical risk on the training set:Remp(θ)=m1∑i=1m(g(x(i);θ)−y(i))2We can visualize this loss for a specific hypothesis. The green segments below are the displacements (residuals) between each training point and the order-M=2 hypothesis g; the MSE is the average of their squares.
The loss function chosen for this regression problem corresponds to the sum of the squares of the displacements of each data point and our hypothesis. The sum of squares in the case of Gaussian errors gives rise to an (unbiased) Maximum Likelihood estimate of the model parameters. Contrast this to the sum of absolute differences.Now our job has become to choose two things: the parameter vector θ∗andM the order of the polynomial. Both define our hypothesis. If you think about it, the order M defines the model complexity in the sense that the larger M becomes the more the number of parameters we need to estimate and store. Obviously this is a trivial example and storage is not a concern here but treat this example as instructive for that it applies in far for complicated settings.Sweeping the model order M and recording the MSE on the training set and on a held-out test set exposes the tension between fitting and generalization: the training error keeps decreasing with M, but the test error eventually rises, the signature of overfitting.
# held-out noisy test set drawn from the same distributionx_eval = np.linspace(0, 1, 100)y_eval = np.sin(2 * np.pi * x_eval) + np.random.RandomState(0).normal(scale=0.25, size=x_eval.size)orders = list(range(0, 10))train_mse = [np.mean((np.polyval(np.polyfit(x_train, y_train, M), x_train) - y_train) ** 2) for M in orders]test_mse = [np.mean((np.polyval(np.polyfit(x_train, y_train, M), x_eval) - y_eval) ** 2) for M in orders]plt.figure(figsize=[10, 8])plt.plot(orders, train_mse, "-o", label="training")plt.plot(orders, test_mse, "-o", label="test")plt.xlabel("$M$ (polynomial order)")plt.ylabel("MSE")plt.legend()plt.show()
Obviously you can reduce the training error to almost zero by selecting a model that is complicated enough (M=9) to perfectly fit the training data (if m is small).But this is not what you want to do. Because when met with test data, the model will perform far worse than a less complicated model that is closer to the true model (e.g. M=3). This is a central observation in statistical learning called overfitting. In addition, you may not have the time to iterate over M (very important in online learning settings).
M = 9coeffs = np.polyfit(x_train, y_train, M)[::-1] # theta_0 .. theta_Mplt.figure(figsize=[10, 8])plt.plot(range(M + 1), coeffs, "-o")plt.xlabel("coefficient index $j$")plt.ylabel(r"$\theta_j$")plt.title("Unregularized $M=9$ coefficients")plt.show()
To avoid overfitting we have multiple strategies. One straightforward one is evident by observing the wild oscillations of the θ elements as the model complexity increases. We can penalize such oscillations by introducing the ℓ2 norm of θ in our loss function.Remp(θ)=m1∑i=1m(g(x(i);θ)−y(i))2+λ∥θ∥2This type of solution is called regularization and because we effectively shrink the parameter dynamic range it is also called in statistics shrinkage or ridge regression. We have introduced a new parameter λ that regulates the relative importance of the penalty term as compared to the MSE. This parameter together with the polynomial order is what we call hyperparameters and we need to optimize them as both are needed for the determination of our final hypothesis g.The graph below show the results of each search iteration on the λ hyperparameter.Sweeping the regularization strength λ for the M=9 model and recording the MSE on the training set and on a held-out test set reproduces Bishop’s λ-optimization curve. At very small λ the model overfits (low training error, high test error); too large a λ underfits; the test error is minimized at an intermediate λ∗.
# fit M=9 with ridge on standardized polynomial features (numerically stable),# sweeping the penalty lambda and recording MSE on the training and a held-out test set.def std_ridge_predict(M, lam): P = np.vander(x_train, M + 1, increasing=True)[:, 1:] mu, sd = P.mean(0), P.std(0) + 1e-12 Z = (P - mu) / sd theta = np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ (y_train - y_train.mean())) return lambda xq: ((np.vander(xq, M + 1, increasing=True)[:, 1:] - mu) / sd) @ theta + y_train.mean()x_eval = np.linspace(0, 1, 100)y_eval = np.sin(2 * np.pi * x_eval) + np.random.RandomState(0).normal(scale=0.25, size=x_eval.size)lambdas = np.logspace(-7, 3, 30)train_mse, test_mse = [], []for lam in lambdas: g = std_ridge_predict(9, lam) train_mse.append(np.mean((g(x_train) - y_train) ** 2)) test_mse.append(np.mean((g(x_eval) - y_eval) ** 2))plt.figure(figsize=[10, 8])plt.plot(np.log(lambdas), train_mse, "-o", label="training")plt.plot(np.log(lambdas), test_mse, "-o", label="test")plt.xlabel(r"$\ln \lambda$")plt.ylabel("MSE")plt.legend()plt.show()
The table below lists the fitted coefficient coordinates for M=3,6,9, without regularization and with ridge regularization (λ=0.1). Without regularization the polynomial coefficients grow explosively with the model order, reaching the order of 106 at M=9 (the wild oscillation plotted above). Ridge penalizes the whole coefficient vector (typically excluding the intercept), so it reduces the overall coefficient norm; in this polynomial example the poorly constrained high-order behavior is what is most visibly suppressed. Individual coordinates, especially with correlated or standardized polynomial features, should not be read as each moving monotonically toward zero. Ridge also does not set individual high-order terms exactly to zero; that kind of sparsity is associated with L1 / Lasso.
coordinate
M=3
M=3 (ridge)
M=6
M=6 (ridge)
M=9
M=9 (ridge)
θ0
-0.100
-0.043
0.013
-0.043
0.020
-0.043
θ1
9.282
-0.026
4.566
0.523
340.17
0.477
θ2
-27.567
-1.224
-0.845
-1.150
-7.71e+03
-0.890
θ3
18.416
0.925
-34.211
-0.887
6.86e+04
-0.713
θ4
42.758
-0.213
-3.18e+05
-0.320
θ5
-14.294
0.469
8.50e+05
0.004
θ6
2.143
1.056
-1.37e+06
0.217
θ7
1.30e+06
0.332
θ8
-6.73e+05
0.372
θ9
1.47e+05
0.358
From shrinkage to pruning in real models. Ridge only shrinks the coefficients toward zero; it keeps all of them. In a large neural network the analogous, more aggressive step is pruning: the weights that regularization has driven close to zero are set to exactly zero, so they can be dropped to compress and speed up the model (this is closer in spirit to the L1 / Lasso sparsity mentioned above than to ridge). This is a standard step in production: NVIDIA’s TensorRT Model Optimizer prunes, imposes 2:4 structured sparsity (half the weights in every group zeroed), and quantizes (INT8 / FP8), and the TensorRT runtime then executes the resulting sparse, low-precision network on the GPU’s Sparse Tensor Cores. So “drop the coefficients close to zero” is not only a textbook idea, it is a core step in deploying real models.