The dataset
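You use the same training set as the closed-form page: ten points of the target function on its input interval, corrupted by Gaussian noise with the same standard deviation used there. Holding the data fixed is what lets the closed-form value of $\lambda$, written $\lambda^\star$ here, carry over unchanged.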
Standardized polynomial features
The model is a degree-9 polynomial, $f(x; \mathbf{w}) = \sum_{j=1}^{9} w_j\,\phi_j(x)$, where $\phi_j(x)$ is the standardized monomial $x^j$. Raw monomials $x, x^2, \dots, x^9$ span many orders of magnitude, which makes the gradient steps lopsided and the penalty act unevenly across coordinates. Standardizing each feature to zero mean and unit variance puts every coordinate on the same footing, so a single learning rate and a single $\lambda$ are meaningful, and the closed-form optimum $\lambda^\star$ transfers directly from the closed-form page, which uses the same standardized features. The intercept is absorbed by centering the targets at their mean $\bar{y}$.
The regularized objective and the SGD update
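The objective in this section is built on the standardized features described above. The following is a minimal sketch of that preprocessing, not the page's own code: it assumes NumPy and evenly spaced placeholder inputs on [0, 1], and the helper name `standardized_poly_features` is hypothetical.

```python
import numpy as np


def standardized_poly_features(x, degree=9, mean=None, std=None):
    """Return standardized monomials x, x^2, ..., x^degree plus the statistics used.

    Hypothetical helper for illustration. There is no constant column because
    the intercept is handled separately by centering the targets.
    """
    # Column j holds x**(j + 1); broadcasting gives shape (len(x), degree).
    raw = np.power(x[:, None], np.arange(1, degree + 1))
    if mean is None or std is None:
        # Fit the statistics on the data passed in (the training set).
        mean = raw.mean(axis=0)
        std = raw.std(axis=0)
    return (raw - mean) / std, mean, std


# Placeholder inputs (an assumption): ten evenly spaced points on [0, 1].
x_train = np.linspace(0.0, 1.0, 10)
Phi, mu, sigma = standardized_poly_features(x_train)

# Each column now has zero mean and unit variance on the training set.
print(np.abs(Phi.mean(axis=0)).max(), Phi.std(axis=0))

# New inputs reuse the training statistics instead of recomputing them.
Phi_new, _, _ = standardized_poly_features(np.array([0.25, 0.75]), mean=mu, std=sigma)
```

Reusing the training mean and standard deviation for new inputs keeps the learned weights on the same scale at prediction time.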
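SGD minimizes the same ridge objective that the closed-form page solves exactly: the sum of squared residuals plus an $L_2$ penalty,

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left(y_i - \mathbf{w}^\top \boldsymbol{\phi}_i\right)^2 + \lambda \lVert \mathbf{w} \rVert_2^2 ,$$

where $\boldsymbol{\phi}_i$ is the standardized feature vector of point $i$, $y_i$ is its centered target, and $N = 10$. Its minimizer satisfies the normal equations $(\Phi^\top \Phi + \lambda I)\,\mathbf{w} = \Phi^\top \mathbf{y}$, with $\Phi$ the matrix whose rows are $\boldsymbol{\phi}_i^\top$, so using this convention makes $\lambda$ here identical to the closed-form $\lambda$. Each SGD step draws a mini-batch $\mathcal{B}$ of size $B$ and follows an unbiased estimate of the full gradient, scaling the batch sum by $N/B$:

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \left[ -\frac{2N}{B} \sum_{i \in \mathcal{B}} \left(y_i - \mathbf{w}^\top \boldsymbol{\phi}_i\right) \boldsymbol{\phi}_i + 2\lambda\,\mathbf{w} \right],$$

where $\eta$ is the learning rate.
Choosing $\lambda$: a chicken-and-egg shortcut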
Running SGD needs a value of $\lambda$, yet $\lambda$ is itself a hyperparameter you are supposed to choose by comparing held-out error across candidates. That circularity is the chicken-and-egg problem: you cannot run a fit until you commit to a $\lambda$, but you cannot score a $\lambda$ until you run the fit. A full search over many decades of $\lambda$, each one a complete SGD run, is expensive. The closed-form page already broke the circle once, locating the best value $\lambda^\star$ on this exact data. You reuse that result: first fit SGD at $\lambda^\star$ itself, then search only a narrow band around it. Restricting the search to one decade either side of $\lambda^\star$ is the shortcut: you trust the closed-form page to have found the right neighborhood.
Searching a narrow band around $\lambda^\star$
Now treat $\lambda$ as the quantity to optimize, but only over a narrow log-range bracketing $\lambda^\star$, one decade either side. Each trial runs a full SGD fit and reports the best validation MSE seen during that fit, and the search keeps the $\lambda$ that minimizes it.
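The search loop can be sketched as follows. This is an illustration under stated assumptions rather than the page's search code: it swaps in a plain log-spaced grid for whatever search procedure the page uses. The data generator ($\sin(2\pi x)$ with noise 0.3), the 50-point validation set, the SGD settings, and `lam_star` are placeholders, and the function name `sgd_best_val_mse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder train/validation data (assumptions, not the page's actual data).
N = 10
x_tr = np.linspace(0.0, 1.0, N)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(scale=0.3, size=N)
x_va = rng.uniform(0.0, 1.0, size=50)
y_va = np.sin(2 * np.pi * x_va) + rng.normal(scale=0.3, size=50)

# Standardize with training statistics only, and center targets at the training mean.
powers = np.arange(1, 10)
raw_tr = np.power(x_tr[:, None], powers)
mu, sigma = raw_tr.mean(axis=0), raw_tr.std(axis=0)
P_tr = (raw_tr - mu) / sigma
P_va = (np.power(x_va[:, None], powers) - mu) / sigma
y_mean = y_tr.mean()
t_tr = y_tr - y_mean


def sgd_best_val_mse(lam, lr=1e-3, batch_size=5, steps=4000, eval_every=50, seed=0):
    """Run one full SGD fit and return the lowest validation MSE seen along the way."""
    local_rng = np.random.default_rng(seed)
    w = np.zeros(P_tr.shape[1])
    best = np.inf
    for step in range(1, steps + 1):
        idx = local_rng.choice(N, size=batch_size, replace=False)
        r = t_tr[idx] - P_tr[idx] @ w
        grad = (N / batch_size) * (-2.0 * P_tr[idx].T @ r) + 2.0 * lam * w
        w -= lr * grad
        if step % eval_every == 0:
            # Add the training mean back to undo the target centering.
            mse = np.mean((P_va @ w + y_mean - y_va) ** 2)
            best = min(best, mse)
    return best


lam_star = 1e-3  # placeholder for the closed-form page's value

# Nine log-spaced candidates spanning one decade either side of lam_star.
candidates = lam_star * np.logspace(-1.0, 1.0, 9)
scores = [sgd_best_val_mse(lam) for lam in candidates]
best_lam = candidates[int(np.argmin(scores))]

for lam, score in zip(candidates, scores):
    print(f"lambda = {lam:.2e}   best validation MSE = {score:.4f}")
print("selected lambda:", best_lam)
```

Using the same seed for every candidate means all fits see the same sequence of mini-batches, so differences in validation MSE come from $\lambda$ rather than from sampling noise.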
Takeaways
- SGD minimizes the regularized empirical risk: the $L_2$ penalty enters the gradient as a term proportional to $\lambda \mathbf{w}$, shrinking the weights every step and curbing the degree-9 overfitting. At $\lambda = \lambda^\star$, the value found on the closed-form page, the SGD curve sits almost on top of the closed-form ridge fit.
- Standardizing the features and adopting the sum-of-squares convention make a single $\lambda$ meaningful and let $\lambda^\star$ transfer unchanged from the closed-form page.
- The narrow search lands slightly below $\lambda^\star$. This is expected: $\lambda^\star$ was tuned for the fully converged least-squares solution, whereas iterative SGD adds its own implicit regularization, so it needs a little less explicit shrinkage. Anchoring the search to $\lambda^\star$ still puts you in the right neighborhood, which is the whole point of the shortcut.

