This example fits the degree-9 polynomial model to the noisy sinusoid by stochastic gradient descent (SGD) on the regularized empirical risk, the same ridge objective solved in closed form on the linear regression page. There the regularization strength was tuned by sweeping the closed-form solution and reading off the value $\lambda^\ast \approx 3.04\times10^{-3}$ that minimizes held-out error. Here you reach the same kind of solution iteratively, and you reuse that $\lambda^\ast$ to anchor the choice of regularization rather than re-tuning from scratch.

The dataset

You use the same training set as the closed-form page: ten points of $\sin(2\pi x)$ on $[0,1]$ corrupted by Gaussian noise with standard deviation $0.25$. Holding the data fixed is what lets the value $\lambda^\ast$ carry over unchanged.
import numpy as np

def sinusoidal(x):
    """Noise-free target function sin(2*pi*x)."""
    return np.sin(2 * np.pi * x)

def create_toy_data(func, sample_size, std, domain=(0, 1)):
    """Sample func on an evenly spaced, shuffled grid and add Gaussian noise."""
    x = np.linspace(domain[0], domain[1], sample_size)
    np.random.shuffle(x)
    y = func(x) + np.random.normal(scale=std, size=x.shape)
    return x, y

np.random.seed(1)                       # same draw as the closed-form page
x_train, y_train = create_toy_data(sinusoidal, 10, 0.25)

Standardized polynomial features

The model is a degree-9 polynomial, $g(x;\boldsymbol{\theta}) = \bar{y} + \sum_{k=1}^{9}\theta_k\,\phi_k(x)$. Raw monomials $x, x^2, \dots, x^9$ on $[0,1]$ span many orders of magnitude, which makes the gradient steps lopsided and the penalty $\lambda\lVert\boldsymbol{\theta}\rVert^2$ act unevenly across coordinates. Standardizing each feature to zero mean and unit variance puts every coordinate on the same footing, so a single learning rate and a single $\lambda$ are meaningful and the value of $\lambda^\ast$ transfers directly from the closed-form page, which uses the same standardized features. The intercept is absorbed by centering the targets at $\bar{y}$.
M = 9
P_train = np.vander(x_train, M + 1, increasing=True)[:, 1:]   # columns x^1 .. x^9
mu, sd = P_train.mean(0), P_train.std(0) + 1e-12

def featurize(xq):
    P = np.vander(np.ravel(xq), M + 1, increasing=True)[:, 1:]
    return (P - mu) / sd

Phi_train = featurize(x_train)
y_bar = y_train.mean()                  # intercept; SGD fits the centered residual

# held-out validation set for scoring lambda (same construction as the closed-form page)
x_val = np.linspace(0, 1, 100)
y_val = sinusoidal(x_val) + np.random.RandomState(0).normal(scale=0.25, size=x_val.size)
Phi_val = featurize(x_val)
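As a quick sanity check on the standardization step, the following minimal sketch builds the raw monomial columns on an evenly spaced grid, reports how far apart the raw entries are in magnitude, and confirms that each standardized column has zero mean and unit standard deviation. The grid `x` and the names `P`, `Z`, `smallest`, and `largest` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

M = 9
x = np.linspace(0, 1, 10)                         # illustrative grid on [0, 1]
P = np.vander(x, M + 1, increasing=True)[:, 1:]   # raw monomials x^1 .. x^9

# Spread of the raw entries: smallest nonzero value versus largest value.
smallest = P[P > 0].min()
largest = P.max()
print(f"smallest nonzero raw entry: {smallest:.2e}")
print(f"largest raw entry         : {largest:.2e}")

# Standardize column by column, mirroring the mu / sd step above.
mu, sd = P.mean(0), P.std(0) + 1e-12
Z = (P - mu) / sd
print("column means after standardizing:", np.round(Z.mean(0), 6))
print("column stds after standardizing :", np.round(Z.std(0), 6))
```

Note that new inputs should be transformed with the training `mu` and `sd`, as `featurize` does, rather than with statistics recomputed on the new points.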

The regularized objective and the SGD update

SGD minimizes the same ridge objective the closed-form page solves exactly, the sum of squared residuals plus an $\ell_2$ penalty,

$$J(\boldsymbol{\theta}) = \sum_{i=1}^{n}\big(\boldsymbol{\phi}_i^{\top}\boldsymbol{\theta} - \tilde{y}_i\big)^2 + \lambda\lVert\boldsymbol{\theta}\rVert^2, \qquad \tilde{y}_i = y_i - \bar{y}.$$

Its minimizer satisfies the normal equations $(\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi} + \lambda\mathbf{I})\boldsymbol{\theta} = \boldsymbol{\Phi}^{\top}\tilde{\mathbf{y}}$, so using this convention makes $\lambda$ here identical to the closed-form $\lambda^\ast$. Each SGD step draws a mini-batch $\mathcal{B}$ of size $B$ and follows an unbiased estimate of the full gradient, scaling the batch sum by $n/B$:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\Big(\tfrac{2n}{B}\,\boldsymbol{\Phi}_{\mathcal{B}}^{\top}(\boldsymbol{\Phi}_{\mathcal{B}}\boldsymbol{\theta} - \tilde{\mathbf{y}}_{\mathcal{B}}) + 2\lambda\boldsymbol{\theta}\Big).$$
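For reference, a short derivation using only the symbols already defined connects the objective to the normal equations and to the $n/B$ scaling. The last line assumes the mini-batch is a uniformly random subset of size $B$, so that each index lands in $\mathcal{B}$ with probability $B/n$.

```latex
\begin{aligned}
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})
  &= 2\sum_{i=1}^{n}\boldsymbol{\phi}_i\big(\boldsymbol{\phi}_i^{\top}\boldsymbol{\theta} - \tilde{y}_i\big) + 2\lambda\boldsymbol{\theta}
   = 2\,\boldsymbol{\Phi}^{\top}(\boldsymbol{\Phi}\boldsymbol{\theta} - \tilde{\mathbf{y}}) + 2\lambda\boldsymbol{\theta}, \\
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbf{0}
  &\iff (\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi} + \lambda\mathbf{I})\,\boldsymbol{\theta} = \boldsymbol{\Phi}^{\top}\tilde{\mathbf{y}}, \\
\mathbb{E}_{\mathcal{B}}\Big[\tfrac{2n}{B}\sum_{i\in\mathcal{B}}\boldsymbol{\phi}_i\big(\boldsymbol{\phi}_i^{\top}\boldsymbol{\theta} - \tilde{y}_i\big)\Big]
  &= \tfrac{2n}{B}\cdot\tfrac{B}{n}\sum_{i=1}^{n}\boldsymbol{\phi}_i\big(\boldsymbol{\phi}_i^{\top}\boldsymbol{\theta} - \tilde{y}_i\big)
   = 2\,\boldsymbol{\Phi}^{\top}(\boldsymbol{\Phi}\boldsymbol{\theta} - \tilde{\mathbf{y}}).
\end{aligned}
```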
def sgd_fit(lam, lr=0.004, epochs=20000, batch_size=5, seed=0):
    g = np.random.default_rng(seed)
    yc = y_train - y_bar                # centered target
    n = len(y_train)
    theta = np.zeros(M)
    train_hist, val_hist = [], []
    for _ in range(epochs):
        order = g.permutation(n)
        for s in range(0, n, batch_size):
            b = order[s:s + batch_size]
            err = Phi_train[b] @ theta - yc[b]
            grad = (2 * n / len(b)) * (Phi_train[b].T @ err) + 2 * lam * theta
            theta -= lr * grad
        res = Phi_train @ theta - yc
        train_hist.append(np.sum(res**2) + lam * np.sum(theta**2))
        val_hist.append(np.mean((Phi_val @ theta + y_bar - y_val) ** 2))
    return theta, np.array(train_hist), np.array(val_hist)
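The $n/B$ scaling in the update can be checked numerically. The sketch below uses a synthetic feature matrix and target (random stand-ins, not the notebook's data) and verifies that averaging the scaled mini-batch gradients over one epoch's partition recovers the full gradient of $J$. The helper names `full_grad` and `batch_grad` are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, B, lam = 10, 9, 5, 3.04e-3
Phi = rng.normal(size=(n, M))     # synthetic stand-in for Phi_train
yc = rng.normal(size=n)           # synthetic stand-in for the centered targets
theta = rng.normal(size=M)        # arbitrary point at which to compare gradients

def full_grad(theta):
    """Gradient of the sum-of-squares ridge objective J."""
    return 2 * Phi.T @ (Phi @ theta - yc) + 2 * lam * theta

def batch_grad(theta, b):
    """Mini-batch estimate with the batch sum scaled by n / |b|."""
    err = Phi[b] @ theta - yc[b]
    return (2 * n / len(b)) * (Phi[b].T @ err) + 2 * lam * theta

# One epoch: a random permutation split into equal-sized batches.
order = rng.permutation(n)
batches = [order[s:s + B] for s in range(0, n, B)]

avg = np.mean([batch_grad(theta, b) for b in batches], axis=0)
max_diff = np.max(np.abs(avg - full_grad(theta)))
print(f"max |mean batch gradient - full gradient| = {max_diff:.2e}")
```

The match is exact up to floating-point error here because the batch size divides $n$; with a ragged final batch the per-batch estimates remain individually scaled by `n / len(b)`, but their plain average would no longer equal the full gradient.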

Choosing $\lambda$: a chicken-and-egg shortcut

Running SGD needs a value of $\lambda$, yet $\lambda$ is itself a hyperparameter you are supposed to choose by comparing held-out error across candidates. That circularity is the chicken-and-egg problem: you cannot run a fit until you commit to a $\lambda$, but you cannot score a $\lambda$ until you run the fit. A full search over many decades of $\lambda$, each candidate a complete SGD run, is expensive. The closed-form page already broke the circle once, locating $\lambda^\ast \approx 3.04\times10^{-3}$ on this exact data. You reuse that result: first fit SGD at $\lambda^\ast$ itself, then search only a narrow band around it. Restricting the search to one decade either side of $\lambda^\ast$ is the shortcut: you trust the closed-form page to have found the right neighborhood.
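To make the saving concrete, the sketch below counts candidates at equal log resolution for the narrow band versus a wide sweep. The wide range of $10^{-8}$ to $10^{2}$ and the resolution of four candidates per decade are hypothetical choices made for illustration; only the one-decade band around $\lambda^\ast$ comes from the text.

```python
import numpy as np

LAMBDA_STAR = 3.04e-3
PER_DECADE = 4                                   # hypothetical grid resolution

# Narrow band: one decade either side of lambda* (2 decades in total).
narrow = LAMBDA_STAR * np.logspace(-1, 1, 2 * PER_DECADE + 1)

# Wide sweep: a hypothetical 10-decade range at the same resolution.
wide = np.logspace(-8, 2, 10 * PER_DECADE + 1)

print(f"narrow band: {len(narrow)} candidates, "
      f"{narrow[0]:.2e} .. {narrow[-1]:.2e}")
print(f"wide sweep : {len(wide)} candidates, "
      f"{wide[0]:.2e} .. {wide[-1]:.2e}")
```

Since every candidate costs one complete SGD run, the number of fits scales with the number of decades searched.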
LAMBDA_STAR = 3.04e-3        # optimum of the closed-form ridge fit (companion page)

theta_sgd, train_hist, val_hist = sgd_fit(LAMBDA_STAR)

# closed-form ridge at the same lambda, for comparison
theta_cf = np.linalg.solve(
    Phi_train.T @ Phi_train + LAMBDA_STAR * np.eye(M), Phi_train.T @ (y_train - y_bar))

xq = np.linspace(0, 1, 200)
gap = np.max(np.abs(featurize(xq) @ theta_sgd - featurize(xq) @ theta_cf))
print(f"validation MSE at lambda*           : {val_hist[-1]:.4f}")
print(f"max |SGD - closed-form| over [0, 1] : {gap:.3f}")
validation MSE at lambda*           : 0.0911
max |SGD - closed-form| over [0, 1] : 0.163

Searching a narrow band around $\lambda^\ast$

Now treat $\lambda$ as the quantity to optimize, but only over a narrow log-range bracketing $\lambda^\ast$, one decade either side. Each trial runs a full SGD fit and reports the lowest validation MSE reached at any epoch of that run, and the search keeps the $\lambda$ that minimizes it.
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    lam = trial.suggest_float("lambda", LAMBDA_STAR / 10, LAMBDA_STAR * 10, log=True)
    _, _, val_hist = sgd_fit(lam)
    return val_hist.min()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)

print(f"lambda* (closed form) : {LAMBDA_STAR:.2e}")
print(f"best lambda (SGD)     : {study.best_params['lambda']:.2e}")
print(f"best validation MSE   : {study.best_value:.4f}")
lambda* (closed form) : 3.04e-03
best lambda (SGD)     : 3.07e-04
best validation MSE   : 0.0858

Takeaways

  • SGD minimizes the regularized empirical risk: the penalty enters the gradient as $2\lambda\boldsymbol{\theta}$, shrinking the weights every step and curbing the degree-9 overfitting. At $\lambda^\ast$ the SGD curve stays close to the closed-form ridge fit, with a maximum gap of 0.163 over $[0,1]$.
  • Standardizing the features and adopting the sum-of-squares convention make a single $\lambda$ meaningful and let $\lambda^\ast$ transfer unchanged from the closed-form page.
  • The narrow search lands about a decade below $\lambda^\ast$, at $3.07\times10^{-4}$, essentially on the lower edge of the band. A value below $\lambda^\ast$ is expected: $\lambda^\ast$ was tuned for the fully converged least-squares solution, whereas iterative SGD adds its own implicit regularization, so it needs less explicit shrinkage. Because the best value sits at the edge of the band, the bound itself may be limiting the search. Anchoring the search to $\lambda^\ast$ still confines it to a two-decade neighborhood, which is the whole point of the shortcut.
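The implicit-regularization point can be illustrated with a simplified stand-in: full-batch gradient descent (not SGD) on an unregularized least-squares problem with synthetic data. Starting from zero with a sufficiently small step, the weight norm grows monotonically toward that of the least-squares solution, so stopping early acts like shrinkage even with no explicit penalty. The data, the step-size rule, and the checkpoint iterations are assumptions chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 10, 9
Phi = rng.normal(size=(n, M))     # synthetic features (stand-in data)
y = rng.normal(size=n)            # synthetic targets

# Fully converged reference: the least-squares solution, no penalty.
theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Step size small enough that every eigen-direction contracts monotonically.
s_max = np.linalg.eigvalsh(Phi.T @ Phi).max()
lr = 0.25 / s_max

theta = np.zeros(M)
checkpoints = (10, 100, 1000)
norms = []
for t in range(1, max(checkpoints) + 1):
    theta -= lr * 2 * Phi.T @ (Phi @ theta - y)   # full-batch gradient step
    if t in checkpoints:
        norms.append(np.linalg.norm(theta))

for t, nrm in zip(checkpoints, norms):
    print(f"||theta|| after {t:4d} steps: {nrm:.4f}")
print(f"||theta|| least squares    : {np.linalg.norm(theta_ls):.4f}")
```

This does not reproduce the notebook's SGD run; it only shows the mechanism by which an unconverged iterate carries less weight norm than the converged solution.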
Key references: (Keskar et al., 2016; Bottou et al., 2016; Andrychowicz et al., 2016)

References

  • Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M., Pfau, D., et al. (2016). Learning to learn by gradient descent by gradient descent.
  • Bottou, L., Curtis, F., Nocedal, J. (2016). Optimization Methods for Large-Scale Machine Learning.
  • Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.