The score-based view treats generative modeling as estimating the gradient of the log-density, $\nabla_x \log p_t(x)$, rather than the density itself. The DDPM training objective and DDIM sampler you have already seen are special cases: DDPM trains a noise predictor $\epsilon_\theta(x_t, t)$ which is, up to a known scaling factor, an estimate of the score; DDIM’s deterministic limit is the probability-flow ODE derived from the score-SDE.

This page is a tutorial reading guide for Yang Song’s MIT CBMM lecture below. Each section opens with one slide from the Topography of the Noise deck and walks through the corresponding lecture segment. You can read straight through, or skip into the video at the timestamp noted in each section.

Figure: Topography of the Noise title slide, showing the reverse trajectory from a flat noise plain back to peaked probability mass over the data manifold.

The intractable wall

Figure: Diagram of the normalizing-constant bottleneck. Left: a 16,000-pixel image must be mapped to a probability distribution. Right: the neural network output must be divided by Z, the integral over all possible inputs, which is computationally impossible in high dimensions.

A neural network is a black box from $x \in \mathbb{R}^D$ to a scalar $f_\theta(x) \in \mathbb{R}$. To turn that scalar into a probability density you need

$$p_\theta(x) \;=\; \frac{\exp(f_\theta(x))}{Z_\theta}, \qquad Z_\theta = \int \exp(f_\theta(x))\, dx.$$

For images, $D$ is in the tens of thousands. The integral $Z_\theta$ has no closed form, and naive Monte-Carlo estimates of it have impractically high variance at this dimensionality. Every classical likelihood-based generative model (autoregressive models, normalizing flows, VAEs) makes some architectural sacrifice to keep $Z_\theta$ tractable or to sidestep it: factorize over coordinates, restrict to invertible maps, or optimize a lower bound. Each restriction either limits expressiveness or limits how directly you can score real samples. The score-based view simply refuses to compute $Z_\theta$. Lecture: open at ≈ 8:00 for Yang Song’s framing of this bottleneck.

Trade the map for a compass: the score function

Figure: On the left, a vector field of arrows aligning with the contours of a probability landscape. On the right, the identity ∇ log p(x) = ∇ f(x) − ∇ log Z, with the second term crossed out: the gradient of a constant is zero, so the intractable normalizing constant disappears.

Take the gradient of the log of both sides:

$$s_\theta(x) \;:=\; \nabla_x \log p_\theta(x) \;=\; \nabla_x f_\theta(x) - \cancel{\nabla_x \log Z_\theta} \;=\; \nabla_x f_\theta(x).$$

$Z_\theta$ does not depend on $x$, so its gradient with respect to $x$ is zero. The intractable $Z_\theta$ vanishes from the gradient field, and what remains, the score, is something a neural network can output directly: a vector field with the same shape as the input. A density $p(x)$ is a map of where probability mass lives. The score $\nabla_x \log p(x)$ is a compass that tells you, at any $x$, which direction increases log-density. You don’t need the map to navigate: a compass at every point is enough to climb toward the data. Lecture: ≈ 18:00.
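As a quick numerical check of the cancellation, the sketch below compares the gradient of an unnormalized log-density with the gradient of the normalized one. The double-well function `f`, the quadrature grid, and the finite-difference helper are illustrative assumptions, and the brute-force integral for `Z` is only feasible because this toy problem is one-dimensional.

```python
import numpy as np

def f(x):
    """Hypothetical unnormalized log-density: a 1D double well."""
    return -0.25 * x**4 + x**2

def grad(fn, x, h=1e-5):
    """Central finite-difference derivative of fn at x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

# Brute-force normalizing constant by quadrature (possible only because D = 1).
grid = np.linspace(-10.0, 10.0, 200_001)
Z = np.sum(np.exp(f(grid))) * (grid[1] - grid[0])

def log_p(x):
    """Normalized log-density: log p(x) = f(x) - log Z."""
    return f(x) - np.log(Z)

xs = np.array([-2.0, -0.5, 0.3, 1.7])
score_from_f = grad(f, xs)          # never touches Z
score_from_log_p = grad(log_p, xs)  # uses Z, but the constant drops out
max_gap = np.max(np.abs(score_from_f - score_from_log_p))
```

The two score estimates agree to within floating-point error, and both match the analytic derivative $-x^3 + 2x$ of this particular `f`.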

The low-density trap

Figure: Two panels. Left, 'Structured regions (real data)': a clean inward-pointing vector field around a sharp peak. Right, 'Vast noise plains': gray, randomly-oriented arrows where the model has no training data; a generated particle is trapped, unable to find the data manifold.

Naive score matching has a fatal flaw. Training data $\{x_i\} \sim p_{\text{data}}$ lives on (or near) a thin manifold inside an enormous ambient space. Score matching minimizes the expected squared error between $s_\theta(x)$ and the true score, with the expectation taken under $p_{\text{data}}$:

$$\mathcal{L}(\theta) \;=\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\,\bigl\|\, s_\theta(x) - \nabla_x \log p_{\text{data}}(x) \,\bigr\|^2\right].$$

The model is only trained where the data lives. Far from the manifold the loss puts no weight at all, so the learned compass spins randomly there. At sampling time you start from a Gaussian sample (pure noise, with overwhelming probability far from the data) and the compass gives you no signal to follow back. You stay lost. This is the problem the rest of the lecture is built to solve. Lecture: ≈ 36:00.

Noise as a bridge

Figure: Three time snapshots. At t=0 the probability landscape has three sharp spikes and the rest is empty. By t=1 noise has begun to spread the spikes outward. By t=2 the landscape is a smooth, broad hill with a continuous gradient everywhere, the analogue of ink dispersing into water.

The fix is physical. Inject Gaussian noise into the data and let it spread the probability mass outward over time. Sharp spikes broaden into smooth bumps, empty valleys fill in. After enough noise, the landscape has a non-zero, smoothly-varying gradient everywhere, and a network trained to match the score of the noisy distribution will have a well-defined target across the whole space.

This is the non-equilibrium thermodynamics picture: a drop of dye in water starts as a localized concentration ($t = 0$) and ends as a uniform mixture ($t \to \infty$). Score-based models reverse that diffusion. The forward process buys you a usable signal across the noise plain; the reverse process trades it back for samples on the data manifold. This is exactly the forward noising process you saw in the Brownian-motion section (eight trajectories spreading from the origin), restated in the language of densities rather than particles. Lecture: ≈ 38:00.
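One way to see why noise fixes the training-coverage problem is to count how many perturbed training points land far from the data. The sketch below is a toy illustration under assumed values: a hypothetical 1D dataset with two tight clusters, a "far region" defined as $|x| > 3$, and two arbitrary noise levels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D "dataset": two tight clusters at -1 and +1.
data = np.concatenate([rng.normal(-1.0, 0.05, 5000),
                       rng.normal(1.0, 0.05, 5000)])

def far_region_coverage(sigma, threshold=3.0):
    """Fraction of noise-perturbed training points with |x| > threshold."""
    noisy = data + sigma * rng.standard_normal(data.shape)
    return float(np.mean(np.abs(noisy) > threshold))

# With little noise, no training point ever visits the far region,
# so a score network receives no supervision there.
coverage_small = far_region_coverage(sigma=0.1)

# With heavy noise, a large share of training points land in the far
# region, so the noisy score has a training signal there.
coverage_large = far_region_coverage(sigma=5.0)
```

In this toy setting the small-noise coverage is zero while the large-noise coverage is roughly half of all points, which is the sense in which noise gives the network a target across the whole space.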

The forward SDE

Figure: The forward stochastic differential equation dx = f(x,t) dt + g(t) dW, decomposed into the drift term f(x,t) (an inward pull preventing infinite spreading) and the diffusion term g(t) dW (Gaussian noise injection, a Wiener process).

In continuous time the noising process is a stochastic differential equation:

$$dx \;=\; f(x, t)\, dt \;+\; g(t)\, dW_t, \qquad t \in [0, T],$$

with $x_0 \sim p_{\text{data}}$ and $W_t$ a standard Wiener process.
  • Drift $f(x, t)$: a deterministic pull. In Variance-Preserving (VP) SDEs the drift contracts $x$ toward the origin to stop variance from blowing up; in Variance-Exploding (VE) SDEs it is zero.
  • Diffusion $g(t)$: a scalar volatility schedule controlling how fast Gaussian noise is injected.
DDPM is the discrete-time Variance-Preserving instance of this SDE: the noise schedule $\beta_t$ fixes both coefficients, with $f(x, t) = -\tfrac{1}{2}\beta(t)\, x$ and $g(t) = \sqrt{\beta(t)}$. The forward trajectory you trained against in the DDPM tutorial is, to first order in the step size, an Euler-Maruyama discretization of this SDE. Lecture: ≈ 1:00:00.
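The forward process can be sketched as an Euler-Maruyama simulation of the VP SDE $dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dW_t$. The linear schedule with `BETA_MIN = 0.1` and `BETA_MAX = 20.0`, the horizon `T = 1`, the step count, and the point-mass starting value are assumptions chosen for illustration, not values taken from this page.

```python
import numpy as np

BETA_MIN, BETA_MAX, T = 0.1, 20.0, 1.0  # assumed linear noise schedule

def beta(t):
    """Noise rate beta(t), linear in t."""
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def integrated_beta(t):
    """Closed-form integral of beta from 0 to t."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2

def forward_vp_sde(x0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dx = -0.5*beta*x dt + sqrt(beta) dW."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        drift = -0.5 * beta(t) * x
        noise = rng.standard_normal(x.shape)
        x = x + drift * dt + np.sqrt(beta(t) * dt) * noise
    return x

# Start every particle at the same "data point" x0 = 2.0.
x0 = np.full(20_000, 2.0)
xT = forward_vp_sde(x0)

# Closed-form marginal at time T for a point mass at 2.0.
mean_exact = 2.0 * np.exp(-0.5 * integrated_beta(T))
var_exact = 1.0 - np.exp(-integrated_beta(T))
```

Under these assumptions the simulated particles end up close to a standard normal, in line with the closed-form mean and variance: the drift has erased the starting point while the diffusion has filled in unit variance.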

Training the score network

Figure: Three columns comparing score-matching loss variants. Vanilla score matching computes the exact Jacobian trace: perfect accuracy, O(D) backprops, unscalable for images. Sliced score matching projects onto random 1D directions: O(1) cost via vector-Jacobian product, efficient but high variance. Denoising score matching adds noise and uses the analytical gradient of the perturbation kernel: O(1) cost, matches the SDE framework, the standard choice.

The implicit score-matching loss

$$\mathbb{E}_{p_{\text{data}}}\!\left[\,\tfrac{1}{2}\|s_\theta(x)\|^2 \;+\; \mathrm{tr}\!\left(\nabla_x s_\theta(x)\right)\right]$$

removes the unknown $\nabla \log p_{\text{data}}$ via integration by parts, but the Jacobian trace requires $D$ separate backward passes, which is fatal for images. The exact variant and its two workarounds compare as follows:
| Variant | Trick | Cost | Trade-off |
|---|---|---|---|
| Vanilla | Compute $\mathrm{tr}(\nabla s_\theta)$ exactly | $O(D)$ backprops | Exact, unusable above ~1k dims |
| Sliced | Random unit vectors $v$; replace trace by $v^\top (\nabla s_\theta)\, v$ | $O(1)$ via Jacobian-vector product | Unbiased, higher variance |
| Denoising (DSM) | Perturb $x$ with kernel $q_\sigma(\tilde x \mid x)$; train against the known analytic score $\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)$ | $O(1)$, no Jacobians | Estimates the score of the noisy density, not the clean one |
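The random-projection trick in the sliced variant can be illustrated with a Hutchinson-style trace estimate. This is a minimal sketch under stated assumptions: the score is a toy linear map $s(x) = Ax$, so its Jacobian is the matrix `A` and a Jacobian-vector product is just a matrix product; the probes are Rademacher vectors, which satisfy $\mathbb{E}[vv^\top] = I$ (other probe normalizations rescale the estimate). A real score network would obtain the product from automatic differentiation instead.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
A = rng.standard_normal((D, D))  # Jacobian of the toy linear score s(x) = A x

def jacobian_vector_product(v):
    """Rows of the result are A @ v_i for each probe v_i (rows of v)."""
    return v @ A.T

def hutchinson_trace(jvp, dim, n_probes, rng):
    """Estimate tr(J) as the average of v^T J v over random probes v."""
    v = rng.choice([-1.0, 1.0], size=(n_probes, dim))  # Rademacher probes
    jv = jvp(v)
    return float(np.mean(np.sum(v * jv, axis=1)))

trace_estimate = hutchinson_trace(jacobian_vector_product, D, 20_000, rng)
trace_exact = float(np.trace(A))
```

Each probe costs one Jacobian-vector product regardless of `D`, which is where the $O(1)$ cost in the table comes from; the price is the sampling noise visible in the gap between `trace_estimate` and `trace_exact`.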
DSM is the standard choice. With Gaussian perturbation $\tilde x = x + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$:

$$\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) \;=\; -\frac{\tilde x - x}{\sigma^2} \;=\; -\frac{\epsilon}{\sigma},$$

so the DSM loss reduces to a re-weighted noise-prediction loss: this is exactly the DDPM training objective up to a $\sigma$-dependent constant. The DDPM noise predictor $\epsilon_\theta$ and the score network $s_\theta$ are the same model with a different sign and scale, $s_\theta(\tilde x, \sigma) = -\epsilon_\theta(\tilde x, \sigma)/\sigma$. Lecture: ≈ 22:00–34:00 walks through all three variants.
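A small worked example makes the DSM target concrete. The sketch assumes 1D Gaussian data with standard deviation `s_data`, a single noise level `sigma`, and a one-parameter linear score model $s_a(\tilde x) = a\,\tilde x$; all of these are illustrative choices. Because the model is linear, the DSM minimizer is available by least squares instead of gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
s_data, sigma, n = 1.0, 0.5, 500_000  # assumed data scale, noise level, sample count

x = s_data * rng.standard_normal(n)   # clean data ~ N(0, s_data^2)
eps = rng.standard_normal(n)          # Gaussian noise
x_tilde = x + sigma * eps             # perturbed data
target = -eps / sigma                 # analytic score of the perturbation kernel

def dsm_loss(a):
    """Monte-Carlo DSM loss for the linear score model s_a(x) = a * x."""
    return float(np.mean((a * x_tilde - target) ** 2))

# Least-squares minimizer of the DSM loss over the single parameter a.
a_hat = float(np.sum(x_tilde * target) / np.sum(x_tilde**2))

a_noisy = -1.0 / (s_data**2 + sigma**2)  # slope of the noisy density's score
a_clean = -1.0 / s_data**2               # slope of the clean density's score
```

The fitted slope lands near `a_noisy` (here $-0.8$) and not `a_clean` ($-1.0$), which matches the trade-off noted for DSM: it estimates the score of the noise-perturbed density. The corresponding noise predictor in this toy model would be $-\sigma\, a\, \tilde x$.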

Reversing the SDE

Figure: The reverse SDE equation dx = [f(x,t) − g²(t) ∇ log p(x)] dt + g(t) dW̄, with the score term highlighted as 'our trained neural network model'. Below: a curved arrow leading from a noisy random-pixel image on the right back to a clean data sample on the left, with compass icons marking each integration step.

Anderson’s 1982 reversal theorem is the engine of generation. Every forward diffusion has a reverse-time partner:

$$dx \;=\; \bigl[\,f(x, t) \;-\; g(t)^2\, \nabla_x \log p_t(x)\,\bigr]\, dt \;+\; g(t)\, d\bar W_t,$$

where $d\bar W_t$ is a Wiener process running backward in time and $p_t$ is the marginal density at time $t$ under the forward SDE. The only unknown on the right-hand side is $\nabla_x \log p_t(x)$, and that is precisely what the score network was trained to estimate. Substitute $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ and you have a generator: start from a sample of $p_T \approx \mathcal{N}(0, \sigma_T^2 I)$ and integrate backward to $t = 0$ to land on the data manifold.

This is the same idea as annealed Langevin dynamics: at each $t$, take a step along $s_\theta(\cdot, t)$ (uphill in log-density) plus a noise kick, and gradually decrease the noise level. The continuous-time SDE view subsumes the discrete annealed-Langevin sampler used in NCSN and the ancestral sampler used in DDPM. Lecture: ≈ 1:04:00.
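The reverse-time integration can be sketched on a toy problem where the score is known in closed form, so no network is needed. The assumptions here are a 1D Gaussian data distribution $\mathcal{N}(3, 0.5^2)$, a VP SDE with an assumed linear schedule ($\beta$ from 0.1 to 20 over $T = 1$), and plain Euler-Maruyama stepping; `score` stands in for a trained $s_\theta(x, t)$.

```python
import numpy as np

BETA_MIN, BETA_MAX, T = 0.1, 20.0, 1.0  # assumed linear VP schedule
M0, S0 = 3.0, 0.5                       # toy data distribution N(M0, S0^2)

def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def alpha(t):
    """Signal scaling exp(-0.5 * integral of beta from 0 to t)."""
    return np.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def score(x, t):
    """Exact score of p_t for Gaussian data; stands in for s_theta(x, t)."""
    a = alpha(t)
    var_t = S0**2 * a**2 + 1.0 - a**2
    return -(x - M0 * a) / var_t

def reverse_sde_sample(n_samples=20_000, n_steps=1000, seed=0):
    """Euler-Maruyama integration of the reverse SDE from t = T down to 0."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n_samples)  # x_T ~ N(0, 1), approximately p_T
    for k in range(n_steps, 0, -1):
        t = k * dt
        # Reverse drift f - g^2 * score, with f = -0.5*beta*x and g^2 = beta.
        drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
        noise = rng.standard_normal(n_samples)
        # Stepping backward in time flips the sign of the drift increment.
        x = x - drift * dt + np.sqrt(beta(t) * dt) * noise
    return x

samples = reverse_sde_sample()
```

Starting from pure noise, the samples should end up with mean close to 3 and standard deviation close to 0.5, up to discretization and Monte-Carlo error.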

Two solvers: stochastic vs deterministic

Figure: Two panels comparing samplers on the same probability hill. Left, 'Reverse SDE (Langevin dynamics)': a jagged, randomized trajectory with the diffusion term retained; diverse samples but hundreds of slow, tiny steps for stability. Right, 'Probability flow ODE': a smooth deterministic curve descending the hill; it drops the diffusion term, transports probability mass identically, requires roughly 20× fewer steps, and unlocks a uniquely identifiable latent space.

For every score-SDE there is a deterministic ODE that produces the same marginal densities $p_t$:

$$\frac{dx}{dt} \;=\; f(x, t) \;-\; \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x).$$

This is the probability-flow ODE. The diffusion term is gone and only drift remains, but the family of distributions $\{p_t\}$ swept out by the ODE is identical to the SDE’s. Two consequences:
  1. Fewer steps. ODE solvers (Heun, DPM-Solver, RK45) typically reach good sample quality in roughly 20–50 network function evaluations (NFEs), versus the 500–1000 steps commonly used by reverse-SDE samplers. This is what production systems use.
  2. A bijection between data and noise. The ODE is invertible. Each data point maps to a unique latent code in the prior, which gives you exact log-likelihoods (via the change-of-variables formula and Hutchinson’s estimator) and a meaningful semantic latent space — for free, after training.
The deterministic DDIM sampler ($\eta = 0$) you implemented in the DDIM tutorial is a first-order discretization of this ODE. Lecture: ≈ 1:07:00.
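A deterministic sampler for the probability-flow ODE can be sketched on a toy problem with a closed-form score. The assumptions are a 1D Gaussian data distribution $\mathcal{N}(3, 0.5^2)$, a VP SDE with an assumed linear schedule, and a plain first-order Euler solver with 200 steps; `score` stands in for a trained network, and higher-order solvers such as Heun would need fewer steps.

```python
import numpy as np

BETA_MIN, BETA_MAX, T = 0.1, 20.0, 1.0  # assumed linear VP schedule
M0, S0 = 3.0, 0.5                       # toy data distribution N(M0, S0^2)

def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def alpha(t):
    """Signal scaling exp(-0.5 * integral of beta from 0 to t)."""
    return np.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def score(x, t):
    """Exact score of p_t for Gaussian data; stands in for s_theta(x, t)."""
    a = alpha(t)
    var_t = S0**2 * a**2 + 1.0 - a**2
    return -(x - M0 * a) / var_t

def probability_flow_ode(x_T, n_steps=200):
    """Euler integration of dx/dt = f - 0.5 * g^2 * score from t = T to 0."""
    dt = T / n_steps
    x = np.array(x_T, dtype=float)
    for k in range(n_steps, 0, -1):
        t = k * dt
        velocity = -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)
        x = x - velocity * dt  # backward step in time, no noise term
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(20_000)        # latent codes drawn from the prior
x_0 = probability_flow_ode(x_T)
x_0_again = probability_flow_ode(x_T)    # same latents, same outputs
```

The only randomness is in the initial latents: repeating the solve from the same `x_T` reproduces `x_0` exactly, while the output moments should still match the data distribution up to discretization error.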

Conditional generation via Bayes’ rule

Figure: The score decomposition ∇ log p(x|y) = ∇ log p(x) + ∇ log p(y|x), with the unconditional base model on the left, the forward model (condition) in the middle, and the final guided score on the right. Below, an example: sparse CT projections feeding into 'forward simulation model + prior', producing a high-fidelity reconstruction of an abdominal scan.

Conditioning is almost free in score space. Take logs and gradients of Bayes’ rule; the evidence term $\log p(y)$ does not depend on $x$, so it drops out:

$$\nabla_x \log p(x \mid y) \;=\; \underbrace{\nabla_x \log p(x)}_{\text{unconditional score}} \;+\; \underbrace{\nabla_x \log p(y \mid x)}_{\text{condition / likelihood}}.$$

The unconditional score is your already-trained diffusion prior. The likelihood term comes from a forward model: a classifier, a degradation operator, a physics simulator. Add the two scores at every step of the reverse SDE/ODE and you sample from the posterior $p(x \mid y)$ without retraining the prior. This single identity unifies a remarkable range of applications:
  • Inverse problems in medical imaging. $y$ is a sparse-view CT sinogram; $p(y \mid x)$ is the (linear, known) Radon transform plus measurement noise. The diffusion prior is a generic image model trained once, and the same prior reconstructs MRI, CT, and microscopy images.
  • Class-conditional generation. $p(y \mid x)$ is a classifier; classifier guidance (and its classifier-free cousin) is the same idea.
  • Text-to-image. $y$ is a text embedding; the conditional score is approximated jointly with the unconditional one in a single network.
Lecture: ≈ 1:14:00.
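The additive decomposition can be checked on a toy inverse problem where the posterior is known in closed form. The sketch assumes a standard normal prior, a hypothetical scalar forward model $y = a\,x + \text{noise}$ with noise standard deviation `TAU`, and a single observation `Y`. For simplicity it samples with plain (unadjusted) Langevin dynamics at one noise level instead of the full reverse SDE; in a diffusion sampler the likelihood term would also have to be defined at every noise level.

```python
import numpy as np

A_OBS, TAU, Y = 1.0, 0.5, 2.0  # hypothetical forward model y = A_OBS*x + N(0, TAU^2)

def prior_score(x):
    """Score of the unconditional prior x ~ N(0, 1)."""
    return -x

def likelihood_score(x):
    """Gradient in x of log p(y | x) for the linear-Gaussian forward model."""
    return A_OBS * (Y - A_OBS * x) / TAU**2

def posterior_score(x):
    """Bayes' rule in score space: the two scores simply add."""
    return prior_score(x) + likelihood_score(x)

def langevin(score_fn, n_chains=10_000, n_steps=500, step=0.01, seed=0):
    """Unadjusted Langevin dynamics: x <- x + step*score + sqrt(2*step)*noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_chains)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_chains)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x

samples = langevin(posterior_score)

# Closed-form Gaussian posterior for comparison.
post_mean = A_OBS * Y / (A_OBS**2 + TAU**2)
post_var = TAU**2 / (A_OBS**2 + TAU**2)
```

The prior score was never modified; swapping in a different `likelihood_score` changes the conditioning without touching it, which is the point of the identity above. The small residual gap in variance comes from the finite Langevin step size.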

Where score-based models fit

Figure: A 4×3 comparison table of generative model families. Diffusion (score-based) achieves state-of-the-art sample quality, high mode coverage, unconstrained U-Net architectures, and exact likelihoods via ODE solvers. GANs offer high quality but suffer mode collapse, demand a rigid generator-discriminator pair, and yield no likelihoods. VAEs cover modes well but produce blurry samples, accept only constrained encoder-decoder architectures, and report only approximate likelihoods.

The score-SDE framework inherits the strengths of earlier families and shares few of their weaknesses:
  • Sample quality rivals or exceeds GANs on benchmark image datasets (FID on CIFAR-10, ImageNet, LSUN).
  • Mode coverage is high: there is no adversary to collapse onto a few easy modes; the loss is a simple regression.
  • Architectural freedom matches GANs: any network that maps $\mathbb{R}^D \to \mathbb{R}^D$ works as a score model. No invertibility constraint, no encoder-decoder bottleneck.
  • Exact likelihoods are available via the probability-flow ODE — something GANs cannot offer at all and VAEs only bound.
The trade-off is sampling cost: even with ODE solvers, score-based generation is slower per sample than a single GAN forward pass. Distillation, consistency models, and flow matching are active research directions aimed at closing that gap.

End-to-end picture

Figure: A four-quadrant cycle. Phase 1, forward SDE: clean data points are spread into noise. Phase 2, score matching: a U-Net learns the vector field of the perturbed densities. The generation phase navigates backward through the reverse SDE / ODE. The output is high-fidelity synthesized data. The cycle closes: physical intuition (non-equilibrium thermodynamics) on one side, mathematical key (learning the vector field) on the other.

Putting the pieces together:
  1. Forward SDE $dx = f(x, t)\, dt + g(t)\, dW_t$ smoothly noises clean data into a tractable prior.
  2. Score matching trains $s_\theta(x, t)$ against the perturbation kernel via the DSM loss, which is equivalent to noise prediction up to scaling.
  3. Reverse SDE / probability-flow ODE plugs the trained score back into the dynamics and integrates from $t = T$ to $t = 0$.
  4. Output: samples drawn from $p_{\text{data}}$, with optional conditioning grafted on by additive scores.
DDPM, DDIM, NCSN, EDM, and modern latent-diffusion text-to-image systems are all instances of this same skeleton with different drift/diffusion choices, parameterizations of $s_\theta$, and ODE/SDE solvers.

Hands-on companions

Five sections elsewhere in this chapter let you exercise each piece of the framework on a low-dimensional problem you can plot:
  • Score on a mixture of Gaussians: derive and visualize $\nabla_x \log p(x)$ for a 2D MoG, then watch Langevin dynamics follow it toward the modes.
  • Yang Song’s tutorial section — the official MNIST score-SDE walkthrough: VE-SDE, NCSN++, ancestral and Predictor-Corrector samplers, and the probability-flow ODE.
  • Brownian motion: eight trajectories of $dx = g(t)\, dW_t$, the forward SDE in particle form.
  • DDPM on a 2D MoG — the discrete-time Variance-Preserving instance, with noise prediction.
  • DDIM on a 2D MoG: the deterministic ($\eta = 0$) sampler, a first-order discretization of the probability-flow ODE.

References

  1. Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arxiv.org/abs/2011.13456
  2. Song, Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arxiv.org/abs/1907.05600
  3. Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation 2011.
  4. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 1982.
  5. Hyvärinen. Estimation of non-normalized statistical models by score matching. JMLR 2005.
  6. Song, Garg, Shi, Ermon. Sliced Score Matching: A Scalable Approach to Density and Score Estimation. UAI 2019. arxiv.org/abs/1905.07088