Score-based generative models

The score-based view treats generative modeling as estimating the gradient of the log-density, $\nabla_x \log p_t(x)$, rather than the density itself. The DDPM training objective and DDIM sampler you have already seen are special cases: DDPM trains a noise predictor $\epsilon_\theta(x_t, t)$ which is, up to a known scaling factor, an estimate of the score; DDIM’s deterministic limit is the probability-flow ODE derived from the score-SDE. This page is a tutorial reading guide for Yang Song’s MIT CBMM lecture below. Each section opens with one slide from the Topography of the Noise deck and walks through the corresponding lecture segment. You can read straight through, or skip into the video at the timestamp noted in each section.

Topography of the Noise — title slide showing the reverse trajectory from a flat noise plain back to peaked probability mass over the data manifold

The intractable wall

Diagram of the normalizing-constant bottleneck. Left: a 16,000-pixel image must be mapped to a probability distribution. Right: the neural network output must be divided by Z, the integral over all possible inputs, which is computationally impossible in high dimensions.

A neural network is a black box from $x \in \mathbb{R}^D$ to a scalar $f_\theta(x) \in \mathbb{R}$. To turn that scalar into a probability density you need

p_\theta(x) \;=\; \frac{\exp(f_\theta(x))}{Z_\theta}, \qquad Z_\theta = \int \exp(f_\theta(x))\, dx.

For images, $D$ is in the tens of thousands. The integral $Z_\theta$ has no closed form, and naive Monte Carlo estimates of it have impractically high variance at this dimensionality. Every classical likelihood-based generative model (autoregressive models, normalizing flows, VAEs) makes some architectural sacrifice to keep $Z_\theta$ tractable or to sidestep it: factorize over coordinates, restrict to invertible maps, or optimize a lower bound. Each restriction either limits expressiveness or limits how directly you can score real samples. The score-based view simply refuses to compute $Z_\theta$. _{Lecture: open at ≈ 8:00 for Yang Song’s framing of this bottleneck.}
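To make the bottleneck concrete, here is a minimal sketch that normalizes a toy one-dimensional energy model by brute-force quadrature and then counts how many evaluations the same grid would need in higher dimensions. The function `f`, the integration bounds, and the grid resolution are illustrative assumptions, not part of the lecture.

```python
import math

def f(x):
    """Hypothetical unnormalized log-density (a double well)."""
    return -(x**2 - 1.0) ** 2

def normalizer_1d(lo=-4.0, hi=4.0, n=2001):
    """Trapezoid-rule estimate of Z = integral of exp(f(x)) dx on [lo, hi]."""
    h = (hi - lo) / (n - 1)
    ys = [math.exp(f(lo + i * h)) for i in range(n)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

def grid_points(points_per_axis, dim):
    """Number of evaluations a tensor-product grid needs in `dim` dimensions."""
    return points_per_axis ** dim

Z = normalizer_1d()
print(f"Z = {Z:.4f}")
for dim in (1, 2, 10, 100):
    digits = len(str(grid_points(100, dim)))
    print(f"D={dim:>3}: about 10^{digits - 1} evaluations")
```

In one dimension the integral costs a couple of thousand evaluations. With only 100 points per axis, the same grid in $D$ dimensions needs $100^D$ evaluations, which is already out of reach long before image-scale $D$.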

Trade the map for a compass: the score function

On the left, a vector field of arrows aligning with the contours of a probability landscape. On the right, the identity ∇ log p(x) = ∇ f(x) − ∇ log Z, with the second term crossed out: the gradient of a constant is zero, so the intractable normalizing constant disappears.

Take the gradient of the log of both sides:

s_\theta(x) \;:=\; \nabla_x \log p_\theta(x) \;=\; \nabla_x f_\theta(x) - \cancel{\nabla_x \log Z_\theta} \;=\; \nabla_x f_\theta(x).

The gradient of a constant is zero, and $Z_\theta$ is constant in $x$. The intractable $Z_\theta$ vanishes from the gradient field, and what remains, the score, is something a neural network can output directly: a vector field with the same shape as the input. A density $p(x)$ is a map of where probability mass lives. The score $\nabla_x \log p(x)$ is a compass that tells you, at any $x$, which direction increases log-density. You don’t need the map to navigate: a compass at every point is enough to climb toward the data. _{Lecture: ≈ 18:00.}
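The cancellation can be checked numerically. The sketch below uses a hypothetical one-dimensional unnormalized log-density `f` (two Gaussian bumps) and estimates the score by finite differences for two different, arbitrary values of $\log Z$. Everything here is an illustrative assumption. A real score model would output the vector field directly instead of differentiating a scalar.

```python
import math

def f(x):
    """Hypothetical unnormalized log-density: two Gaussian bumps at -2 and +2."""
    return math.log(math.exp(-0.5 * (x + 2.0) ** 2) + math.exp(-0.5 * (x - 2.0) ** 2))

def score_fd(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

def shifted(log_z):
    """Return log p(x) = f(x) - log Z for an arbitrary constant log Z."""
    return lambda x: f(x) - log_z

for x in (-3.0, -1.0, 0.5, 2.5):
    s_a = score_fd(shifted(0.0), x)
    s_b = score_fd(shifted(123.4), x)
    print(f"x={x:+.1f}  score={s_a:+.4f}  (with another log Z: {s_b:+.4f})")
```

Both columns agree up to floating-point error, and the sign of the score at each point indicates the direction of the nearer bump.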

The low-density trap

Two panels. Left: 'Structured regions (real data)' — a clean inward-pointing vector field around a sharp peak. Right: 'Vast noise plains' — gray, randomly-oriented arrows where the model has no training data; a generated particle is trapped, unable to find the data manifold.

Naive score matching has a fatal flaw. Training data $\{x_i\} \sim p_{\text{data}}$ lives on (or near) a thin manifold inside an enormous ambient space. Score matching minimizes the expected squared error between $s_\theta(x)$ and the true score, with the expectation taken under $p_{\text{data}}$:

\mathcal{L}(\theta) \;=\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\,\bigl\|\, s_\theta(x) - \nabla_x \log p_{\text{data}}(x) \,\bigr\|^2\right].

The model is only trained where the data lives. Far from the manifold the loss puts no weight at all, so the learned compass spins randomly there. At sampling time you start from a Gaussian sample — pure noise, with overwhelming probability far from the data — and the compass gives you no signal to follow back. You stay lost. This is the problem the rest of the lecture is built to solve. _{Lecture: ≈ 36:00.}

Noise as a bridge

Three time snapshots. At t=0 the probability landscape has three sharp spikes and the rest is empty. By t=1 noise has begun to spread the spikes outward. By t=2 the landscape is a smooth, broad hill with a continuous gradient everywhere — the analogue of ink dispersing into water.

The fix is physical. Inject Gaussian noise into the data and let it spread the probability mass outward over time. Sharp spikes broaden into smooth bumps, empty valleys fill in. After enough noise, the landscape has a non-zero, smoothly-varying gradient everywhere, and a network trained to match the score of the noisy distribution will have a well-defined target across the whole space. This is the non-equilibrium thermodynamics picture: a drop of dye in water starts as a localized concentration ($t=0$) and ends as a uniform mixture ($t \to \infty$). Score-based models reverse that diffusion. The forward process buys you a usable signal across the noise plain; the reverse process trades it back for samples on the data manifold. This is exactly the forward noising process you saw in the Brownian-motion section (eight trajectories spreading from the origin), restated in the language of densities rather than particles. _{Lecture: ≈ 38:00.}
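A small numerical sketch of the same picture, under stated assumptions: the "data" are three hypothetical spikes on the real line, smoothed with Gaussian noise of scale `sigma`. The code evaluates the log-density and the score of the smoothed distribution at a point far from the data.

```python
import math

DATA = [-2.0, 0.0, 3.0]  # hypothetical one-dimensional "spikes"

def _logits(x, sigma, data):
    return [-0.5 * ((x - m) / sigma) ** 2 for m in data]

def noisy_score(x, sigma, data=DATA):
    """Score of p_sigma(x) = mean_i N(x; m_i, sigma^2).

    It is a responsibility-weighted average of the per-spike scores -(x - m_i) / sigma^2.
    """
    logits = _logits(x, sigma, data)
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return sum(w * -(x - m) / sigma**2 for w, m in zip(weights, data)) / total

def noisy_log_density(x, sigma, data=DATA):
    """log p_sigma(x), computed with the log-sum-exp trick."""
    logits = _logits(x, sigma, data)
    top = max(logits)
    lse = top + math.log(sum(math.exp(z - top) for z in logits))
    return lse - math.log(len(data) * sigma * math.sqrt(2.0 * math.pi))

x_far = 10.0
for sigma in (0.1, 1.0, 10.0):
    print(f"sigma={sigma:>4}: log p={noisy_log_density(x_far, sigma):10.1f}  "
          f"score={noisy_score(x_far, sigma):+.3f}")
```

The score points back toward the data at every noise level. What changes is the log-density: at small `sigma` the far point is so improbable that no training sample would ever land there, while at large `sigma` it is visited routinely, so a network trained on noisy samples receives a learning signal there.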

The forward SDE

The forward stochastic differential equation dx = f(x,t) dt + g(t) dW, decomposed into the drift term f(x,t) (an inward pull preventing infinite spreading) and the diffusion term g(t) dW (Gaussian noise injection — a Wiener process).

In continuous time the noising process is a stochastic differential equation:

dx \;=\; f(x, t)\, dt \;+\; g(t)\, dW_t, \qquad t \in [0, T],

with $x_0 \sim p_{\text{data}}$ and $W_t$ a standard Wiener process.

Drift $f(x, t)$: a deterministic pull. In Variance-Preserving (VP) SDEs the drift contracts $x$ toward the origin to stop variance from blowing up; in Variance-Exploding (VE) SDEs it is zero.
Diffusion $g(t)$: a scalar volatility schedule controlling how fast Gaussian noise is injected.

DDPM is the discrete-time Variance-Preserving instance of this SDE: its noise schedule $\beta_t$ fixes both coefficients, with $f(x, t) = -\tfrac{1}{2}\beta(t)\, x$ and $g(t) = \sqrt{\beta(t)}$ in the continuous-time limit. The forward trajectory you trained against in the DDPM tutorial matches, to first order in $\beta_t$, an Euler-Maruyama discretization of this SDE. _{Lecture: ≈ 1:00:00.}
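As a rough sketch of the forward process, the code below runs Euler-Maruyama on a VP-SDE with drift $-\tfrac{1}{2}\beta x$ and diffusion $\sqrt{\beta}$. The constant `beta`, the horizon `T`, the step count, and the two-spike "dataset" are all assumptions chosen for illustration. Practical schedules vary $\beta$ over time.

```python
import math
import random

def simulate_vp_sde(x0, beta=1.0, T=5.0, n_steps=500, rng=random):
    """Euler-Maruyama for dx = -0.5*beta*x dt + sqrt(beta) dW (constant beta assumed)."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        drift = -0.5 * beta * x
        x = x + drift * dt + math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
# Hypothetical "data": two sharp spikes at -3 and +3.
starts = [rng.choice([-3.0, 3.0]) for _ in range(2000)]
ends = [simulate_vp_sde(x0, rng=rng) for x0 in starts]

mean = sum(ends) / len(ends)
var = sum((x - mean) ** 2 for x in ends) / len(ends)
print(f"mean={mean:.3f}  variance={var:.3f}")
```

The starting points have variance 9. After the forward pass the samples should sit close to zero mean and unit variance, which is the "variance-preserving" behavior the drift term is there to enforce.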

Training the score network

Three columns comparing score-matching loss variants. Vanilla score matching computes the exact Jacobian trace — perfect accuracy, O(D) backprops, unscalable for images. Sliced score matching projects onto random 1D directions — O(1) cost via vector-Jacobian product, efficient but high variance. Denoising score matching adds noise and uses the analytical gradient of the perturbation kernel — O(1) cost, matches the SDE framework, the standard choice.

The implicit score-matching loss

\mathbb{E}_{p_{\text{data}}}\!\left[\,\tfrac{1}{2}\|s_\theta(x)\|^2 \;+\; \mathrm{tr}\!\left(\nabla_x s_\theta(x)\right)\right]

removes the unknown $\nabla \log p_{\text{data}}$ via integration by parts, but the Jacobian trace requires $D$ separate backward passes, which is fatal for images. The table compares the exact computation with two workarounds:

Variant	Trick	Cost	Trade-off
Vanilla	Compute $\mathrm{tr}(\nabla s_\theta)$ exactly	$O(D)$ backprops	Exact, unusable above ~1k dims
Sliced	Random unit vectors $v$; replace trace by $v^\top (\nabla s_\theta)\, v$	$O(1)$ via Jacobian-vector product	Unbiased, higher variance
Denoising (DSM)	Perturb $x$ with kernel $q_\sigma(\tilde x \mid x)$; train against the known analytic score $\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x)$	$O(1)$, no Jacobians	Estimates the score of the noisy density, not the clean one
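The random-projection idea behind the sliced row can be illustrated on a tiny example. In this sketch a fixed 3×3 matrix stands in for the Jacobian $\nabla s_\theta$, and `jacobian_vector_product` stands in for what automatic differentiation would provide. Both are hypothetical. The probes are Rademacher vectors (entries $\pm 1$), an assumption made here because they satisfy $\mathbb{E}[v v^\top] = I$, so the plain average of $v^\top J v$ is unbiased for the trace.

```python
import random

def jacobian_vector_product(v):
    """v -> J v for a hypothetical fixed 3x3 Jacobian J (stand-in for autodiff)."""
    J = [[2.0, 0.5, -1.0],
         [0.3, -1.0, 0.2],
         [0.7, 0.1, 4.0]]
    return [sum(J[i][j] * v[j] for j in range(3)) for i in range(3)]

def sliced_trace(n_probes, rng):
    """Average of v^T J v over Rademacher probes v: an unbiased estimate of tr(J)."""
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(3)]
        Jv = jacobian_vector_product(v)
        total += sum(vi * ji for vi, ji in zip(v, Jv))
    return total / n_probes

rng = random.Random(0)
exact = 2.0 - 1.0 + 4.0  # sum of the diagonal of J
for n in (1, 10, 1000):
    print(f"{n:>5} probes: estimate={sliced_trace(n, rng):+.3f}  exact={exact:+.3f}")
```

Each probe costs one Jacobian-vector product regardless of dimension. The price is the variance visible at small probe counts, which is the trade-off listed in the table.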

DSM is the standard choice. With Gaussian perturbation $\tilde x = x + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, the kernel score is

\nabla_{\tilde x} \log q_\sigma(\tilde x \mid x) \;=\; -\frac{\tilde x - x}{\sigma^2} \;=\; -\frac{\epsilon}{\sigma},

so the DSM loss reduces to a re-weighted noise-prediction loss, which is the DDPM training objective up to a $\sigma$-dependent weighting. The DDPM noise predictor $\epsilon_\theta$ and the score network $s_\theta$ are the same model up to sign and scale: $s_\theta(\tilde x, \sigma) = -\epsilon_\theta(\tilde x, \sigma) / \sigma$. _{Lecture: ≈ 22:00–34:00 walks through all three variants.}
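A minimal sketch of denoising score matching on a case where the answer is known in closed form. The assumptions: clean data is $\mathcal{N}(0, 1)$, the "network" is a single slope `a` in $s(x) = a x$, and the fit is ordinary least squares rather than gradient descent. The noisy density is then $\mathcal{N}(0, 1 + \sigma^2)$, whose true score is $-x / (1 + \sigma^2)$.

```python
import random

def fit_linear_score_dsm(sigma, n=50_000, seed=0):
    """Fit s(x) = a * x by least squares on the DSM regression targets."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # clean sample
        eps = rng.gauss(0.0, 1.0)      # noise
        x_tilde = x + sigma * eps      # perturbed sample
        target = -eps / sigma          # analytic score of the Gaussian kernel
        # Minimizing sum (a * x_tilde - target)^2 gives
        # a = sum(x_tilde * target) / sum(x_tilde^2).
        num += x_tilde * target
        den += x_tilde * x_tilde
    return num / den

for sigma in (0.5, 1.0, 2.0):
    a = fit_linear_score_dsm(sigma)
    print(f"sigma={sigma}: fitted slope={a:+.3f}  "
          f"true slope={-1.0 / (1.0 + sigma**2):+.3f}")
```

The regression never sees the true score, only the per-sample target $-\epsilon/\sigma$, yet the fitted slope approaches the score of the noisy density. That is the last column of the table in action: DSM recovers the score of the perturbed distribution, not of the clean one.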

Reversing the SDE

The reverse SDE equation dx = [f(x,t) − g²(t) ∇ log p(x)] dt + g(t) dW̄, with the score term highlighted as 'our trained neural network model'. Below: a curved arrow leading from a noisy random-pixel image on the right back to a clean data sample on the left, with compass icons marking each integration step.

Anderson’s 1982 reversal theorem is the engine of generation. Every forward diffusion has a reverse-time partner:

dx \;=\; \bigl[\,f(x, t) \;-\; g(t)^2\, \nabla_x \log p_t(x)\,\bigr]\, dt \;+\; g(t)\, d\bar W_t,

where time runs backward from $T$ to $0$ (so $dt$ is a negative increment), $\bar W_t$ is a Wiener process running backward in time, and $p_t$ is the marginal density at time $t$ under the forward SDE. The only unknown on the right-hand side is $\nabla_x \log p_t(x)$, and that is precisely what the score network was trained to estimate. Substitute $s_\theta(x, t) \approx \nabla_x \log p_t(x)$ and you have a generator: start from a sample of $p_T \approx \mathcal{N}(0, \sigma_T^2 I)$ and integrate backward to $t = 0$ to land on the data manifold. This is the same idea as annealed Langevin dynamics: at each $t$, take a step along $s_\theta(\cdot, t)$, that is, uphill in log-density, plus a noise kick, and gradually decrease the noise level. The continuous-time SDE view subsumes the discrete annealed-Langevin sampler used in NCSN and the ancestral sampler used in DDPM. _{Lecture: ≈ 1:04:00.}
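The reverse-time integration can be sketched on a toy problem where the exact score is available, so no network is needed. Assumptions: a constant-$\beta$ VP-SDE, one-dimensional Gaussian data $\mathcal{N}(\mu, s_0^2)$, and plain Euler-Maruyama stepping. The names `BETA`, `MU`, `S0`, and the step count are illustrative. In practice a trained $s_\theta(x, t)$ would replace `score`.

```python
import math
import random

BETA, T = 1.0, 6.0   # constant-beta VP-SDE (illustrative choice)
MU, S0 = 2.0, 0.5    # hypothetical data distribution N(MU, S0^2)

def marginal(t):
    """Mean and variance of p_t when x_0 ~ N(MU, S0^2) under the VP-SDE."""
    decay = math.exp(-BETA * t)
    return MU * math.sqrt(decay), S0**2 * decay + (1.0 - decay)

def score(x, t):
    """Exact score of p_t; a trained s_theta(x, t) would replace this."""
    m, v = marginal(t)
    return -(x - m) / v

def reverse_sde_sample(rng, n_steps=300):
    """Euler-Maruyama on the reverse SDE, stepping from t = T down to t = 0."""
    dt = T / n_steps
    x = rng.gauss(0.0, 1.0)  # draw from the prior, p_T is close to N(0, 1)
    for i in range(n_steps):
        t = T - i * dt
        f = -0.5 * BETA * x
        g2 = BETA
        # Backward in time: x_{t-dt} = x_t - [f - g^2 * score] dt + g sqrt(dt) z
        x = x - (f - g2 * score(x, t)) * dt + math.sqrt(g2 * dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"mean={mean:.3f} (target {MU})  std={std:.3f} (target {S0})")
```

Starting from unit Gaussian noise, the samples should end up close to the data mean and spread, up to discretization and Monte Carlo error.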

Two solvers: stochastic vs deterministic

Two panels comparing samplers on the same probability hill. Left, 'Reverse SDE (Langevin dynamics)': a jagged, randomized trajectory with the diffusion term retained — diverse samples but hundreds of slow, tiny steps for stability. Right, 'Probability flow ODE': a smooth deterministic curve descending the hill — drops the diffusion term, transports probability mass identically, requires roughly 20× fewer steps, and unlocks a uniquely identifiable latent space.

For every score-SDE there is a deterministic ODE that produces the same marginal densities $p_t$:

\frac{dx}{dt} \;=\; f(x, t) \;-\; \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x).

This is the probability-flow ODE. The diffusion term is gone and only drift remains, but the family of distributions $\{p_t\}$ swept out by the ODE is identical to the SDE’s. Two consequences:

Fewer steps. ODE solvers (Heun, DPM-Solver, RK45) typically reach good sample quality in roughly 20–50 network function evaluations (NFEs), versus the 500–1000 typically used by reverse-SDE samplers. This is why deterministic solvers are the common choice in practice.
A bijection between data and noise. The ODE is invertible. Each data point maps to a unique latent code in the prior, which gives you log-likelihoods via the instantaneous change-of-variables formula (exact in principle, with the divergence usually estimated by Hutchinson’s trace estimator) and a meaningful semantic latent space, with no extra training.

The deterministic DDIM sampler ($\eta = 0$) you implemented in the DDIM tutorial is a first-order discretization of this ODE. _{Lecture: ≈ 1:07:00.}
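A sketch of the probability-flow ODE on a toy problem with a known score, showing both the deterministic decoding and the invertibility. Assumptions: a constant-$\beta$ VP-SDE, one-dimensional Gaussian data, and Heun's method as the solver. All names and constants are illustrative, and a trained network would replace `score`.

```python
import math

BETA, T = 1.0, 6.0   # constant-beta VP-SDE (illustrative choice)
MU, S0 = 2.0, 0.5    # hypothetical data distribution N(MU, S0^2)

def marginal(t):
    """Mean and variance of p_t when x_0 ~ N(MU, S0^2) under the VP-SDE."""
    decay = math.exp(-BETA * t)
    return MU * math.sqrt(decay), S0**2 * decay + (1.0 - decay)

def score(x, t):
    """Exact score of p_t; a trained s_theta(x, t) would replace this."""
    m, v = marginal(t)
    return -(x - m) / v

def ode_velocity(x, t):
    """Right-hand side of the probability-flow ODE: f - 0.5 * g^2 * score."""
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t)

def integrate(x, t_start, t_end, n_steps=100):
    """Heun's method (second order); works in either time direction."""
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        k1 = ode_velocity(x, t)
        k2 = ode_velocity(x + dt * k1, t + dt)
        x = x + 0.5 * dt * (k1 + k2)
        t = t + dt
    return x

latent = 1.3                       # a point in the prior
x0 = integrate(latent, T, 0.0)     # decode: noise -> data
back = integrate(x0, 0.0, T)       # encode: data -> noise
print(f"latent={latent}  decoded={x0:.4f}  re-encoded={back:.4f}")
```

No random numbers appear anywhere in the solver, so the same latent always decodes to the same sample, and running the ODE forward again recovers the latent up to solver error.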

Conditional generation via Bayes’ rule

The score decomposition ∇ log p(x|y) = ∇ log p(x) + ∇ log p(y|x), with the unconditional base model on the left, the forward model (condition) in the middle, and the final guided score on the right. Below, an example: sparse CT projections feeding into 'forward simulation model + prior', producing a high-fidelity reconstruction of an abdominal scan.

Conditioning is almost free in score space. Take logs and gradients of Bayes’ rule:

\nabla_x \log p(x \mid y) \;=\; \underbrace{\nabla_x \log p(x)}_{\text{unconditional score}} \;+\; \underbrace{\nabla_x \log p(y \mid x)}_{\text{condition / likelihood}}.

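The unconditional score is your already-trained diffusion prior. The likelihood term comes from a forward model: a classifier, a degradation operator, a physics simulator. The evidence $p(y)$ does not depend on $x$, so it drops out of the gradient. Add the two scores at every step of the reverse SDE/ODE and you sample from the posterior $p(x \mid y)$ without retraining the prior (along the trajectory the likelihood term is evaluated at the noisy state $x_t$, which usually requires an approximation). This single identity unifies a remarkable range of applications: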

Inverse problems in medical imaging. $y$ is a sparse-view CT sinogram; $p(y \mid x)$ is defined by the (linear, known) Radon transform plus measurement noise. The diffusion prior is trained once on images alone, and the same recipe applies to MRI, CT, and microscopy reconstruction.
Class-conditional generation. $p(y \mid x)$ is a classifier; classifier guidance (and its classifier-free cousin) is the same idea.
Text-to-image. $y$ is a text embedding; the conditional score is approximated jointly with the unconditional one in a single network.

_{Lecture: ≈ 1:14:00.}
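The additive decomposition can be verified exactly in a case where the posterior is known. The sketch assumes a standard normal prior and a noisy identity measurement $y = x + \mathcal{N}(0, \tau^2)$, so the posterior is Gaussian in closed form. The function names and the values of `y` and `tau` are hypothetical.

```python
def prior_score(x):
    """Score of a hypothetical prior p(x) = N(0, 1)."""
    return -x

def likelihood_score(x, y, tau):
    """Gradient in x of log p(y | x) for y = x + N(0, tau^2) measurement noise."""
    return (y - x) / tau**2

def posterior_score_exact(x, y, tau):
    """Score of the closed-form Gaussian posterior p(x | y)."""
    post_mean = y / (1.0 + tau**2)
    post_var = tau**2 / (1.0 + tau**2)
    return -(x - post_mean) / post_var

y, tau = 1.5, 0.5
for x in (-1.0, 0.0, 2.0):
    guided = prior_score(x) + likelihood_score(x, y, tau)
    exact = posterior_score_exact(x, y, tau)
    print(f"x={x:+.1f}  prior+likelihood={guided:+.4f}  exact={exact:+.4f}")
```

The two columns coincide: the posterior score is obtained without ever normalizing the posterior, and swapping in a different measurement model only changes `likelihood_score`.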

Where score-based models fit

A 4×3 comparison table of generative model families. Diffusion (score-based) achieves state-of-the-art sample quality, high mode coverage, unconstrained U-Net architectures, and exact likelihoods via ODE solvers. GANs offer high quality but suffer mode collapse, demand a rigid generator-discriminator pair, and yield no likelihoods. VAEs cover modes well but produce blurry samples, accept only constrained encoder-decoder architectures, and report only approximate likelihoods.

The score-SDE framework inherits the strengths of earlier families and shares few of their weaknesses:

Sample quality rivals or exceeds GANs on benchmark image datasets (FID on CIFAR-10, ImageNet, LSUN).
Mode coverage is high: there is no adversary to collapse onto a few easy modes; the loss is a simple regression.
Architectural freedom matches GANs — any network that maps $\mathbb{R}^D \to \mathbb{R}^D$ works as a score model. No invertibility constraint, no encoder-decoder bottleneck.
Exact likelihoods are available via the probability-flow ODE — something GANs cannot offer at all and VAEs only bound.

The trade-off is sampling cost: even with ODE solvers, score-based generation is slower per sample than a single GAN forward pass. Distillation, consistency models, and flow matching are active research directions aimed at closing that gap.

End-to-end picture

A four-quadrant cycle. Phase 1, forward SDE: clean data points are spread into noise. Phase 2, score matching: a U-Net learns the vector field of the perturbed densities. The generation phase navigates backward through the reverse SDE / ODE. The output is high-fidelity synthesized data. The cycle closes: physical intuition (non-equilibrium thermodynamics) on one side, mathematical key (learning the vector field) on the other.

Putting the pieces together:

Forward SDE $dx = f(x, t)\, dt + g(t)\, dW_t$ smoothly noises clean data into a tractable prior.
Score matching trains $s_\theta(x, t)$ against the perturbation kernel via the DSM loss — equivalent to noise prediction up to scaling.
Reverse SDE / probability-flow ODE plugs the trained score back into the dynamics and integrates from $t = T$ to $t = 0$.
Output: samples approximately distributed as $p_{\text{data}}$, with optional conditioning grafted on by additive scores.

DDPM, DDIM, NCSN, EDM, and modern latent-diffusion text-to-image systems are all instances of this same skeleton with different drift/diffusion choices, parameterizations of $s_\theta$, and ODE/SDE solvers.

Hands-on companions

Five sections elsewhere in this chapter let you exercise each piece of the framework on a low-dimensional problem you can plot:

Score on a mixture of Gaussians: derive and visualize $\nabla_x \log p(x)$ for a 2D MoG, then watch Langevin dynamics climb it toward the modes.
Yang Song’s tutorial section: the official MNIST score-SDE walkthrough, covering VE-SDE, NCSN++, ancestral and Predictor-Corrector samplers, and the probability-flow ODE.
Brownian motion: eight trajectories of $dx = g(t)\, dW_t$, the forward SDE in particle form.
DDPM on a 2D MoG: the discrete-time Variance-Preserving instance, with noise prediction.
DDIM on a 2D MoG: the deterministic ($\eta = 0$) sampler, a first-order discretization of the probability-flow ODE.

References

Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arxiv.org/abs/2011.13456
Song, Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arxiv.org/abs/1907.05600
Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation 2011.
Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 1982.
Hyvärinen. Estimation of non-normalized statistical models by score matching. JMLR 2005.
Song, Garg, Shi, Ermon. Sliced Score Matching: A Scalable Approach to Density and Score Estimation. UAI 2019. arxiv.org/abs/1905.07088

PyTorch reference

PyTorch class	Description
`nn.Conv2d`	Applies a 2D convolution over an input signal composed of several input planes.
`nn.ConvTranspose2d`	Applies a 2D transposed convolution operator over an input image composed of several input planes.
`nn.GroupNorm`	Applies Group Normalization over a mini-batch of inputs.
`nn.Linear`	Applies an affine linear transformation to the incoming data: $y = xA^T + b$.
`nn.SiLU`	Applies the Sigmoid Linear Unit (SiLU) function, element-wise.
