
Diffusion models at a glance

A denoising diffusion model is built from two coupled Markov chains running in opposite directions over the same set of intermediate states $x_0, x_1, \ldots, x_T$:
  • Forward diffusion process (fixed). Gradually corrupts a clean data point $x_0$ into pure Gaussian noise $x_T$ by adding a small amount of Gaussian noise at each step. No learned parameters; specified entirely by a noise schedule $\beta_t$. Read it as a stack of fixed VAE encoders.
  • Reverse denoising process (learnable). Starts from pure noise $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and gradually denoises down to $x_0$. Read it as a stack of learnable VAE decoders, but with a single neural network shared across all $T$ timesteps and conditioned on $t$.
Pictorially, both chains run over the same intermediate states:

[Figure: forward chain $x_0 \to x_1 \to \cdots \to x_T$ under $q$ (fixed), gradually adding noise; $x_0$ is data and $x_T$ is noise. Filmstrip of a clean cat photograph progressively corrupted into pure visual noise, from CMU 11-785, Lecture 24 (Diffusion), slide 24.]

[Figure: reverse chain $x_T \to \cdots \to x_1 \to x_0$ under $p_\theta$ (learned), gradually denoising; $x_T$ is noise and $x_0$ is the sample. The same filmstrip read from pure noise back to the clean photograph, from CMU 11-785, Lecture 24 (Diffusion), slide 27.]

Generation = run the reverse chain. Training = teach the reverse chain to undo a step of the forward chain. The rest of this page builds both chains from a single mathematical primitive: Bishop’s linear-Gaussian theorem.

Bishop’s linear-Gaussian theorem applied to one diffusion step

A single noising step couples $x_{t-1}$ and $x_t$ through a linear-Gaussian relationship, an instance of Bishop’s linear-Gaussian theorem on the gaussians prerequisite page. Recognizing the kernel as a special case of Bishop’s template lets us read off the marginal in closed form and gives a clean route to the multi-step closed-form jump $q(x_t \mid x_0)$ in the next section.

Specialize the theorem to one DDPM step. Take $x = x_{t-1}$, $y = x_t$, with prior $x_{t-1} \sim \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$. The forward kernel is the linear-Gaussian conditional

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t \mid \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right),$$

which matches Bishop’s $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \boldsymbol{\varepsilon}$ template with $\mathbf{A} = \sqrt{1-\beta_t}\, I$, $\mathbf{b} = 0$, conditional precision $\mathbf{L} = \beta_t^{-1} I$, and prior precision $\boldsymbol{\Lambda} = \Sigma_{t-1}^{-1}$. Plugging into Bishop’s marginal formula $p(\mathbf{y}) = \mathcal{N}\left(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top\right)$ gives the forward marginal

$$q(x_t) = \mathcal{N}\left(x_t \mid \sqrt{1-\beta_t}\, \mu_{t-1},\; \beta_t I + (1-\beta_t)\, \Sigma_{t-1}\right).$$

This is the workhorse. Every later result composes the forward kernel: the full forward chain $q(x_{1:T} \mid x_0)$, the closed-form one-shot jump $q(x_t \mid x_0)$, and the forward posterior $q(x_{t-1} \mid x_t, x_0)$ that turns out to be the right reverse-direction target (exact in closed form for any data distribution, and the object the learned $p_\theta$ will be trained to imitate). The next section builds out the forward chain; the one after it introduces the learned $p_\theta$.
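The specialization above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original page: the function name `forward_marginal` and the toy numbers are assumptions, and the covariance is kept as a full matrix so the $\mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top$ structure stays visible.

```python
import numpy as np

def forward_marginal(mu_prev, Sigma_prev, beta_t):
    """Push N(mu_prev, Sigma_prev) through one noising step q(x_t | x_{t-1})."""
    d = mu_prev.shape[0]
    A = np.sqrt(1.0 - beta_t) * np.eye(d)    # A = sqrt(1 - beta_t) I
    L_inv = beta_t * np.eye(d)               # conditional covariance L^{-1} = beta_t I
    mu_t = A @ mu_prev                       # A mu + b, with b = 0
    Sigma_t = L_inv + A @ Sigma_prev @ A.T   # L^{-1} + A Lambda^{-1} A^T
    return mu_t, Sigma_t

# Toy 2-D prior (illustrative numbers only).
mu0 = np.array([2.0, -1.0])
Sigma0 = np.array([[0.5, 0.1],
                   [0.1, 0.3]])
mu1, Sigma1 = forward_marginal(mu0, Sigma0, beta_t=0.1)
```

One consequence worth noticing: the identity covariance is a fixed point of this update, since $\beta_t I + (1-\beta_t) I = I$, which is why the chain settles at unit variance.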

The forward process: composing $T$ steps

The single-step kernel $q(x_t \mid x_{t-1})$, applied $T$ times with a fixed schedule $\beta_1, \ldots, \beta_T \in (0, 1)$, defines the full forward chain:

$$q(x_{1:T} \mid x_0) \;=\; \prod_{t=1}^{T} q(x_t \mid x_{t-1}).$$

Closed-form jump from $x_0$ to any $x_t$. Because each step is linear-Gaussian and they compose, you can sample $x_t$ in a single shot without simulating the intermediate states. Define $\alpha_t = 1 - \beta_t$ and the cumulative product $\bar\alpha_t = \prod_{s=1}^t \alpha_s$. Iterating Bishop’s per-step recursion across $t$ steps gives

$$\begin{aligned} q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t)\, \mathbf{I}\right), \\ x_t &= \sqrt{\bar\alpha_t}\, x_0 \;+\; \sqrt{1 - \bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{aligned}$$

This identity is what makes DDPM training cheap: at every gradient step you draw a random $t$ and compute $x_t$ directly from $x_0$ rather than rolling out the chain.

The schedule is designed so that $\bar\alpha_T \approx 0$, which makes $q(x_T \mid x_0) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$ regardless of $x_0$. The endpoint of the forward chain is therefore (approximately) data-independent pure noise, the same prior the reverse chain will start from.
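A short NumPy sketch of the one-shot jump. The page does not fix a schedule, so the linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$ (the setting reported in Ho et al. 2020) is an assumption here, as is the helper name `q_sample`. The loop at the end cross-checks that iterating the per-step variance rule from zero variance reproduces $1 - \bar\alpha_T$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule; betas[0] holds beta_1
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # alpha_bar[t-1] = prod_{s=1..t} alpha_s

def q_sample(x0, t, eps):
    """Compute x_t from x_0 in one shot for integer t in 1..T, given eps ~ N(0, I)."""
    ab = alpha_bar[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = np.array([2.0, -1.0])
eps = rng.standard_normal(x0.shape)
x_mid = q_sample(x0, 500, eps)       # partially noised
x_end = q_sample(x0, T, eps)         # nearly pure noise

# Cross-check: the per-step rule var_t = beta_t + (1 - beta_t) * var_{t-1},
# started from var_0 = 0 (x_0 is a fixed point), ends at 1 - alpha_bar_T.
var = 0.0
for beta in betas:
    var = beta + (1.0 - beta) * var
```

Under this assumed schedule `alpha_bar[-1]` is well below $10^{-4}$, so `x_end` is almost exactly `eps` and carries essentially no information about `x0`.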

The reverse process: parameterizing what $q$ hides

To generate, we want to draw $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and then iteratively sample

$$x_{t-1} \;\sim\; q(x_{t-1} \mid x_t), \qquad t = T, T-1, \ldots, 1.$$

The catch: $q(x_{t-1} \mid x_t)$ as a function of $x_t$ alone is not directly tractable at sampling time. Even though $x_{t-1}$ and $x_t$ are jointly distributed under the forward joint, computing this conditional requires marginalizing over the data distribution $q(x_0)$, which is precisely what we are trying to model. DDPM’s modeling choice: approximate the reverse step with a learned Gaussian,

$$p_\theta(x_{t-1} \mid x_t) \;=\; \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \sigma_t^2\, \mathbf{I}\right).$$

Two reasons this functional form is reasonable:
  1. Small-$\beta_t$ limit. When each forward step injects only a little noise, the true reverse conditional $q(x_{t-1} \mid x_t)$, even with a non-Gaussian data distribution, is well-approximated by a Gaussian (Sohl-Dickstein et al. 2015). That is the local justification for choosing a Gaussian functional form for $p_\theta$.
  2. Conditioned on $x_0$, $q$ is genuinely Gaussian. A close cousin, $q(x_{t-1} \mid x_t, x_0)$, is exactly Gaussian for any data distribution $q(x_0)$: Bayes’ rule on the forward joint with $x_0$ as a fixed parameter (full derivation in the Notes on the reverse conditional deep-dive below). Training will use this two-conditional form (which has $x_0$ available) as the target the learned $p_\theta$ should match.
Two notes on the parameters:
  • $\mu_\theta(x_t, t)$ is the only learned quantity. A single neural network produces it for every $t$, with $t$ fed in through a sinusoidal time embedding so the same weights handle every noise level.
  • $\sigma_t^2$ is typically fixed by the schedule, commonly either $\beta_t$ or the DDPM posterior variance $\tilde\beta_t$ from $q(x_{t-1} \mid x_t, x_0)$, so the variance is not learned in the basic DDPM.
The full reverse joint factorizes as

$$p_\theta(x_{0:T}) \;=\; p(x_T)\, \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p(x_T) = \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

and sampling is the obvious thing: draw $x_T$ from the standard Gaussian, then walk one step at a time toward $x_0$.
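The sampling walk can be sketched as a plain loop. Everything named here is illustrative: `sample_reverse_chain` is a hypothetical helper, `toy_mu_theta` is a stand-in for a trained network (it just shrinks its input), and the short linear schedule with $\sigma_t^2 = \beta_t$ is one of the two variance choices mentioned above.

```python
import numpy as np

def sample_reverse_chain(mu_theta, sigmas, shape, rng):
    """Ancestral sampling: x_T ~ N(0, I), then x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I)."""
    T = len(sigmas)
    x = rng.standard_normal(shape)           # x_T
    for t in range(T, 0, -1):                # t = T, T-1, ..., 1
        noise = rng.standard_normal(shape)
        x = mu_theta(x, t) + sigmas[t - 1] * noise
    return x                                 # x_0

# Hypothetical stand-in for a trained mean network.
def toy_mu_theta(x_t, t):
    return 0.9 * x_t

betas = np.linspace(1e-4, 0.02, 50)          # assumed schedule with T = 50
sigmas = np.sqrt(betas)                      # sigma_t^2 = beta_t
x0 = sample_reverse_chain(toy_mu_theta, sigmas, shape=(4, 2),
                          rng=np.random.default_rng(0))
```

This sketch mirrors the factorization literally and adds noise at every step, including $t = 1$. Algorithm 2 in Ho et al. (2020) instead sets the added noise to zero at the final step.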

Notes on the reverse conditional

With both chains in place, three observations are worth flagging — one notational, two structural. The deep-dive at the end is safe to skip on a first reading.
  1. Notation: what $q$ does and does not mean. The symbol $q$ does not tag a direction in time. The forward process defines a joint distribution $q(x_0, x_1, \ldots, x_T) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1})$, and anything you can compute from that joint stays inside the $q$-family. Both $q(x_t \mid x_{t-1})$ and its time-reversal $q(x_{t-1} \mid x_t)$ live in it. But only some of these conditionals stay analytically simple after marginalizing over the data distribution: $q(x_t \mid x_{t-1})$ is fixed and explicit by construction, while $q(x_{t-1} \mid x_t)$ is generally not available in closed form unless extra Gaussian assumptions are imposed. The contrast is with $p_\theta(x_{t-1} \mid x_t)$, which is the learned generative model used at sampling time, where $x_0$ (and therefore the data distribution) is no longer available.
  2. What $p_\theta$ targets. The training target the learned $p_\theta(x_{t-1} \mid x_t)$ is fitted against is the forward posterior $q(x_{t-1} \mid x_t, x_0)$, which is exactly Gaussian for any data distribution (see the deep-dive below). At sampling time $x_0$ is unavailable, so the network’s job is to predict what that posterior would have said using only $x_t$ and $t$.
  3. Variance accumulates predictably. Each forward step adds $\beta_t I$ to a $(1-\beta_t)$-shrunken copy of $\Sigma_{t-1}$. The closed-form jump $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t)\, I)$ in the forward-process section is precisely the iteration of this rule across $t$ steps, starting from zero variance at the fixed $x_0$.
The naive reverse conditional $q(x_{t-1} \mid x_t)$, which conditions only on the current noisy state, is not in closed form for an arbitrary data distribution: it requires marginalizing over $q(x_0)$, which is exactly what we are trying to model. The standard DDPM derivation (Ho et al. 2020, eq. 6-7) sidesteps that by working with $q(x_{t-1} \mid x_t, x_0)$, conditioning on both the current noisy state and the original clean sample, which is exactly Gaussian for any $q(x_0)$.

Deep-dive: the two-conditional form is Gaussian for any data distribution $q(x_0)$. Apply Bayes’ rule on the forward joint:

$$q(x_{t-1} \mid x_t, x_0) \;=\; \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}.$$

By the Markov property of the forward chain, $q(x_t \mid x_{t-1}, x_0) = q(x_t \mid x_{t-1})$ is the Gaussian noising kernel. The other two terms are the closed-form jumps from $x_0$ derived in the forward-process section, both Gaussian. The denominator does not depend on $x_{t-1}$, and each factor in the numerator is the exponential of a quadratic in $x_{t-1}$, so the result is again Gaussian in $x_{t-1}$. The data distribution $q(x_0)$ never enters because $x_0$ is a fixed parameter here, not a random variable being marginalized over; its value just shifts the means.

By contrast, $q(x_{t-1} \mid x_t)$ without $x_0$ marginalizes $x_0$ out:

$$q(x_{t-1} \mid x_t) \;=\; \int q(x_{t-1} \mid x_t, x_0)\, q(x_0 \mid x_t)\, dx_0,$$

a mixture of Gaussians weighted by the posterior $q(x_0 \mid x_t)$, generally non-Gaussian unless $q(x_0)$ itself is Gaussian. The two-conditional form $q(x_{t-1} \mid x_t, x_0)$ avoids this entirely: with $x_0$ as a fixed parameter, the data distribution drops out of the algebra and the closed form holds regardless of $q(x_0)$.

That is the forward posterior that appears in the ELBO derivation below: at training time you have $x_0$ available, and the result is a Gaussian whose mean and variance are explicit functions of $(x_t, x_0, t)$ alone.
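For concreteness, the explicit mean and variance of this posterior can be checked numerically. The page does not write them out; the closed form below is the one given in Ho et al. (2020), eq. 7, and the second function recomputes it independently by adding the precisions of the two Gaussian factors in $x_{t-1}$. Function names and test values are illustrative assumptions, and the schedule is the assumed linear one from Ho et al.

```python
import numpy as np

def posterior_ho(x_t, x0, t, betas, alpha_bar):
    """Mean and variance of q(x_{t-1} | x_t, x_0) per Ho et al. (2020), eq. 7; needs t >= 2."""
    beta_t = betas[t - 1]
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = np.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t        # beta_tilde_t
    return coef_x0 * x0 + coef_xt * x_t, var

def posterior_by_precisions(x_t, x0, t, betas, alpha_bar):
    """Same posterior, from multiplying q(x_t | x_{t-1}) and q(x_{t-1} | x_0) in x_{t-1}."""
    beta_t, ab_prev = betas[t - 1], alpha_bar[t - 2]
    a = np.sqrt(1.0 - beta_t)
    # The kernel contributes precision a^2 / beta_t; the jump from x_0
    # contributes precision 1 / (1 - alpha_bar_{t-1}).
    prec = a**2 / beta_t + 1.0 / (1.0 - ab_prev)
    mean = (a * x_t / beta_t + np.sqrt(ab_prev) * x0 / (1.0 - ab_prev)) / prec
    return mean, 1.0 / prec

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
x0 = np.array([2.0, -1.0])
x_t = np.array([0.3, 0.7])
m1, v1 = posterior_ho(x_t, x0, 400, betas, alpha_bar)
m2, v2 = posterior_by_precisions(x_t, x0, 400, betas, alpha_bar)
```

The two routes agree to floating-point precision, and the posterior variance $\tilde\beta_t$ comes out strictly smaller than $\beta_t$, since knowing $x_0$ can only reduce uncertainty about $x_{t-1}$.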
Worked example. Consider a toy diffusion model with only three noising steps:

$$x_0 \;\to\; x_1 \;\to\; x_2 \;\to\; x_3.$$

$x_0$ is a clean data sample (for example, an image) and $x_3$ is almost Gaussian noise. Pick a forward noise schedule

$$\beta_1, \beta_2, \beta_3 \in (0, 1),$$

and define $\alpha_t = 1 - \beta_t$ together with the cumulative product

$$\bar\alpha_t = \prod_{s=1}^{t} \alpha_s, \qquad\text{so}\qquad \bar\alpha_1 = \alpha_1,\quad \bar\alpha_2 = \alpha_1 \alpha_2,\quad \bar\alpha_3 = \alpha_1 \alpha_2 \alpha_3.$$

Forward process. The forward process is fixed: it gradually corrupts the data by

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\; \sqrt{\alpha_t}\, x_{t-1},\; \beta_t I\right).$$

For the three steps,

$$\begin{aligned}
q(x_1 \mid x_0) &= \mathcal{N}\left(x_1;\; \sqrt{\alpha_1}\, x_0,\; \beta_1 I\right), \\
q(x_2 \mid x_1) &= \mathcal{N}\left(x_2;\; \sqrt{\alpha_2}\, x_1,\; \beta_2 I\right), \\
q(x_3 \mid x_2) &= \mathcal{N}\left(x_3;\; \sqrt{\alpha_3}\, x_2,\; \beta_3 I\right).
\end{aligned}$$

Equivalently, you can sample any noisy point directly from $x_0$:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\; \sqrt{\bar\alpha_t}\, x_0,\; (1 - \bar\alpha_t)\, I\right),$$

so the final noisy sample is

$$x_3 = \sqrt{\bar\alpha_3}\, x_0 + \sqrt{1 - \bar\alpha_3}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

Reverse process. The reverse process tries to undo the corruption: $x_3 \to x_2 \to x_1 \to x_0$. Parameterize the learned reverse kernel as

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\; \mu_\theta(x_t),\; \Sigma_\theta(x_t)\right),$$

instantiated at the three steps as $p_\theta(x_2 \mid x_3)$, $p_\theta(x_1 \mid x_2)$, $p_\theta(x_0 \mid x_1)$.

A note on notation: $\mu_\theta(x_t)$ and $\epsilon_\theta(x_t)$ are written with a single argument because $t$ is already pinned down by the subscript on $x_t$. In code, $\mu_\theta$ and $\epsilon_\theta$ are a single shared network used at every timestep and conditioned on $t$ through a time embedding; the $t$ argument is simply left implicit in this notation.

In DDPM the network predicts the noise $\epsilon_\theta(x_t)$ that was added. Plugging that into the reverse mean gives

$$\mu_\theta(x_t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t) \right).$$

Starting from $x_3 \sim \mathcal{N}(0, I)$, the three reverse transitions are

$$\begin{aligned}
\mu_\theta(x_3) &= \frac{1}{\sqrt{\alpha_3}}\left( x_3 - \frac{\beta_3}{\sqrt{1 - \bar\alpha_3}}\, \epsilon_\theta(x_3) \right), & x_2 &\sim \mathcal{N}\left(\mu_\theta(x_3),\; \Sigma_\theta(x_3)\right), \\
\mu_\theta(x_2) &= \frac{1}{\sqrt{\alpha_2}}\left( x_2 - \frac{\beta_2}{\sqrt{1 - \bar\alpha_2}}\, \epsilon_\theta(x_2) \right), & x_1 &\sim \mathcal{N}\left(\mu_\theta(x_2),\; \Sigma_\theta(x_2)\right), \\
\mu_\theta(x_1) &= \frac{1}{\sqrt{\alpha_1}}\left( x_1 - \frac{\beta_1}{\sqrt{1 - \bar\alpha_1}}\, \epsilon_\theta(x_1) \right), & x_0 &\sim \mathcal{N}\left(\mu_\theta(x_1),\; \Sigma_\theta(x_1)\right).
\end{aligned}$$

The essential idea: the forward chain adds known Gaussian noise, and the reverse chain learns how much noise to remove. Putting both directions on the same picture,

$$x_0 \xrightarrow{q} x_1 \xrightarrow{q} x_2 \xrightarrow{q} x_3, \qquad x_3 \sim \mathcal{N}(0, I) \xrightarrow{p_\theta} x_2 \xrightarrow{p_\theta} x_1 \xrightarrow{p_\theta} x_0.$$

What the network outputs

The reverse-step distribution is parameterized as

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2\, \mathbf{I}\right),$$

but what the neural network literally computes is a design choice. There are three algebraically equivalent options:
  1. Predict the mean directly, $\mu_\theta(x_t, t)$. The most direct read of the parameterization above; the network output is a vector with the shape of $x_t$, used as the mean of the reverse Gaussian.
  2. Predict the noise, $\epsilon_\theta(x_t, t)$. The network outputs a vector with the shape of $x_t$ that estimates the noise that was added when forming $x_t$ from $x_0$ via the closed-form jump. The reverse-step mean is then derived analytically: $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\!\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \epsilon_\theta(x_t, t) \right)$.
  3. Predict the clean data, $\hat{x}_{0,\theta}(x_t, t)$. The network outputs an estimate of the original $x_0$. The reverse-step mean is derived from the forward posterior $q(x_{t-1} \mid x_t, x_0)$ with the network’s $\hat{x}_0$ plugged in for the unknown $x_0$.
These three are interchangeable under the reparameterization $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$:
  • Given $(x_t, \hat\epsilon)$: solve for $\hat{x}_0 = (x_t - \sqrt{1 - \bar\alpha_t}\, \hat\epsilon) \,/\, \sqrt{\bar\alpha_t}$.
  • Given $(x_t, \hat{x}_0)$: solve for $\hat\epsilon = (x_t - \sqrt{\bar\alpha_t}\, \hat{x}_0) \,/\, \sqrt{1 - \bar\alpha_t}$.
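The conversions can be checked numerically. In the sketch below, all function names and the scalar test values are illustrative assumptions. The posterior-mean coefficients in `mean_from_x0` are the ones given in Ho et al. (2020), eq. 7, with $\bar\alpha_{t-1}$ recovered as $\bar\alpha_t / \alpha_t$. The point is that options 2 and 3 produce the same reverse-step mean once $\hat\epsilon$ and $\hat{x}_0$ are linked by the reparameterization.

```python
import numpy as np

def x0_from_eps(x_t, eps_hat, ab_t):
    """Solve the reparameterization for x0_hat given a noise estimate."""
    return (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)

def eps_from_x0(x_t, x0_hat, ab_t):
    """Solve the reparameterization for eps_hat given a clean-data estimate."""
    return (x_t - np.sqrt(ab_t) * x0_hat) / np.sqrt(1.0 - ab_t)

def mean_from_eps(x_t, eps_hat, beta_t, ab_t):
    """Reverse-step mean under the noise-prediction parameterization."""
    return (x_t - beta_t / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(1.0 - beta_t)

def mean_from_x0(x_t, x0_hat, beta_t, ab_t):
    """Forward-posterior mean (Ho et al. 2020, eq. 7) with x0_hat plugged in for x_0."""
    ab_prev = ab_t / (1.0 - beta_t)                      # alpha_bar_{t-1}
    coef_x0 = np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = np.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * x0_hat + coef_xt * x_t

# Illustrative values for a single timestep.
beta_t, ab_t = 0.02, 0.5
x_t = np.array([0.3, 0.7])
eps_hat = np.array([-0.4, 1.1])
x0_hat = x0_from_eps(x_t, eps_hat, ab_t)

mu_a = mean_from_eps(x_t, eps_hat, beta_t, ab_t)
mu_b = mean_from_x0(x_t, x0_hat, beta_t, ab_t)
```

Here `mu_a` and `mu_b` coincide up to rounding, and `eps_from_x0(x_t, x0_hat, ab_t)` recovers `eps_hat`.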
DDPM picks ε-prediction. Three reasons make it empirically dominant:
  • The training target $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ has fixed scale at every $t$, so the regression problem is well-conditioned across all noise levels.
  • The corresponding training loss collapses to a plain unweighted MSE, the $L_\text{simple}$ derived in the next section.
  • The network’s job becomes a single intuitive task: “look at this noisy data and tell me the noise that was added”.
Concretely, one forward pass of the DDPM network looks like:

[Figure: network inputs and outputs. Inputs are $x_t$ (noisy input) and $t$ (timestep); $t$ passes through a sinusoidal time embedding; both feed the trainable network $\epsilon_\theta$ (shared across all $t$; MLP / U-Net / DiT), which outputs the predicted noise $\hat\epsilon$ with the same shape as $x_t$.]

The data input dimension is whatever $x_t$ has (2 for the MoG example, $3 \times H \times W$ for images). The timestep enters through a sinusoidal time embedding (the original DDPM lifts Vaswani’s positional-encoding formula and feeds the integer $t$ into it) so a single set of weights handles every noise level. At sampling time, the predicted noise gets plugged into the analytic reverse-step mean above, and a small amount of fresh Gaussian noise is added (controlled by $\sigma_t^2$) to draw $x_{t-1}$.

The next section derives why the noise-prediction MSE is, up to a per-timestep reweighting, the loss that maximizes a lower bound on the data likelihood.
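A minimal sketch of a sinusoidal time embedding in the spirit of the positional-encoding formula. The function name, the `max_period` default, and the layout (sine half followed by cosine half, rather than interleaved) are assumptions; implementations differ in these details.

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps t (shape [B]) to embeddings of shape [B, dim]; dim must be even."""
    half = dim // 2
    # Geometric ladder of frequencies from 1 down toward 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    # Assumed layout: all sines first, then all cosines.
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = sinusoidal_embedding(np.array([0, 1, 500]), dim=8)
```

Each row of `emb` would then be combined with $x_t$ inside the network (for example through a small MLP), so that one set of weights can distinguish noise levels.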

Training objective: from VAE ELBO to the DDPM loss

You now have a fixed forward chain $q$ and a parameterized reverse chain $p_\theta$. The remaining question is what loss to train $\theta$ on. The answer is the same Evidence Lower Bound (ELBO) derived for VAEs (see Optimization and the ELBO for the single-latent derivation), applied here to a deep, Markov latent chain whose encoder happens to be fixed.

DDPM is a hierarchical VAE with two simplifying choices: the latent is a Markov chain $\mathbf x_{1:T}$, and the encoder is fixed, namely hand-designed Gaussian noise injection with no learnable parameters. Hierarchical here means a stack of $T$ latents $\mathbf x_1, \ldots, \mathbf x_T$ rather than the single latent $\mathbf z$ from the basic VAE architecture. A generic hierarchical VAE (NVAE, ladder-VAE, ResNet-VAE) learns both the top-down decoder $p_\theta(\mathbf x_{t-1} \mid \mathbf x_t)$ and a bottom-up encoder $q_\phi(\mathbf x_t \mid \mathbf x_{t-1}, \mathbf x_0)$ at every level; DDPM keeps the top-down chain learnable and freezes the bottom-up chain to a fixed Gaussian schedule.
| VAE | DDPM |
| --- | --- |
| Single latent $\mathbf z$ | Chain $\mathbf x_{1:T} = (\mathbf x_1, \ldots, \mathbf x_T)$ |
| Encoder $q_\phi(\mathbf z \mid \mathbf x)$, learned | Forward process $q(\mathbf x_{1:T} \mid \mathbf x_0) = \prod_t q(\mathbf x_t \mid \mathbf x_{t-1})$, fixed Gaussians |
| Decoder $p_\theta(\mathbf x \mid \mathbf z)$ | Reverse chain $p_\theta(\mathbf x_{0:T}) = p(\mathbf x_T) \prod_t p_\theta(\mathbf x_{t-1} \mid \mathbf x_t)$ |
| Prior $p(\mathbf z) = \mathcal N(\mathbf 0, \mathbf I)$ | Endpoint $p(\mathbf x_T) = \mathcal N(\mathbf 0, \mathbf I)$ |
Substitute these into the joint-form VAE ELBO and you get exactly DDPM’s eq. (3):

$$\log p_\theta(\mathbf x_0) \;\ge\; \mathbb{E}_{q(\mathbf x_{1:T} \mid \mathbf x_0)}\left[\log \frac{p_\theta(\mathbf x_{0:T})}{q(\mathbf x_{1:T} \mid \mathbf x_0)}\right] \;=:\; -L.$$

The DDPM training loss $L$ is the negative VAE ELBO with the single latent replaced by the entire noise chain.

A telescoping decomposition

Both chains are Markov, so the joint distributions factorize across $t$. Applying Bayes’ rule to rewrite each forward step in terms of the forward posterior $q(\mathbf x_{t-1} \mid \mathbf x_t, \mathbf x_0)$ telescopes the bound into three pieces (Ho et al., eq. 5):

$$L \;=\; \mathbb{E}_q\left[\,\underbrace{\mathrm{KL}\left(q(\mathbf x_T \mid \mathbf x_0) \,\|\, p(\mathbf x_T)\right)}_{L_T} \;+\; \sum_{t > 1} \underbrace{\mathrm{KL}\left(q(\mathbf x_{t-1} \mid \mathbf x_t, \mathbf x_0) \,\|\, p_\theta(\mathbf x_{t-1} \mid \mathbf x_t)\right)}_{L_{t-1}} \;-\; \underbrace{\log p_\theta(\mathbf x_0 \mid \mathbf x_1)}_{L_0}\,\right]$$

Each piece has a direct VAE counterpart:
  • $L_0 = -\log p_\theta(\mathbf x_0 \mid \mathbf x_1)$ is the reconstruction term, playing the same role as $-\log p_\theta(\mathbf x \mid \mathbf z)$ in the single-latent VAE, just at the bottom rung of the chain instead of after one decoder pass.
  • $L_T = \mathrm{KL}\left(q(\mathbf x_T \mid \mathbf x_0) \,\|\, p(\mathbf x_T)\right)$ is the prior-matching term, the same role as $\mathrm{KL}(q_\phi(\mathbf z \mid \mathbf x) \,\|\, p(\mathbf z))$ in the VAE. Because the forward process is fixed and the schedule $\beta_t$ is chosen so that $q(\mathbf x_T \mid \mathbf x_0) \to \mathcal N(\mathbf 0, \mathbf I)$ for large $T$, this term has no parameters to optimize: it is approximately constant during training, and DDPM drops it from the loss.
  • $L_{t-1}$ are the per-step transition KLs between the analytically tractable forward posterior $q(\mathbf x_{t-1} \mid \mathbf x_t, \mathbf x_0)$, which is exactly Gaussian because every factor in it is built from the linear-Gaussian forward kernel (see Notes on the reverse conditional above), and the learned reverse step $p_\theta(\mathbf x_{t-1} \mid \mathbf x_t)$. These have no counterpart in a single-latent VAE; they appear because the chain has $T$ rungs.

Why this is operationally simpler than a hierarchical VAE

Fixing the encoder buys two simplifications that ordinary VAEs cannot exploit:
  1. No encoder gradient. The forward process has no $\phi$. The bound-tightening role the encoder plays in a VAE (gradients on $\phi$ minimizing the posterior KL) disappears entirely. Training is a single-network problem in $\theta$.
  2. All $L_{t-1}$ terms share parameters. In the standard fixed-variance DDPM setup, both $q(\mathbf x_{t-1} \mid \mathbf x_t, \mathbf x_0)$ and $p_\theta(\mathbf x_{t-1} \mid \mathbf x_t)$ are Gaussians with matched prescribed covariance, so the learnable part is the mean. Combining the reparameterization $\mathbf x_t = \sqrt{\bar\alpha_t}\, \mathbf x_0 + \sqrt{1-\bar\alpha_t}\, \boldsymbol\epsilon$ from the worked example above with the noise-prediction parameterization $\mu_\theta(\mathbf x_t) \;=\; \frac{1}{\sqrt{\alpha_t}}\left(\mathbf x_t - \frac{\beta_t}{\sqrt{1 - \bar\alpha_t}}\, \boldsymbol\epsilon_\theta(\mathbf x_t)\right)$ collapses each $L_{t-1}$, up to a $t$-dependent weight, to a noise-prediction MSE (Ho et al., §3.2): $L_{t-1} \;\propto\; \mathbb{E}_{\mathbf x_0, \boldsymbol\epsilon}\left[\,\bigl\|\boldsymbol\epsilon \,-\, \boldsymbol\epsilon_\theta\bigl(\sqrt{\bar\alpha_t}\, \mathbf x_0 + \sqrt{1-\bar\alpha_t}\, \boldsymbol\epsilon,\, t\bigr)\bigr\|^2\,\right]$. Dropping the $t$-dependent weight gives the simple training objective $L_{\text{simple}}$ that produced DDPM’s image-quality results.
When you train DDPM on the noise-prediction MSE in the DDPM MoG example, you are optimizing a re-weighted version of the same VAE ELBO, applied to a $T$-deep latent chain whose encoder you decided in advance instead of learning. The score-based methods page reaches the same training loss from a different door (denoising score matching), and the equivalence between the noise predictor $\boldsymbol\epsilon_\theta$ and the score $\nabla_{\mathbf x} \log p_t(\mathbf x)$ is what unifies both views.
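The noise-prediction objective can be sketched as a Monte Carlo estimate on one batch. This is an illustration under assumptions rather than a training loop: `l_simple` and `zero_model` are hypothetical names, the batch is assumed to have shape (B, d), the linear schedule is assumed, and plain NumPy has no autodiff, so an actual gradient step on $\theta$ would need a framework such as JAX or PyTorch.

```python
import numpy as np

def l_simple(eps_model, x0_batch, alpha_bar, rng):
    """Monte Carlo estimate of the unweighted noise-prediction MSE on a (B, d) batch."""
    B = x0_batch.shape[0]
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=B)                 # uniform t in {1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)          # regression target
    ab = alpha_bar[t - 1][:, None]                     # broadcast over the feature axis
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps   # closed-form jump
    residual = eps - eps_model(x_t, t)
    return np.mean(np.sum(residual**2, axis=1))

betas = np.linspace(1e-4, 0.02, 1000)                  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
x0_batch = np.random.default_rng(0).standard_normal((256, 2))

# Hypothetical untrained "network" that always predicts zero noise.
def zero_model(x_t, t):
    return np.zeros_like(x_t)

loss = l_simple(zero_model, x0_batch, alpha_bar, np.random.default_rng(1))
```

With the zero predictor the loss estimates $\mathbb{E}\|\boldsymbol\epsilon\|^2$, which equals the data dimension (2 here); a trained $\boldsymbol\epsilon_\theta$ should drive it below that baseline.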

References

  1. Kingma, Welling. Auto-Encoding Variational Bayes. ICLR 2014. arxiv.org/abs/1312.6114
  2. Ho, Jain, Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS 2020. arxiv.org/abs/2006.11239
  3. Sohl-Dickstein et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arxiv.org/abs/1503.03585
  4. Song et al. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arxiv.org/abs/2011.13456