The VAE architecture introduces an inference network $q(\mathbf z | \mathbf x ; \mathbf \phi)$ that approximates the intractable true posterior $p(\mathbf z | \mathbf x ; \mathbf \theta)$. Jointly training the encoder parameters $\mathbf \phi$ and the decoder parameters $\mathbf \theta$ requires an objective that is tractable yet aligned with maximizing the marginal log-likelihood $\log p(\mathbf x ; \mathbf \theta)$. The Evidence Lower Bound (ELBO) is exactly that objective.

From KL divergence to the ELBO

During the treatment of entropy, we met the concept of relative entropy, or KL divergence, which measures the “distance” between two distributions, with the expectation taken under one of them (so it is not symmetric):

$$KL(q \,\|\, p) \;=\; \mathbb{E}_q\!\left[\log q(\mathbf x) - \log p(\mathbf x)\right] \;=\; -\sum_{\mathbf x} q(\mathbf x) \log \frac{p(\mathbf x)}{q(\mathbf x)}$$

We will use the KL divergence to obtain a suitable loss function for optimizing this approximation via the $DNN_{enc}$ network. Ultimately we are trying to minimize the KL divergence between the approximate posterior $q(\mathbf z | \mathbf x ; \mathbf \phi)$ and the true posterior $p(\mathbf z| \mathbf x ; \mathbf \theta)$:

$$KL(q(\mathbf z | \mathbf x ; \mathbf \phi) \,\|\, p(\mathbf z | \mathbf x; \mathbf \theta)) \;=\; -\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log \frac{p(\mathbf z | \mathbf x; \mathbf \theta)}{q(\mathbf z | \mathbf x ; \mathbf \phi)}$$

Using the definition of conditional probability to replace the posterior, $p(\mathbf z | \mathbf x; \mathbf \theta) = \frac{p(\mathbf z, \mathbf x; \mathbf \theta)}{p(\mathbf x; \mathbf \theta)}$:

$$= -\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log \frac{p(\mathbf z , \mathbf x; \mathbf \theta)}{q(\mathbf z | \mathbf x ; \mathbf \phi) \, p(\mathbf x; \mathbf \theta)}$$

Separating the log of the quotient:

$$= -\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \left[ \log \frac{p(\mathbf z , \mathbf x; \mathbf \theta)}{q(\mathbf z | \mathbf x ; \mathbf \phi)} - \log p(\mathbf x; \mathbf \theta) \right]$$

Distributing the sum:

$$= -\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log \frac{p(\mathbf z , \mathbf x; \mathbf \theta)}{q(\mathbf z | \mathbf x ; \mathbf \phi)} + \sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log p(\mathbf x; \mathbf \theta)$$

Since $\log p(\mathbf x; \mathbf \theta)$ does not depend on $\mathbf z$, it can be pulled out of the sum. And since $q$ is a valid distribution, $\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) = 1$:

$$= -\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log \frac{p(\mathbf z , \mathbf x; \mathbf \theta)}{q(\mathbf z | \mathbf x ; \mathbf \phi)} + \log p(\mathbf x; \mathbf \theta)$$

Rearranging:

$$\Rightarrow \log p(\mathbf x; \mathbf \theta) \;=\; KL\!\left(q(\mathbf z | \mathbf x ; \mathbf \phi) \,\|\, p(\mathbf z | \mathbf x; \mathbf \theta)\right) \;+\; \underbrace{\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log \frac{p(\mathbf z , \mathbf x; \mathbf \theta)}{q(\mathbf z | \mathbf x ; \mathbf \phi)}}_{\mathcal L(\phi, \theta) \,=\, \text{Evidence Lower Bound (ELBO)}}$$

The underbraced quantity $\mathcal L(\phi, \theta)$ is the Evidence Lower Bound (ELBO). It is a function of both the encoder parameters $\phi$ (through $q_\phi$) and the decoder parameters $\theta$ (through the joint $p_\theta(\mathbf z, \mathbf x)$).
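The decomposition can be checked numerically on a small discrete latent. The sketch below is only an illustration: the three joint probabilities and the choice of $q$ are made-up numbers, and the helper names `elbo` and `kl` are hypothetical.

```python
import math

# Hypothetical toy model: z takes 3 values, x is one fixed observation.
# joint[k] = p(z=k, x; theta) for that x (illustrative numbers only).
joint = [0.10, 0.25, 0.05]
# An arbitrary approximate posterior q(z | x; phi); it must sum to 1.
q = [0.5, 0.3, 0.2]

def elbo(q, joint):
    """sum_z q(z|x) * log( p(z, x) / q(z|x) )."""
    return sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint))

def kl(q, p):
    """KL(q || p) = sum_z q(z) * log( q(z) / p(z) )."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p))

evidence = sum(joint)                          # p(x) = sum_z p(z, x)
posterior = [jk / evidence for jk in joint]    # p(z | x) = p(z, x) / p(x)

log_evidence = math.log(evidence)
gap = kl(q, posterior)     # KL(q || true posterior)
bound = elbo(q, joint)     # the ELBO

# The identity log p(x) = KL + ELBO holds up to floating-point error.
print(log_evidence, gap + bound)
```

Note that the ELBO only needs the joint $p(\mathbf z, \mathbf x; \mathbf \theta)$, whereas the KL term needs the normalized posterior, which is the intractable quantity in a real VAE.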

Why is it a lower bound?

The KL divergence is non-negative,

$$KL\!\left(q(\mathbf z | \mathbf x ; \mathbf \phi) \,\|\, p(\mathbf z | \mathbf x; \mathbf \theta)\right) \;\ge\; 0,$$

with equality if and only if $q(\mathbf z | \mathbf x ; \mathbf \phi) = p(\mathbf z | \mathbf x; \mathbf \theta)$ almost everywhere. Dropping the KL term from the equality above therefore turns it into an inequality:

$$\log p(\mathbf x; \mathbf \theta) \;\ge\; \underbrace{\sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log \frac{p(\mathbf z , \mathbf x; \mathbf \theta)}{q(\mathbf z | \mathbf x ; \mathbf \phi)}}_{\mathcal L(\phi, \theta)}$$

The KL is the gap between the bound and the true marginal log-likelihood. The closer the encoder approximates the true posterior, the tighter the bound.
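To see the bound and its tightness concretely, one can sweep over every $q$ on a grid for a two-state latent. This is a minimal sketch under assumed numbers: the joint values are invented, and `elbo` is a hypothetical helper that parameterizes $q$ by a single probability `t`.

```python
import math

# Hypothetical two-state latent: joint[k] = p(z=k, x; theta) for a fixed x.
joint = [0.3, 0.1]
log_evidence = math.log(sum(joint))     # log p(x)
posterior0 = joint[0] / sum(joint)      # true posterior p(z=0 | x) = 0.75

def elbo(t):
    """ELBO when the approximate posterior is q = (t, 1 - t)."""
    q = [t, 1.0 - t]
    return sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint))

# Evaluate the ELBO for q(z=0 | x) = 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]
values = [elbo(t) for t in grid]
best_t = grid[values.index(max(values))]

print("log p(x)          :", log_evidence)
print("max ELBO on grid  :", max(values))
print("argmax q(z=0 | x) :", best_t, "vs true posterior", posterior0)
```

Every grid value stays at or below $\log p(\mathbf x; \mathbf \theta)$, and the maximum sits at the grid point where $q$ coincides with the true posterior.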

A practical form: reconstruction minus KL

The joint $p(\mathbf z, \mathbf x; \mathbf \theta)$ factors as $p(\mathbf x | \mathbf z ; \mathbf \theta)\, p(\mathbf z)$, where $p(\mathbf z)$ is the (fixed, parameter-free) prior, typically $\mathcal N(\mathbf 0, \mathbf I)$. Substituting and expanding the log:

$$\mathcal L(\phi, \theta) \;=\; \sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log p(\mathbf x | \mathbf z; \mathbf \theta) \;+\; \sum_{\mathbf z} q(\mathbf z | \mathbf x ; \mathbf \phi) \log \frac{p(\mathbf z)}{q(\mathbf z | \mathbf x ; \mathbf \phi)}$$

The first sum is an expectation under the encoder, and the second is the negative of a KL divergence, so:

$$\mathcal L(\phi, \theta) \;=\; \underbrace{\mathbb{E}_{q_\phi(\mathbf z | \mathbf x)}\!\left[\log p(\mathbf x | \mathbf z; \mathbf \theta)\right]}_{\text{reconstruction term}} \;-\; \underbrace{KL\!\left(q_\phi(\mathbf z | \mathbf x) \,\|\, p(\mathbf z)\right)}_{\text{regularizer}}$$

This is the form actually computed in code:
  • The reconstruction term is the expected log-likelihood of the observed data $\mathbf x$ under the decoder when $\mathbf z$ is drawn from the encoder. Maximizing it pushes the decoder to put high probability on real data.
  • The KL regularizer pulls the encoder’s posterior toward the prior $p(\mathbf z)$. This prevents the encoder from collapsing into a different point distribution for every datapoint and is what makes the latent space dense and continuous.
Note carefully which KL is which: this regularizer uses the prior $p(\mathbf z)$, not the true posterior $p(\mathbf z | \mathbf x; \mathbf \theta)$. The KL against the true posterior is the (unobservable) bound gap from the previous section; the KL against the prior is part of the loss you actually optimize.
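The two terms can be sketched in a few lines of code. Everything specific here is an assumption rather than something fixed by the text: a diagonal Gaussian encoder output (`mu`, `log_var`), a unit-variance Gaussian decoder likelihood, and `toy_decoder`, a hypothetical stand-in for the decoder network. Under a diagonal Gaussian encoder and an $\mathcal N(\mathbf 0, \mathbf I)$ prior, the KL regularizer has the closed form $\tfrac12 \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$; the reconstruction term is estimated by sampling $\mathbf z$ from the encoder.

```python
import math
import random

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def log_likelihood_unit_gaussian(x, x_mean):
    """log N(x; x_mean, I), the assumed decoder likelihood p(x | z; theta)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, x_mean))
    return -0.5 * sq - 0.5 * len(x) * math.log(2.0 * math.pi)

def toy_decoder(z):
    """Hypothetical decoder: a fixed linear map from 2-D z to 3-D x."""
    return [z[0] + z[1], z[0] - z[1], 0.5 * z[0]]

def elbo_estimate(x, mu, log_var, num_samples=64, rng=None):
    """Monte Carlo reconstruction term minus closed-form KL to the prior."""
    rng = rng or random.Random(0)
    recon = 0.0
    for _ in range(num_samples):
        # Draw z ~ q(z | x) as mu + sigma * eps with eps ~ N(0, 1).
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, log_var)]
        recon += log_likelihood_unit_gaussian(x, toy_decoder(z))
    recon /= num_samples
    return recon - gaussian_kl_to_standard_normal(mu, log_var)

x = [1.0, 0.0, 0.5]                      # illustrative observation
value = elbo_estimate(x, mu=[0.2, -0.1], log_var=[-1.0, -1.0])
print("ELBO estimate:", value, " loss to minimize:", -value)
```

In a training loop the loss is the negative of this quantity, averaged over a minibatch; the KL term is zero exactly when the encoder output equals the prior (`mu = 0`, `log_var = 0`).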

Why is the ELBO useful for optimization?

Re-arrange the identity to put the ELBO on one side:

$$\mathcal L(\phi, \theta) \;=\; \log p(\mathbf x; \mathbf \theta) \;-\; KL\!\left(q(\mathbf z | \mathbf x ; \mathbf \phi) \,\|\, p(\mathbf z | \mathbf x; \mathbf \theta)\right) \;\le\; \log p(\mathbf x; \mathbf \theta).$$

Joint SGD over $(\phi, \theta)$ on the ELBO does two coupled jobs at once:
  • Encoder step (gradients on $\phi$). With $\theta$ fixed, $\log p(\mathbf x; \mathbf \theta)$ is constant in $\phi$, so maximizing $\mathcal L$ over $\phi$ is exactly equivalent to minimizing the posterior KL. The encoder learns to track the true posterior, tightening the bound.
  • Decoder step (gradients on $\theta$). The ELBO is a lower bound on $\log p(\mathbf x; \mathbf \theta)$, so increasing $\mathcal L$ over $\theta$ raises a lower bound on the marginal log-likelihood. The marginal itself rises only as fast as the bound is tight; the encoder’s quality controls the slack.
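The encoder step can be illustrated on a toy discrete latent where the true posterior is computable. This is a sketch under stated assumptions, not how a real VAE is trained: the joint probabilities are invented, $\theta$ is held fixed, $\phi$ is a vector of softmax logits, and the gradient is written out by hand instead of using automatic differentiation.

```python
import math

# Hypothetical fixed decoder: joint[k] = p(z=k, x; theta) for one observation x.
joint = [0.10, 0.05, 0.20, 0.05]
log_evidence = math.log(sum(joint))
posterior = [j / sum(joint) for j in joint]   # true posterior p(z | x; theta)

def softmax(a):
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def elbo(q):
    return sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint))

def kl_to_posterior(q):
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, posterior))

logits = [0.0, 0.0, 0.0, 0.0]   # phi: q = softmax(logits), initially uniform
history = []                    # (ELBO, KL to true posterior) per step
for step in range(500):
    q = softmax(logits)
    history.append((elbo(q), kl_to_posterior(q)))
    # Gradient of the ELBO w.r.t. the logits:
    # dL/da_i = q_i * (h_i - sum_k q_k h_k), with h_k = log(joint_k / q_k).
    h = [math.log(jk / qk) for qk, jk in zip(q, joint)]
    mean_h = sum(qk * hk for qk, hk in zip(q, h))
    grad = [qk * (hk - mean_h) for qk, hk in zip(q, h)]
    logits = [a + 1.0 * g for a, g in zip(logits, grad)]  # ascent, step size 1.0

final_q = softmax(logits)
final_elbo, final_kl = elbo(final_q), kl_to_posterior(final_q)
print("initial ELBO, KL:", history[0])
print("final   ELBO, KL:", (final_elbo, final_kl), " log p(x):", log_evidence)
```

The update only ever touches the ELBO, yet the KL to the true posterior shrinks toward zero and the ELBO climbs toward $\log p(\mathbf x; \mathbf \theta)$, which is the equivalence stated in the encoder bullet.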
The standard variational picture below illustrates this: the KL gap separates the ELBO from the true log-likelihood, and the gap closes only when $q_\phi$ matches the true posterior.

Figure (Bishop): KL represents the tightness of the ELBO bound.

In short: the ELBO is the surrogate that lets you train both networks with a single gradient step on one objective. Optimizing it better fits the data (through $\theta$) and better approximates the posterior (through $\phi$) at the same time, with the second move keeping the first one honest. This same ELBO reappears as the DDPM training loss when the single latent $\mathbf z$ is replaced by a Markov noise chain $\mathbf x_{1:T}$ and the encoder is fixed in advance; see Training objective: from VAE ELBO to the DDPM loss on the diffusion introduction page for the bridge.