In a previous section we saw that the VAE helps us define the latent space. The 'right' latent space is the one that makes the distribution $p(\mathbf z | \mathbf \theta)$ the most likely to produce $\mathbf x$. We therefore introduce a stage that complements the aforementioned generative model, or decoder, given by $p(\mathbf x | \mathbf z ; \mathbf \theta) p(\mathbf z | \mathbf \theta)$. This stage is called the recognition model, or encoder, and is given by $p(\mathbf z | \mathbf x ; \mathbf \theta)$.

The premise is this: the posterior $p(\mathbf z | \mathbf x ; \mathbf \theta)$ will result in a much more meaningful and compact latent space $\mathbf z$ than the prior $p(\mathbf z | \mathbf \theta)$. This encoding, though, calls for sampling from a posterior that is itself intractable. We therefore need an approximation to this distribution, $q(\mathbf z | \mathbf x ; \mathbf \phi)$, which we call the inference model: it approximates the recognition model and helps us optimize the marginal likelihood.

The VAE encoder-decoder spaces are shown below. The picture shows the more compact space that is defined by the encoder.

Figure: VAE spaces and distributions (from here)

The architecture of VAE includes four main components as shown below:

Figure: VAE Architecture (from here)

Similar to the generative model, the inference model can be, in general, a PGM of the form:

$$q(\mathbf z | \mathbf x ; \mathbf \phi) = \prod_{j=1}^M q(\mathbf z_j | Pa(\mathbf z_j), \mathbf x ; \mathbf \phi)$$

and this, similarly to the generative model, can be parametrized with a $DNN_{enc}(\mathbf \phi)$.
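To make the factorization concrete, here are two illustrative special cases; the parent structures are assumptions chosen for illustration, not something the text above prescribes. If no latent variable has latent parents, $Pa(\mathbf z_j) = \emptyset$ for all $j$, the inference model is fully factorized given $\mathbf x$. If instead each variable depends only on its predecessor, $Pa(\mathbf z_j) = \{\mathbf z_{j-1}\}$, it forms a chain:

$$
\begin{aligned}
\text{fully factorized:} \quad q(\mathbf z | \mathbf x ; \mathbf \phi) &= \prod_{j=1}^M q(\mathbf z_j | \mathbf x ; \mathbf \phi) \\
\text{chain:} \quad q(\mathbf z | \mathbf x ; \mathbf \phi) &= q(\mathbf z_1 | \mathbf x ; \mathbf \phi) \prod_{j=2}^M q(\mathbf z_j | \mathbf z_{j-1}, \mathbf x ; \mathbf \phi)
\end{aligned}
$$

In both cases every factor is still conditioned on $\mathbf x$ and shares the same parameters $\mathbf \phi$.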
More specifically, we obtain the approximation using the following construction:

$$(\mathbf \mu, \log \mathbf \Sigma) = DNN_{enc}(\mathbf x, \mathbf \phi)$$

$$q(\mathbf z | \mathbf x ; \mathbf \phi) = N(\mathbf z; \mathbf \mu, \textsf{diag}(\mathbf \Sigma))$$

The $DNN_{enc}$ implements amortized variational inference: a single set of parameters $\mathbf \phi$, shared across all datapoints, maps each $\mathbf x$ to its posterior parameters, instead of a separate set of variational parameters being optimized for every datapoint. The posterior parameters can therefore be estimated over a batch of datapoints, which significantly speeds up parameter learning.

With the encoder defined, the next question is how to train its parameters $\mathbf \phi$ jointly with the decoder parameters $\mathbf \theta$ when the true posterior $p(\mathbf z | \mathbf x; \mathbf \theta)$ is intractable. The answer is the Evidence Lower Bound (ELBO), a tractable surrogate for the marginal log-likelihood that is derived from the KL divergence between $q$ and the true posterior. The derivation and its consequences for joint optimization are covered in the Optimization and the ELBO page.
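As a rough illustration of the encoder construction above, the following NumPy sketch maps a batch of inputs to $(\mathbf \mu, \log \mathbf \Sigma)$, draws one sample from $q(\mathbf z | \mathbf x ; \mathbf \phi)$, and evaluates its log-density. Everything specific here is an assumption made for the example: the one-hidden-layer `tanh` network, the layer sizes, the random (untrained) weights, and the helper names `init_encoder`, `encode`, `sample_q` and `log_q`. It also assumes $\log \mathbf \Sigma$ denotes the elementwise log of the diagonal variances.

```python
import numpy as np

def init_encoder(rng, x_dim, hidden_dim, z_dim):
    """Hypothetical one-hidden-layer encoder; the dict plays the role of phi."""
    return {
        "W1": rng.normal(0.0, 0.1, size=(x_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W_mu": rng.normal(0.0, 0.1, size=(hidden_dim, z_dim)),
        "b_mu": np.zeros(z_dim),
        "W_logvar": rng.normal(0.0, 0.1, size=(hidden_dim, z_dim)),
        "b_logvar": np.zeros(z_dim),
    }

def encode(phi, x):
    """Map a batch x of shape (batch, x_dim) to (mu, log_var), each (batch, z_dim)."""
    h = np.tanh(x @ phi["W1"] + phi["b1"])
    mu = h @ phi["W_mu"] + phi["b_mu"]
    # Predicting the log-variance keeps the variance exp(log_var) positive.
    log_var = h @ phi["W_logvar"] + phi["b_logvar"]
    return mu, log_var

def sample_q(rng, mu, log_var):
    """Draw one z ~ N(mu, diag(exp(log_var))) per datapoint."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def log_q(z, mu, log_var):
    """Log-density of the diagonal Gaussian, summed over latent dimensions."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (z - mu) ** 2 / np.exp(log_var),
        axis=-1,
    )

rng = np.random.default_rng(0)
phi = init_encoder(rng, x_dim=8, hidden_dim=16, z_dim=2)  # assumed sizes
x = rng.normal(size=(4, 8))  # stand-in batch of 4 datapoints
mu, log_var = encode(phi, x)
z = sample_q(rng, mu, log_var)
print(mu.shape, log_var.shape, z.shape, log_q(z, mu, log_var).shape)
```

The same `phi` is applied to every row of `x`, which is the amortization described above: adding datapoints to the batch adds no new variational parameters.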