Variational Autoencoder

May 13, 2024

Ingredients of a Variational Autoencoder (VAE); a minimal code sketch follows the list.

  • Encoder: $q_\phi(z\mid x)$
  • Decoder: $p_\theta(x\mid z)$
  • Prior distribution (bottleneck): $p_\theta(z) = N(0, \boldsymbol{I})$
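
A minimal sketch of these three ingredients in PyTorch (an assumption; the layer sizes, Bernoulli output, and names like Encoder/Decoder are illustrative, not from these notes):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): maps x to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)      # mu_phi(x)
        self.logvar = nn.Linear(h_dim, z_dim)  # log of the diagonal of Sigma_phi(x)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """p_theta(x|z): maps z to Bernoulli parameters for each pixel of x."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, x_dim), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

# Prior (bottleneck): p_theta(z) = N(0, I), fixed, with no learnable parameters.
prior = torch.distributions.Normal(torch.zeros(20), torch.ones(20))
```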

Ideally, we want to choose $\theta$ (the decoder) to maximize the likelihood of the deep latent variable model $p_\theta(x) = \int p_\theta(x\mid z)\,p_\theta(z)\,dz$, but integrating over $z$ is intractable when $z$ is high dimensional.
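
To see the problem concretely, a naive Monte Carlo estimate of this integral samples $z$ from the prior; with high-dimensional $z$, almost no prior samples land where $p_\theta(x\mid z)$ is large, so the estimator's variance blows up. A sketch under the Bernoulli-decoder assumption above, with $x$ binarized:

```python
import torch

def naive_log_likelihood(x, decoder, z_dim=20, num_samples=1000):
    """Estimate log p_theta(x) = log E_{z ~ p(z)}[p_theta(x|z)] by sampling the prior.

    Correct in expectation, but nearly useless for high-dimensional z:
    almost all prior samples give p_theta(x|z) close to 0, so the variance is huge.
    """
    z = torch.randn(num_samples, z_dim)   # z ~ N(0, I)
    x_probs = decoder(z)                  # parameters of p_theta(x|z) for each sample
    log_px_given_z = torch.distributions.Bernoulli(probs=x_probs).log_prob(x).sum(-1)
    # log (1/S) sum_s p_theta(x|z_s), computed stably in log space
    return torch.logsumexp(log_px_given_z, dim=0) - torch.log(torch.tensor(float(num_samples)))
```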

Solution: use a recognition network

$$q_\phi(z\mid x) \approx p_\theta(z\mid x) = \frac{p_\theta(x\mid z)\,p_\theta(z)}{p_\theta(x)}$$

ELBO

\begin{align*}
\log p_\theta(x) & = \underset{z\sim q_\phi(z\mid x)}{E}\left(\log p_\theta(x)\right) \\
& = \underset{z\sim q_\phi(z\mid x)}{E}\left(\log \frac{p_\theta(x,z)}{p_\theta(z\mid x)}\right) \\
& = \underset{z\sim q_\phi(z\mid x)}{E}\left(\log \frac{p_\theta(x,z)\,q_\phi(z\mid x)}{q_\phi(z\mid x)\,p_\theta(z\mid x)}\right) \\
& = \underset{z\sim q_\phi(z\mid x)}{E}\left(\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right) + \underset{z\sim q_\phi(z\mid x)}{E}\left(\log \frac{q_\phi(z\mid x)}{p_\theta(z\mid x)}\right) \\
& = \text{ELBO}_{\theta,\phi}(x) + \text{KL}(q_\phi(z\mid x)\parallel p_\theta(z\mid x))
\end{align*}

This implies that $\underset{\phi}{\arg\max}\{\text{ELBO}_{\theta,\phi}(x)\} = \underset{\phi}{\arg\min}\{\text{KL}(q_\phi(z\mid x)\parallel p_\theta(z\mid x))\}$ because

$$\sum_{i=1}^n \text{ELBO}_{\theta,\phi}(x_i) = \sum_{i=1}^n \left[\log p_\theta(x_i) - \text{KL}(q_\phi(z\mid x_i)\parallel p_\theta(z\mid x_i))\right]$$

and $\log p_\theta(x_i)$ does not depend on $\phi$. Since the KL divergence is non-negative, this also shows that the ELBO is a lower bound on the log-likelihood, hence the name.

The KL divergence here is the information lost when the encoder $q_\phi(z\mid x)$ approximates the true posterior $p_\theta(z\mid x)$, i.e. the decoder network run in the reverse direction, which is consistent with the idea of a recognition network. This should be intuitive: in an autoencoder, given the latent representation, the network should reproduce its input.

Maximizing ELBO

\begin{align*}
\text{ELBO}_{\theta,\phi}(x) & = \underset{z\sim q_\phi(z\mid x)}{E}\left(\log \frac{p_\theta(x,z)\,p_\theta(z)}{p_\theta(z)\,q_\phi(z\mid x)}\right) \\
& = \underset{z\sim q_\phi(z\mid x)}{E}\left(\log p_\theta(x\mid z)\right) - \text{KL}(q_\phi(z\mid x)\parallel p_\theta(z))
\end{align*}
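
With a diagonal Gaussian encoder and $p_\theta(z) = N(0, \boldsymbol{I})$, this KL term has the well-known closed form $\text{KL} = -\frac{1}{2}\sum_j\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$; a one-line sketch (names follow the encoder sketch above):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian q, summed over latent dimensions."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```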

Note that the KL divergence is easy to compute because $p_\theta(z)$ is just the bottleneck prior and $q_\phi(z\mid x)$ is the encoder network. To compute the expectation, use the reparameterization trick:

$$\underset{z\sim q_\phi(z\mid x)}{E}\left(\log p_\theta(x\mid z)\right) = \underset{\epsilon\sim N(0, \boldsymbol{I})}{E}\left(\log p_\theta\left(x\mid \boldsymbol{\mu}_\phi(x) + \boldsymbol{\Sigma}_\phi(x)^{\frac{1}{2}}\epsilon\right)\right)$$

Take samples from $q$ by sampling $\epsilon$, run the decoder network, and estimate the expectation with Monte Carlo. Finally, autodiff through this estimate and run SGD to optimize the ELBO.
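
Putting this together, a sketch of one optimization step (assuming the Encoder, Decoder, and kl_to_standard_normal sketched above, binarized inputs, and a single Monte Carlo sample, which is the usual choice):

```python
import torch

encoder, decoder = Encoder(), Decoder()
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)

def training_step(x):  # x: (batch, 784), binarized to {0, 1}
    mu, logvar = encoder(x)                    # parameters of q_phi(z|x)
    eps = torch.randn_like(mu)                 # eps ~ N(0, I)
    z = mu + torch.exp(0.5 * logvar) * eps     # reparameterization: z = mu + Sigma^{1/2} eps
    x_probs = decoder(z)                       # parameters of p_theta(x|z)
    # One-sample Monte Carlo estimate of E_q[log p_theta(x|z)] (Bernoulli likelihood).
    log_px_given_z = torch.distributions.Bernoulli(probs=x_probs).log_prob(x).sum(-1)
    elbo = log_px_given_z - kl_to_standard_normal(mu, logvar)
    loss = -elbo.mean()                        # maximize the ELBO = minimize its negative
    optimizer.zero_grad()
    loss.backward()                            # autodiff through the reparameterized sample
    optimizer.step()
    return loss.item()
```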