Ingredients of Variational Autoencoder (VAE).
- Encoder: $q_\phi(z \mid x)$
- Decoder: $p_\theta(x \mid z)$
- Prior distribution (bottleneck): $p_\theta(z) = \mathcal{N}(0, I)$
Ideally, we want to choose $\theta$ (the decoder) to maximize the likelihood of the deep latent variable model $p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$, but the integral over $z$ becomes intractable when $z$ is high-dimensional.
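To see what the naive estimator looks like, here is a sketch in a hypothetical 1-D toy model (prior $p(z)=\mathcal{N}(0,1)$, likelihood $p(x \mid z)=\mathcal{N}(z,1)$, so the exact marginal is $p(x)=\mathcal{N}(0,2)$). Plain Monte Carlo over the prior works in one dimension; it is in high dimensions that most prior samples miss the region where $p_\theta(x \mid z)$ is large and the estimator breaks down:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, var):
    """Log density of N(x; mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

x = 1.3
# naive Monte Carlo: p(x) ~= (1/S) * sum_s p(x | z_s), with z_s ~ p(z)
z = rng.standard_normal(200_000)
p_mc = np.exp(log_normal(x, z, 1.0)).mean()

# analytic marginal for this toy model: p(x) = N(0, 2)
p_exact = np.exp(log_normal(x, 0.0, 2.0))
print(p_mc, p_exact)  # the two agree closely in 1-D
```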
Solution: use a recognition network to approximate the intractable posterior:
$$q_\phi(z \mid x) \approx p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)}$$
ELBO
$$
\begin{aligned}
\log p_\theta(x) &= \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x)\right] \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x,z)}{p_\theta(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z \mid x)} \cdot \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\
&= \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z \mid x)}\right] + \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\
&= \mathrm{ELBO}_{\theta,\phi}(x) + \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)
\end{aligned}
$$
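The decomposition $\log p_\theta(x) = \mathrm{ELBO}_{\theta,\phi}(x) + \mathrm{KL}(q_\phi \,\|\, p_\theta(z \mid x))$ can be verified numerically in a hypothetical conjugate-Gaussian toy model where every term has a closed form (the model and the encoder parameters $m$, $s$ below are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical conjugate-Gaussian toy model, all terms in closed form:
# prior p(z) = N(0, 1), likelihood p(x|z) = N(z, 1)
#   => marginal p(x) = N(0, 2), posterior p(z|x) = N(x/2, 1/2).
# Encoder q(z|x) = N(m, s^2) with arbitrary m, s.
x, m, s = 1.3, 0.4, 0.9

log_px = -0.5 * np.log(4 * np.pi) - x**2 / 4  # log of N(x; 0, 2)

# ELBO = E_q[log p(x|z)] - KL(q || p(z)), each term in closed form
E_log_lik = -0.5 * np.log(2 * np.pi) - (s**2 + (x - m) ** 2) / 2
kl_prior = -np.log(s) + (s**2 + m**2) / 2 - 0.5
elbo = E_log_lik - kl_prior

# KL(q || p(z|x)) between N(m, s^2) and N(x/2, 1/2)
kl_post = np.log(np.sqrt(0.5) / s) + (s**2 + (m - x / 2) ** 2) - 0.5

# the identity log p(x) = ELBO + KL(q || p(z|x)) holds exactly
print(log_px, elbo + kl_post)
```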
This implies that
$$\arg\max_\phi \left\{\mathrm{ELBO}_{\theta,\phi}(x)\right\} = \arg\min_\phi \left\{\mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)\right\}$$
because
$$\sum_{i=1}^{n} \mathrm{ELBO}_{\theta,\phi}(x_i) = \sum_{i=1}^{n} \left[\log p_\theta(x_i) - \mathrm{KL}\left(q_\phi(z \mid x_i) \,\|\, p_\theta(z \mid x_i)\right)\right]$$
and $\log p_\theta(x_i)$ does not depend on $\phi$.
The KL divergence here is the information lost when the encoder approximates the true posterior $p_\theta(z \mid x)$, i.e. the decoder network run in the reverse direction, which is consistent with the idea of a recognition network. This should be intuitive because, in an autoencoder, the network should output the same thing as the input, given the latent representation.
Maximizing ELBO
$$
\mathrm{ELBO}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log \frac{p_\theta(x \mid z)\, p_\theta(z)}{q_\phi(z \mid x)}\right] = \mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z)\right)
$$
Note that the KL divergence is easy to compute in closed form because $p_\theta(z)$ is just the standard-normal bottleneck and $q_\phi(z \mid x)$ is the Gaussian output by the encoder network. To compute the expectation, use the reparameterization trick:
$$\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\left[\log p_\theta\!\left(x \,\middle|\, \mu_\phi(x) + \Sigma_\phi(x)^{1/2}\epsilon\right)\right];$$
take samples of $z$ from $q_\phi$ by transforming samples of $\epsilon$, run the decoder network, then use Monte Carlo to estimate the expectation. With this, take the autodiff gradient of the estimated ELBO and run SGD to optimize it.
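The reparameterized Monte Carlo estimate can be sketched in numpy, reusing the hypothetical toy decoder $p_\theta(x \mid z) = \mathcal{N}(z, 1)$ so the expectation also has a closed form to compare against (a real implementation would use an autodiff framework so gradients flow through $\mu_\phi$ and $\Sigma_\phi$):

```python
import numpy as np

rng = np.random.default_rng(0)

x = 1.3
mu, log_var = 0.65, np.log(0.5)  # pretend (hypothetical) encoder outputs for x

# reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
# so randomness is pushed outside the parameters mu, log_var
eps = rng.standard_normal(100_000)
z = mu + np.exp(0.5 * log_var) * eps

# toy decoder p(x|z) = N(x; z, 1); Monte Carlo estimate of E_q[log p(x|z)]
log_lik = -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2
mc = log_lik.mean()

# closed form for this Gaussian toy: -0.5*log(2*pi) - (sigma^2 + (x - mu)^2)/2
exact = -0.5 * np.log(2 * np.pi) - (np.exp(log_var) + (x - mu) ** 2) / 2
print(mc, exact)  # MC estimate matches the closed form
```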