KL Divergence
Use a distribution q to approximate p. A common measure of the "distance" or "divergence" between the two distributions is the Kullback-Leibler (KL) Divergence.
$$\mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
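As a quick illustration (a minimal sketch; the two Gaussians and their parameters are arbitrary choices), the definition can be evaluated in closed form for univariate Gaussians and checked against a Monte Carlo estimate of E_{x∼p}[log p(x) − log q(x)]:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two univariate Gaussians: p = N(0, 1), q = N(1, 2^2)  (illustrative choice)
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 2.0

# Closed-form KL(p || q) for univariate Gaussians
kl_exact = (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Monte Carlo estimate: KL(p || q) = E_{x~p}[log p(x) - log q(x)]
x = rng.normal(mu_p, sigma_p, size=100_000)
kl_mc = np.mean(norm.logpdf(x, mu_p, sigma_p) - norm.logpdf(x, mu_q, sigma_q))

print(kl_exact, kl_mc)  # the two values should agree closely
```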
Informally, it measures the information lost when p is approximated with q, so the goal is to minimize the KL Divergence. Note that KL is not symmetric: KL(p∥q) ≠ KL(q∥p) in general. It turns out that minimizing KL(p∥p_θ) over θ gives the MLE.
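To make the last claim concrete, take p to be the data distribution, from which we have samples x₁, …, x_N, and let q = p_θ. The first term below does not depend on θ:

$$\begin{aligned}
\arg\min_{\theta} \, \mathrm{KL}(p \,\|\, p_\theta)
&= \arg\min_{\theta} \left\{ \mathbb{E}_{x \sim p}\big[\log p(x)\big] - \mathbb{E}_{x \sim p}\big[\log p_\theta(x)\big] \right\} \\
&= \arg\max_{\theta} \, \mathbb{E}_{x \sim p}\big[\log p_\theta(x)\big] \\
&\approx \arg\max_{\theta} \, \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)
\end{aligned}$$

The last line is the average log-likelihood of the data, so minimizing KL(p∥p_θ) over θ recovers the MLE.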
Reverse KL Divergence
Suppose p is only known up to a normalizing constant, p(x) = p̃(x)/Z. It is then easier to minimize the reverse KL Divergence KL(q∥p) with respect to q when approximating p with q.
$$\begin{aligned}
\mathrm{KL}(q \,\|\, p) &= \int q(x) \log q(x) \, dx - \int q(x) \log p(x) \, dx \\
&= \int q(x) \log q(x) \, dx - \int q(x) \log \tilde{p}(x) \, dx + \int q(x) \log Z \, dx \\
&= \mathbb{E}_{x \sim q}\big[\log q(x)\big] - \mathbb{E}_{x \sim q}\big[\log \tilde{p}(x)\big] + \mathbb{E}_{x \sim q}\big[\log Z\big] \\
&= -H(q) - \mathbb{E}_{x \sim q}\big[\log \tilde{p}(x)\big] + \log Z
\end{aligned}$$
Since log Z is a constant, minimizing over q gives
$$\arg\min_{q} \, \mathrm{KL}(q \,\|\, p) = \arg\max_{q} \left\{ \mathbb{E}_{x \sim q}\big[\log \tilde{p}(x)\big] + H(q) \right\}$$
It is hard to optimize this expectation directly because the sampling distribution q itself depends on the parameters being optimized, so use the reparametrization trick.
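As a toy one-dimensional illustration of the trick (the objective f and the parameter value are made up for this sketch): to differentiate E_{x∼N(μ,1)}[f(x)] with respect to μ, write x = μ + z with z ∼ N(0, 1), so the samples become a deterministic function of μ and the gradient can be estimated from the same samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(x) = x**2, so E_{x~N(mu,1)}[f(x)] = mu**2 + 1 and d/dmu = 2*mu
f_grad = lambda x: 2 * x            # df/dx

mu = 1.5
z = rng.standard_normal(100_000)    # z ~ N(0, 1), independent of mu
x = mu + z                          # reparametrized samples x ~ N(mu, 1)

# Pathwise (reparametrization) gradient: d/dmu E[f(mu + z)] = E[f'(mu + z)]
grad_estimate = np.mean(f_grad(x))
print(grad_estimate, 2 * mu)        # estimate should be close to the true gradient 2*mu
```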
Example: q is a multivariate Gaussian.
Maximizing over q means maximizing over μ and Σ (the parameters of the distribution).
Then the entropy is
$$H(q) = \frac{1}{2} \log |\Sigma| + \frac{d}{2} \log 2\pi e,$$
so the maximization problem becomes
$$\arg\max_{\mu, \Sigma} \left\{ \mathbb{E}_{x \sim \mathcal{N}(\mu, \Sigma)}\big[\log \tilde{p}(x)\big] + \frac{1}{2} \log |\Sigma| \right\}$$
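The entropy expression above can be checked numerically; a small sketch (the covariance matrix is an arbitrary positive-definite example):

```python
import numpy as np
from scipy.stats import multivariate_normal

# An arbitrary 2-d example: any symmetric positive-definite covariance works
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
d = Sigma.shape[0]

# H(q) = 1/2 log|Sigma| + d/2 log(2*pi*e)
h_formula = 0.5 * np.log(np.linalg.det(Sigma)) + 0.5 * d * np.log(2 * np.pi * np.e)
h_scipy = multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy()

print(h_formula, h_scipy)  # should match
```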
Use the Cholesky decomposition to reparametrize q: Σ = LL⊤. That is, given a standard normal z ∼ N(0, I), a sample from q can be written as x = μ + Lz. Therefore, the maximization problem can be reparametrized as
$$\arg\max_{\mu, L} \left\{ \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\log \tilde{p}(\mu + Lz)\big] + \frac{1}{2} \log |\Sigma| \right\}, \qquad \Sigma = LL^{\top}$$
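A quick sanity check of the reparametrization (sketch; μ and Σ are arbitrary example values): samples μ + Lz should have mean μ and covariance LL⊤ = Σ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example parameters for q = N(mu, Sigma)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T
z = rng.standard_normal((100_000, 2))    # z ~ N(0, I)
x = mu + z @ L.T                         # reparametrized samples x ~ N(mu, Sigma)

print(x.mean(axis=0))            # ~ mu
print(np.cov(x, rowvar=False))   # ~ Sigma
```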
Note that the expectation can be easily estimated using Monte Carlo because z is easy to sample.
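Putting the pieces together, here is a minimal sketch of the whole procedure, assuming PyTorch; the unnormalized target log p̃, the learning rate, batch size, and iteration count are all illustrative choices. It maximizes the reparametrized objective by stochastic gradient ascent on μ and L, estimating the expectation with Monte Carlo samples of z.

```python
import torch

torch.manual_seed(0)
d = 2

# Unnormalized log-target log p~(x); here a Gaussian with known precision matrix,
# so the fitted q can be compared against the truth (illustrative choice).
target_mu = torch.tensor([1.0, -1.0])
target_Sigma_inv = torch.tensor([[1.0, 0.5], [0.5, 2.0]])
def log_p_tilde(x):
    diff = x - target_mu
    return -0.5 * torch.sum(diff @ target_Sigma_inv * diff, dim=-1)

# Variational parameters: mu and a square matrix whose lower-triangular
# part plays the role of L (Sigma = L L^T).
mu = torch.zeros(d, requires_grad=True)
L_raw = torch.eye(d, requires_grad=True)

opt = torch.optim.Adam([mu, L_raw], lr=0.05)
for step in range(2000):
    L = torch.tril(L_raw)                 # keep L lower-triangular
    z = torch.randn(256, d)               # z ~ N(0, I)
    x = mu + z @ L.T                      # reparametrized samples x ~ q
    # Objective: E_q[log p~(x)] + 1/2 log|Sigma|, with 1/2 log|Sigma| = sum(log|diag(L)|)
    entropy_term = torch.sum(torch.log(torch.abs(torch.diagonal(L))))
    loss = -(log_p_tilde(x).mean() + entropy_term)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.detach())  # should approach the target mean
# Fitted covariance L L^T should approach the inverse of target_Sigma_inv
print((torch.tril(L_raw) @ torch.tril(L_raw).T).detach())
```

Because the target here is itself Gaussian, the reverse KL is minimized when q matches it exactly, which makes the result easy to verify.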