Let D={Di:i=1…n}, where Di=(yi,Xi) and each Di is iid. Suppose that there is some parameter θ∈Θ. Then the joint probability density function (or probability mass function) P(D∣θ), also called the likelihood Ln(θ), is
$$L_n(\theta) = \prod_{i=1}^{n} P(D_i \mid \theta)$$
The joint PDF gives us the likelihood of the iid sample given θ. The negative log likelihood (NLL) then becomes
$$-\log L_n(\theta) = -\sum_{i=1}^{n} \log P(D_i \mid \theta)$$
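For example, if each observation is a scalar $x_i$ drawn iid from $N(\mu, \sigma^2)$ (a hypothetical choice purely for illustration), the NLL works out to

$$-\log L_n(\mu, \sigma) = \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$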
The goal is to have some estimate of θ that describes the samples that we observe. Therefore, we want to maximize the likelihood function Ln with respect to θ, or equivalently minimize the NLL. The maximum likelihood estimator (MLE) is defined as
$$\hat{\theta}_{MLE} \in \underset{\theta \in \Theta}{\operatorname{argmin}} \left\{ -\log L_n(\theta) \right\}$$
The MLE θ^MLE is the value of θ under which the observed data D is most likely. In general, θ^MLE is obtained using some form of numerical optimization over the NLL.
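A minimal sketch of this numerical optimization, assuming a univariate Normal model with simulated data; the optimizer choice (BFGS) and the log-σ parameterization are illustrative, not prescribed by anything above:

```python
# Sketch: MLE by numerically minimizing the NLL of an assumed Normal model.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # simulated iid sample

def nll(params, data):
    mu, log_sigma = params                      # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

res = minimize(nll, x0=np.array([0.0, 0.0]), args=(x,), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                        # should be close to 2.0 and 1.5
```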
Hessian and the Covariance Matrix of MLE
The covariance matrix of the parameter estimates θ^MLE is the negative inverse Hessian of the log likelihood function evaluated at θ^MLE (equivalently, the inverse of the information matrix, which is the negative of the expected value of the Hessian matrix).
$$\operatorname{Cov}(\hat{\theta}_{MLE}) = -\left[\frac{\partial^2 \log L_n(\hat{\theta}_{MLE})}{\partial \theta \, \partial \theta^\top}\right]^{-1}$$
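Continuing the Normal sketch above, the covariance can be approximated by inverting a finite-difference Hessian of the NLL at the minimizer. Since the NLL is already the negative log likelihood, its Hessian carries the minus sign from the formula, so no extra sign flip is needed; `numerical_hessian` is an illustrative helper, not a library routine:

```python
# Sketch (uses nll, res, x from the previous snippet): invert the Hessian of
# the NLL at the MLE to approximate the covariance of the estimates.
def numerical_hessian(f, theta, eps=1e-4):
    k = len(theta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.zeros(k), np.zeros(k)
            e_i[i], e_j[j] = eps, eps
            # central finite-difference approximation of the mixed second derivative
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * eps**2)
    return H

H = numerical_hessian(lambda t: nll(t, x), res.x)
cov_hat = np.linalg.inv(H)              # covariance in the (mu, log sigma) parameterization
std_errs = np.sqrt(np.diag(cov_hat))    # standard errors of the estimates
```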
Asymptotics of MLE
$$\hat{\theta}_{MLE} \xrightarrow{\;p\;} \theta \quad \text{as } n \to \infty$$
The MLE is consistent. It is also asymptotically normal, which can be shown via a Taylor expansion of the score function combined with the central limit theorem.
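Stated more precisely, under standard regularity conditions, with $\theta_0$ the true parameter and $I(\theta_0)$ the Fisher information,

$$\sqrt{n}\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{\;d\;} N\!\left(0,\, I(\theta_0)^{-1}\right)$$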
Exponential Family
If each Di (or xi) follows a distribution in an Exponential Family, then P(xi∣θ) can be written in the form $P(x_i \mid \theta) = h(x_i)\exp\left\{\eta(\theta)^\top T(x_i) - A(\theta)\right\}$, where $T(x_i)$ is the sufficient statistic. Some common members are:
Normal Distribution: $P(x_i \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right)$
Multivariate Normal Distribution: $P(x_i \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)\right)$
Poisson Distribution: $P(x_i \mid \lambda) = \frac{\lambda^{x_i} \exp(-\lambda)}{x_i!}$
A convenient property of distributions in the exponential family is that the MLE can be obtained by moment matching; for the examples above, the estimator is just the sample mean (and sample covariance). Thus, there is no need for any numerical optimization method such as grid search or gradient descent.
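A small sketch of this, using simulated Poisson and Normal samples purely for illustration:

```python
# Sketch: for these exponential-family models the MLE matches sample moments,
# so no iterative optimizer is needed.
import numpy as np

rng = np.random.default_rng(1)

# Poisson(lambda): the MLE of lambda is the sample mean.
counts = rng.poisson(lam=3.2, size=1000)
lambda_hat = counts.mean()

# Normal(mu, sigma^2): the MLEs are the sample mean and the 1/n sample variance.
x = rng.normal(loc=2.0, scale=1.5, size=1000)
mu_hat = x.mean()
sigma2_hat = x.var()    # numpy's default ddof=0 divides by n, which is the MLE
```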
Maximum a Posteriori (MAP) Estimation
MLE estimates θ by maximizing the likelihood of seeing D conditional on θ, but it is often more intuitive to think of learning θ conditional on observing D; that is, what we really want is the θ that has the highest probability given the data. From Bayes' rule we have
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \propto P(D \mid \theta)\, P(\theta)$$
Note that P(θ) should really be P(θ∣α), where α is the hyperparameter of the prior distribution of θ. Just as the MLE is obtained by minimizing the NLL, the maximum a posteriori (MAP) estimator is obtained by minimizing the negative log of the (unnormalized) posterior P(θ∣D,α) ∝ P(D∣θ)P(θ∣α).
$$\hat{\theta}_{MAP} \in \underset{\theta \in \Theta}{\operatorname{argmin}} \left\{ -\sum_{i=1}^{n} \log P(D_i \mid \theta) - \log P(\theta \mid \alpha) \right\}$$
The prior distribution acts as a regularizer, helping the estimator avoid overfitting, particularly when the number of observations is small. Note that as n→∞ the influence of the prior vanishes and θ^MAP→θ^MLE.
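A minimal sketch of MAP estimation, assuming (purely for illustration) a Normal likelihood with known σ and a N(μ0, τ²) prior on μ playing the role of P(θ∣α):

```python
# Sketch: MAP estimation by minimizing the NLL plus the negative log prior.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=20)    # small sample, so the prior matters

mu_0, tau = 0.0, 1.0                           # hypothetical prior hyperparameters (the alpha above)
sigma = 1.5                                    # treated as known for simplicity

def neg_log_posterior(params, data):
    mu = params[0]
    nll = -np.sum(norm.logpdf(data, loc=mu, scale=sigma))
    neg_log_prior = -norm.logpdf(mu, loc=mu_0, scale=tau)
    return nll + neg_log_prior

res = minimize(neg_log_posterior, x0=np.array([0.0]), args=(x,))
mu_map = res.x[0]    # shrunk toward mu_0 relative to the MLE x.mean(); the gap vanishes as n grows
```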
Conjugate Priors
If P(D∣θ) belongs to an Exponential Family, there exists a prior distribution for θ∣α such that the posterior θ∣D,α belongs to the same family of distributions as the prior. Some examples of conjugate priors are:
Beta-Bernoulli: x∣θ ∼ Bern(θ), θ∣α,β ∼ Beta(α,β)
Categorical-Dirichlet: x∣θ ∼ Cat(θ), θ∣α ∼ Dir(α)
Normal Distribution (σ known): x∣μ,σ ∼ N(μ,σ), μ∣μ0,σ0 ∼ N(μ0,σ0)
Multivariate Normal Distribution (Σ known): x∣μ,Σ ∼ N(μ,Σ), μ∣μ0,Σ0 ∼ N(μ0,Σ0)
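For instance, the Beta-Bernoulli pair gives a closed-form posterior update; the snippet below is a sketch with illustrative data and hyperparameters:

```python
# Sketch: Beta-Bernoulli conjugacy. With a Beta(alpha, beta) prior on theta and
# Bernoulli(theta) data, the posterior is Beta(alpha + sum(x), beta + n - sum(x)).
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, p=0.7, size=50)     # iid Bernoulli(0.7) draws

alpha, beta = 2.0, 2.0                  # illustrative prior hyperparameters
alpha_post = alpha + x.sum()
beta_post = beta + len(x) - x.sum()

posterior_mean = alpha_post / (alpha_post + beta_post)
theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)   # posterior mode (MAP estimate)
```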