MLE and MAP

May 05, 2024

Maximum Likelihood Estimation (MLE)

Let $D = \{D_i : i = 1\dots n\}$ where $D_i = (y_i, X_i)$ and each $D_i$ is iid. Suppose that there is some parameter $\theta \in \Theta$. Then the joint probability density function (or probability mass function) $P(D\mid\theta)$, or the likelihood $L_n(\theta)$, is

$$L_n(\theta) = \prod_{i=1}^n P(D_i\mid\theta)$$

The joint PDF gives us the likelihood of the iid sample given $\theta$. The negative log likelihood (NLL) then becomes

$$-\log{L_n(\theta)} = -\sum_{i=1}^n \log{P(D_i\mid\theta)}$$

The goal is to have some estimate of $\theta$ that describes the samples that we observe. Therefore, we want to maximize the likelihood function $L_n$ with respect to $\theta$, or equivalently minimize the NLL. The maximum likelihood estimator (MLE) is defined as

$$\hat{\theta}_{MLE} \in \underset{\theta \in \Theta}{\arg\min}\,\{-\log{L_n(\theta)}\}$$

The MLE $\hat{\theta}_{MLE}$ is the value of $\theta$ under which we are most likely to observe $D$. In general, $\hat{\theta}_{MLE}$ is obtained using some form of numerical optimization over the NLL.
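As a minimal sketch of this numerical route (not from the original post: it assumes a Normal model, simulated data, and uses scipy.optimize.minimize):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # simulated iid sample (illustrative)

def nll(params, x):
    """Negative log likelihood of an assumed Normal(mu, sigma) model."""
    mu, log_sigma = params              # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                   - 0.5 * ((x - mu) / sigma) ** 2)

res = minimize(nll, x0=np.array([0.0, 0.0]), args=(data,), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)        # numerical MLE
print(data.mean(), data.std())  # closed-form Normal MLE: sample mean and (biased) std
```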

Hessian and the Covariance Matrix of MLE

The covariance matrix of the parameter estimates $\hat{\theta}_{MLE}$ is estimated by the negative inverse Hessian of the log likelihood evaluated at $\hat{\theta}_{MLE}$ (or the inverse of the information matrix, where the information matrix is the negative of the expected value of the Hessian).

$$Cov(\hat{\theta}_{MLE}) = -\left[\frac{\partial^2\log L_n(\hat{\theta}_{MLE})}{\partial \theta\, \partial \theta^{\top}}\right]^{-1}$$
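As an illustrative one-parameter check (not in the original post), take an iid Bernoulli($\theta$) sample with $\hat{\theta}_{MLE} = \bar{x}$:

$$\frac{\partial^2\log L_n(\theta)}{\partial\theta^2} = -\sum_{i=1}^n\left[\frac{x_i}{\theta^2} + \frac{1-x_i}{(1-\theta)^2}\right], \qquad -\left[\left.\frac{\partial^2\log L_n(\theta)}{\partial\theta^2}\right|_{\theta=\hat{\theta}}\right]^{-1} = \frac{\hat{\theta}(1-\hat{\theta})}{n},$$

which recovers the familiar variance of a sample proportion.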

Asymptotics of MLE

$$\hat{\theta}_{MLE} \xrightarrow{p} \theta \quad \text{as } n\to\infty$$

The MLE is consistent: it converges in probability to the true parameter as the sample size grows. It is also asymptotically normal, which is typically shown via a Taylor expansion of the score function together with the central limit theorem.
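More precisely, under standard regularity conditions,

$$\sqrt{n}\left(\hat{\theta}_{MLE} - \theta\right) \xrightarrow{d} N\!\left(0,\, I(\theta)^{-1}\right),$$

where $I(\theta)$ is the Fisher information of a single observation; this is also what justifies the inverse-Hessian covariance estimate above.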

Exponential Family

If each $D_i$ (or $x_i$) follows a distribution belonging to an exponential family, then $P(x_i\mid\theta)$ has the form

$$\frac{h(x_i)\exp\left(\eta(\theta)^\top s(x_i)\right)}{Z(\theta)}$$

Examples of exponential families include

  • Bernoulli Distribution: $P(x_i\mid\theta) = \theta^{1(x_i = 1)}(1-\theta)^{1(x_i = 0)}$
  • Categorical Distribution: $P(x_i\mid \boldsymbol{\theta}) = \prod_{c=1}^k\theta_c^{1(x_i = c)}$
  • Normal Distribution: $P(x_i\mid\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2\right)$
  • Multivariate Normal Distribution: $P(\boldsymbol{x}_i\mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\det(\boldsymbol{\Sigma})^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x}_i-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_i-\boldsymbol{\mu})\right)$
  • Poisson Distribution: $P(x_i \mid \lambda) = \frac{1}{x_i!}\lambda^{x_i}\exp(-\lambda)$
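To see how one of these fits the template, the Bernoulli PMF can be rewritten (an illustrative rearrangement) as

$$P(x_i\mid\theta) = \theta^{x_i}(1-\theta)^{1-x_i} = (1-\theta)\exp\!\left(x_i\log\frac{\theta}{1-\theta}\right),$$

so that $h(x_i)=1$, $s(x_i)=x_i$, $\eta(\theta)=\log\frac{\theta}{1-\theta}$, and $Z(\theta)=\frac{1}{1-\theta}$.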

A convenient property of exponential-family distributions is that the MLE can be obtained by moment matching, i.e., the estimator is just the sample mean (and sample covariance). Thus, there is no need for any numerical optimization method such as grid search or gradient descent.
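For example, setting the derivative of the Bernoulli NLL to zero gives a closed-form estimator (a standard derivation, included here for illustration):

$$\frac{\partial}{\partial\theta}\left[-\sum_{i=1}^n\left(x_i\log\theta + (1-x_i)\log(1-\theta)\right)\right] = 0 \quad\Longrightarrow\quad \hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i,$$

and the analogous calculation for the Normal gives $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$.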

Maximum a Posteriori (MAP) Estimation

MLE estimates $\theta$ by maximizing the likelihood of seeing $D$ conditional on $\theta$, but it is often more intuitive to think of learning $\theta$ conditional on observing $D$, i.e., what we really want is the $\theta$ that has the highest probability given the data. From Bayes' rule we have

\begin{align*} P(\theta\mid D) &= \frac{P(D\mid \theta)P(\theta)}{P(D)} \\ &\propto P(D\mid \theta)P(\theta) \end{align*}

Note that $P(\theta)$ should really be $P(\theta\mid\alpha)$, where $\alpha$ is the parameter of the prior distribution of $\theta$. In the same way that the MLE was obtained by minimizing the NLL, the maximum a posteriori (MAP) estimator is obtained by minimizing the negative log posterior, where $P(\theta\mid D, \alpha) \propto P(D\mid \theta)P(\theta\mid\alpha)$.

$$\hat{\theta}_{MAP} \in \underset{\theta \in \Theta}{\arg\min} \left\{-\sum_{i=1}^n \log{P(D_i\mid\theta)} - \log{P(\theta\mid\alpha)}\right\}$$

The prior distribution acts as a regularizer, ensuring that the estimator does not overfit, particularly when the number of observations is small. Note that as $n\to\infty$, $\hat{\theta}_{MAP} \to \hat{\theta}_{MLE}$.
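A minimal sketch of this regularizing effect, assuming a Bernoulli likelihood with a Beta($\alpha,\beta$) prior (the data, hyperparameters, and sample sizes below are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 2.0   # prior hyperparameters (pseudo-counts), chosen for illustration
true_theta = 0.7

for n in (10, 100, 10_000):
    x = rng.binomial(1, true_theta, size=n)                # iid Bernoulli sample
    k = x.sum()
    theta_mle = k / n                                      # maximizes the likelihood alone
    theta_map = (k + alpha - 1) / (n + alpha + beta - 2)   # mode of Beta(alpha + k, beta + n - k)
    print(n, round(theta_mle, 3), round(theta_map, 3))

# For small n the prior pulls the MAP estimate toward 0.5 (its regularizing effect);
# as n grows, the MLE and MAP estimates converge, as noted above.
```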

Conjugate Priors

If $P(D\mid \theta)$ belongs to an exponential family, there is a prior distribution for $\theta\mid\alpha$ such that the posterior $\theta\mid D,\alpha$ belongs to the same family of distributions as the prior. Some examples of conjugate priors are listed below; a worked Beta-Bernoulli example follows the list:

  • Beta-Bernoulli: $x\mid\theta \sim Bern(\theta)$, $\theta\mid\alpha,\beta \sim Beta(\alpha,\beta)$
  • Categorical-Dirichlet: $x\mid\boldsymbol{\theta} \sim Cat(\boldsymbol{\theta})$, $\boldsymbol{\theta}\mid\boldsymbol{\alpha} \sim Dir(\boldsymbol{\alpha})$
  • Normal-Normal (known $\sigma$): $x\mid\mu,\sigma \sim N(\mu,\sigma)$, $\mu\mid\mu_0,\sigma_0 \sim N(\mu_0,\sigma_0)$
  • Multivariate Normal-Normal (known $\boldsymbol{\Sigma}$): $\boldsymbol{x}\mid \boldsymbol{\mu}, \boldsymbol{\Sigma} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $\boldsymbol{\mu}\mid\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0 \sim N(\boldsymbol{\mu}_0,\boldsymbol{\Sigma}_0)$
  • Gamma-Poisson: $x\mid \lambda \sim Poisson(\lambda)$, $\lambda\mid\alpha,\beta \sim Gamma(\alpha,\beta)$
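As a concrete instance of conjugacy, take the Beta-Bernoulli pair from the list above: with $k = \sum_{i=1}^n x_i$ observed successes,

$$P(\theta\mid D,\alpha,\beta) \propto \underbrace{\theta^{k}(1-\theta)^{n-k}}_{P(D\mid\theta)}\,\underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{P(\theta\mid\alpha,\beta)} = \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1},$$

which is again a Beta distribution, $Beta(\alpha+k,\ \beta+n-k)$; the hyperparameters act as pseudo-counts of prior successes and failures.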