Let D={Di:i=1…n}, where Di=(yi,Xi) and each Di is iid. Suppose that there is some parameter θ∈Θ. Then the joint probability density function (or probability mass function) P(D∣θ), also called the likelihood Ln(θ), is
$$L_n(\theta) = \prod_{i=1}^{n} P(D_i \mid \theta)$$
The joint PDF gives us the likelihood of the iid sample given θ. The negative log likelihood (NLL) then becomes
$$-\log L_n(\theta) = -\sum_{i=1}^{n} \log P(D_i \mid \theta)$$
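For example, if each observation is a scalar $x_i$ drawn iid from $N(\mu, \sigma^2)$ (a hypothetical choice purely for illustration), the NLL works out to

$$-\log L_n(\mu, \sigma) = \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i - \mu\right)^2$$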
The goal is to have some estimate of θ that describes the samples that we observe. Therefore, we want to maximize the likelihood function Ln with respect to θ, or equivalently minimize the NLL. The maximum likelihood estimator (MLE) is defined as
$$\hat{\theta}_{MLE} \in \underset{\theta \in \Theta}{\operatorname{argmin}} \left\{ -\log L_n(\theta) \right\}$$
The MLE θ^MLE is the value of θ under which the observed data D is most likely. In general, θ^MLE is obtained using some form of numerical optimization over the NLL.
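A minimal sketch of this numerical optimization, assuming a univariate Normal model with simulated data; the optimizer choice (BFGS) and the log-σ parameterization are illustrative, not prescribed by anything above:

```python
# Sketch: MLE by numerically minimizing the NLL of an assumed Normal model.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # simulated iid sample

def nll(params, data):
    mu, log_sigma = params                      # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

res = minimize(nll, x0=np.array([0.0, 0.0]), args=(x,), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                        # should be close to 2.0 and 1.5
```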
Hessian and the Covariance Matrix of MLE
The covariance matrix of the parameter estimates θ^MLE is the negative inverse Hessian of the log likelihood function evaluated at θ^MLE (equivalently, the inverse of the information matrix, which is the negative of the expected value of the Hessian matrix).
$$\operatorname{Cov}(\hat{\theta}_{MLE}) = -\left[\frac{\partial^2 \log L_n(\hat{\theta}_{MLE})}{\partial \theta \, \partial \theta^\top}\right]^{-1}$$
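Continuing the Normal sketch above, the covariance can be approximated by inverting a finite-difference Hessian of the NLL at the minimizer. Since the NLL is already the negative log likelihood, its Hessian carries the minus sign from the formula, so no extra sign flip is needed; `numerical_hessian` is an illustrative helper, not a library routine:

```python
# Sketch (uses nll, res, x from the previous snippet): invert the Hessian of
# the NLL at the MLE to approximate the covariance of the estimates.
def numerical_hessian(f, theta, eps=1e-4):
    k = len(theta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.zeros(k), np.zeros(k)
            e_i[i], e_j[j] = eps, eps
            # central finite-difference approximation of the mixed second derivative
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * eps**2)
    return H

H = numerical_hessian(lambda t: nll(t, x), res.x)
cov_hat = np.linalg.inv(H)              # covariance in the (mu, log sigma) parameterization
std_errs = np.sqrt(np.diag(cov_hat))    # standard errors of the estimates
```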
Asymptotics of MLE
$$\hat{\theta}_{MLE} \xrightarrow{\;p\;} \theta \quad \text{as } n \to \infty$$
The MLE is consistent. It is also asymptotically normal, which can be shown via a Taylor expansion of the score function combined with the central limit theorem.
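Stated more precisely, under standard regularity conditions, with $\theta_0$ the true parameter and $I(\theta_0)$ the Fisher information,

$$\sqrt{n}\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{\;d\;} N\!\left(0,\, I(\theta_0)^{-1}\right)$$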
Exponential Family
If each Di (or xi) follows a distribution in an Exponential Family, then P(xi∣θ) can be written in the form $P(x_i \mid \theta) = h(x_i)\exp\left\{\eta(\theta)^\top T(x_i) - A(\theta)\right\}$, where $T(x_i)$ is the sufficient statistic. Some common members are:
Normal Distribution: $P(x_i \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right)$
Multivariate Normal Distribution: $P(x_i \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)\right)$
Poisson Distribution: $P(x_i \mid \lambda) = \frac{\lambda^{x_i} \exp(-\lambda)}{x_i!}$
A convenient property of distributions in the exponential family is that the MLE can be obtained by moment matching; for the examples above, the estimator is just the sample mean (and sample covariance). Thus, there is no need for any numerical optimization method such as grid search or gradient descent.
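A small sketch of this, using simulated Poisson and Normal samples purely for illustration:

```python
# Sketch: for these exponential-family models the MLE matches sample moments,
# so no iterative optimizer is needed.
import numpy as np

rng = np.random.default_rng(1)

# Poisson(lambda): the MLE of lambda is the sample mean.
counts = rng.poisson(lam=3.2, size=1000)
lambda_hat = counts.mean()

# Normal(mu, sigma^2): the MLEs are the sample mean and the 1/n sample variance.
x = rng.normal(loc=2.0, scale=1.5, size=1000)
mu_hat = x.mean()
sigma2_hat = x.var()    # numpy's default ddof=0 divides by n, which is the MLE
```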
Maximum a Posteriori (MAP) Estimation
MLE estimates θ by maximizing the likelihood of seeing D conditional on θ, but it is often more intuitive to think of learning θ conditional on observing D; that is, what we really want is the θ that has the highest probability given the data. From Bayes' rule we have
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} \propto P(D \mid \theta)\, P(\theta)$$
Note that P(θ) should really be P(θ∣α), where α is the hyperparameter of the prior distribution of θ. Just as the MLE is obtained by minimizing the NLL, the maximum a posteriori (MAP) estimator is obtained by minimizing the negative log of the (unnormalized) posterior P(θ∣D,α) ∝ P(D∣θ)P(θ∣α).
$$\hat{\theta}_{MAP} \in \underset{\theta \in \Theta}{\operatorname{argmin}} \left\{ -\sum_{i=1}^{n} \log P(D_i \mid \theta) - \log P(\theta \mid \alpha) \right\}$$
The prior distribution acts as a regularizer, helping the estimator avoid overfitting, particularly when the number of observations is small. Note that as n→∞ the influence of the prior vanishes and θ^MAP→θ^MLE.
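A minimal sketch of MAP estimation, assuming (purely for illustration) a Normal likelihood with known σ and a N(μ0, τ²) prior on μ playing the role of P(θ∣α):

```python
# Sketch: MAP estimation by minimizing the NLL plus the negative log prior.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=20)    # small sample, so the prior matters

mu_0, tau = 0.0, 1.0                           # hypothetical prior hyperparameters (the alpha above)
sigma = 1.5                                    # treated as known for simplicity

def neg_log_posterior(params, data):
    mu = params[0]
    nll = -np.sum(norm.logpdf(data, loc=mu, scale=sigma))
    neg_log_prior = -norm.logpdf(mu, loc=mu_0, scale=tau)
    return nll + neg_log_prior

res = minimize(neg_log_posterior, x0=np.array([0.0]), args=(x,))
mu_map = res.x[0]    # shrunk toward mu_0 relative to the MLE x.mean(); the gap vanishes as n grows
```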
Conjugate Priors
If P(D∣θ) belongs to an Exponential Family, there exists a prior distribution for θ∣α such that the posterior θ∣D,α belongs to the same family of distributions as the prior. Some examples of conjugate priors are:
Beta-Bernoulli: x∣θ ∼ Bern(θ), θ∣α,β ∼ Beta(α,β)
Categorical-Dirichlet: x∣θ ∼ Cat(θ), θ∣α ∼ Dir(α)
Normal Distribution (σ known): x∣μ,σ ∼ N(μ,σ), μ∣μ0,σ0 ∼ N(μ0,σ0)
Multivariate Normal Distribution (Σ known): x∣μ,Σ ∼ N(μ,Σ), μ∣μ0,Σ0 ∼ N(μ0,Σ0)
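For instance, the Beta-Bernoulli pair gives a closed-form posterior update; the snippet below is a sketch with illustrative data and hyperparameters:

```python
# Sketch: Beta-Bernoulli conjugacy. With a Beta(alpha, beta) prior on theta and
# Bernoulli(theta) data, the posterior is Beta(alpha + sum(x), beta + n - sum(x)).
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, p=0.7, size=50)     # iid Bernoulli(0.7) draws

alpha, beta = 2.0, 2.0                  # illustrative prior hyperparameters
alpha_post = alpha + x.sum()
beta_post = beta + len(x) - x.sum()

posterior_mean = alpha_post / (alpha_post + beta_post)
theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)   # posterior mode (MAP estimate)
```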