Bayesian Inference

May 11, 2024

In a frequentist setting, a statistician or econometrician typically commits to one specific parameter estimate $\hat{\theta}$ after observing $\boldsymbol{y}$ and $\boldsymbol{X}$ and uses this parameter to dictate the probability of outcomes. Therefore, for unobserved data $\tilde{y}, \tilde{X}$ the likelihood is

$$P(\tilde{y}\mid \hat{\theta},\tilde{X})$$

However, in a dynamic real-world environment, it is often unrealistic to assume that the parameter remains fixed, given the continual changes and updates that influence underlying conditions.

Posterior Predictive

Instead of committing to one parameter to compute the likelihood of unobserved data, the posterior predictive averages the likelihood over all possible $\theta$, weighted by the posterior $P(\theta\mid \boldsymbol{y}, \boldsymbol{X}, \alpha)$ given the hyperparameter $\alpha$.

$$P(\tilde{y}\mid\tilde{X}, \boldsymbol{y}, \boldsymbol{X}, \alpha)=\int P(\tilde{y}\mid\theta, \tilde{X})\,P(\theta\mid \boldsymbol{y}, \boldsymbol{X}, \alpha)\,d\theta$$
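When this integral has no closed form, it can be approximated by Monte Carlo: draw samples of $\theta$ from the posterior and average the resulting likelihoods. A minimal sketch in Python, assuming a Gaussian likelihood with known noise scale and that posterior draws of $\theta$ are already available (all names here are illustrative):

```python
import numpy as np
from scipy.stats import norm

def posterior_predictive_mc(y_tilde, x_tilde, theta_samples, sigma):
    """Monte Carlo estimate of P(y_tilde | x_tilde, y, X, alpha).

    theta_samples: draws from the posterior P(theta | y, X, alpha),
    shape (n_samples, n_features). sigma is the (assumed known) noise std.
    """
    # Likelihood of y_tilde under each posterior draw: N(theta^T x_tilde, sigma^2)
    means = theta_samples @ x_tilde            # shape (n_samples,)
    likelihoods = norm.pdf(y_tilde, loc=means, scale=sigma)
    # Averaging over the draws approximates the integral over theta
    return likelihoods.mean()
```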

Bayesian Linear Regression

Consider the following linear framework: $y = X\theta + U$

$$U\mid X \sim N(0, \sigma^2\boldsymbol{I}),\quad \theta \sim N(0, \lambda^{-1}\boldsymbol{I})$$

Note that $U\mid X \sim N(0, \sigma^2\boldsymbol{I})$ implies $y\mid X, \theta \sim N(X\theta, \sigma^2\boldsymbol{I})$. Then, the posterior has the form

$$\theta \mid \boldsymbol{y}, \boldsymbol{X} \sim N\left(\theta_{MAP}, \left(\frac{1}{\sigma^2}X^\top X+\lambda \boldsymbol{I}\right)^{-1}\right),\quad \theta_{MAP} = \left(X^\top X+\lambda\sigma^2\boldsymbol{I}\right)^{-1}X^\top y$$

$$\tilde{y} \mid \boldsymbol{X}, \boldsymbol{y}, \tilde{X} \sim N\left(\theta_{MAP}^\top\tilde{X},\ \sigma^2+\tilde{X}^\top\left(\frac{1}{\sigma^2}X^\top X+\lambda \boldsymbol{I}\right)^{-1}\tilde{X}\right)$$
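These closed-form expressions translate directly into a few lines of linear algebra. A minimal NumPy sketch, assuming the noise variance $\sigma^2$ and prior precision $\lambda$ are known (the variable names are placeholders):

```python
import numpy as np

def bayesian_linear_regression(X, y, x_tilde, sigma2=1.0, lam=1.0):
    """Posterior over theta and predictive distribution for a new input x_tilde.

    Prior: theta ~ N(0, lam^{-1} I); noise: U | X ~ N(0, sigma2 I).
    """
    n, d = X.shape
    # Posterior precision and covariance: (X^T X / sigma2 + lam I)^{-1}
    precision = X.T @ X / sigma2 + lam * np.eye(d)
    Sigma_post = np.linalg.inv(precision)
    # Posterior mean / MAP estimate: (X^T X + lam * sigma2 * I)^{-1} X^T y
    theta_map = np.linalg.solve(X.T @ X + lam * sigma2 * np.eye(d), X.T @ y)
    # Predictive distribution for y_tilde at the new input x_tilde
    pred_mean = theta_map @ x_tilde
    pred_var = sigma2 + x_tilde @ Sigma_post @ x_tilde
    return theta_map, Sigma_post, pred_mean, pred_var
```

The predictive variance splits into the irreducible noise $\sigma^2$ plus a term that reflects remaining uncertainty about $\theta$, which shrinks as more data is observed.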

Gaussian Process

A Gaussian process is a distribution over functions. For a more general case that extends to a non-linear framework, suppose $y = f(X) + U$. Then $f$ follows a Gaussian process

$$f(X) \sim N(m(X), K(X,X'))$$
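Evaluated at any finite set of inputs, a Gaussian process reduces to a multivariate normal over the function values. A small sketch of drawing prior function samples, assuming a zero mean function and a kernel passed in as an argument (both are placeholders, not a specific library API):

```python
import numpy as np

def sample_gp_prior(X, kernel, n_samples=5, jitter=1e-8):
    """Draw function values f(X) ~ N(0, K(X, X)) at inputs X of shape (n, d)."""
    K = kernel(X, X)                      # covariance matrix K(X, X')
    K = K + jitter * np.eye(len(X))       # small jitter for numerical stability
    rng = np.random.default_rng(0)
    # Each row is one sampled function evaluated at the inputs X
    return rng.multivariate_normal(np.zeros(len(X)), K, size=n_samples)
```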

Linear Basis Function

For example, $f$ can depend on the parameter $\theta$ where $\theta \sim N(0, \lambda^{-1}\boldsymbol{I})$. Consider the case $f(X) = \theta^\top \phi(X)$. Then, $m(X)$ is as follows

$$\begin{align*} m(X) & = E(\theta^\top \phi(X)) \\ & = E(\theta^\top)\phi(X) \\ & = 0 \end{align*}$$
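The covariance can be worked out the same way; using $E(\theta\theta^\top)=\lambda^{-1}\boldsymbol{I}$, which follows from the prior above, the kernel implied by this linear basis model is

$$\begin{align*} K(X, X') & = \mathrm{cov}(f(X), f(X')) = E\left(\theta^\top\phi(X)\,\theta^\top\phi(X')\right) \\ & = \phi(X)^\top E(\theta\theta^\top)\phi(X') \\ & = \lambda^{-1}\phi(X)^\top\phi(X') \end{align*}$$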

Kernel

The kernel $K$ determines the covariance over the functions, where $K(X, X') = \mathrm{cov}(f(X), f(X'))$. A common choice is the RBF kernel.

$$K(X,X') = \alpha^2\exp\left(-\frac{\lVert X-X'\rVert^2}{2l^2}\right)$$
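A minimal implementation of the RBF kernel above, which can be plugged into the GP sampler sketched earlier (the amplitude $\alpha$ and lengthscale $l$ are hyperparameters):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, lengthscale=1.0):
    """K(x, x') = alpha^2 * exp(-||x - x'||^2 / (2 l^2)) for all pairs of rows.

    X1: array of shape (n, d); X2: array of shape (m, d).
    """
    sq_dists = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] \
               - 2 * X1 @ X2.T
    return alpha**2 * np.exp(-sq_dists / (2 * lengthscale**2))
```

For example, `sample_gp_prior(np.linspace(-3, 3, 50).reshape(-1, 1), rbf_kernel)` draws smooth random functions whose wiggliness is controlled by the lengthscale.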


Empirical Bayes

Instead of using cross validation to find the best hyperparameter $\alpha$, which can potentially be biased towards the validation set, an alternative approach is to use Empirical Bayes.

$$\hat{\alpha} \in \underset{\alpha}{\arg\max}\left\{P(\boldsymbol{y}, \boldsymbol{X}\mid\alpha)\right\}$$

$$P(\boldsymbol{y}, \boldsymbol{X}\mid\alpha) = \int P(\boldsymbol{y}, \boldsymbol{X}\mid\theta)\,P(\theta\mid\alpha)\,d\theta$$
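For the Bayesian linear regression above, the marginal likelihood is available in closed form: integrating out $\theta$ gives $y \mid X, \alpha \sim N(0, \lambda^{-1}XX^\top + \sigma^2\boldsymbol{I})$. A sketch of Empirical Bayes via a simple grid search over $\lambda$, treating $\sigma^2$ as known (all names are illustrative, not a library API):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, y, lam, sigma2):
    """log P(y | X, alpha) for Bayesian linear regression, alpha = (lam, sigma2).

    Integrating out theta ~ N(0, lam^{-1} I) gives
    y | X ~ N(0, lam^{-1} X X^T + sigma2 I).
    """
    n = X.shape[0]
    cov = X @ X.T / lam + sigma2 * np.eye(n)
    return multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

def empirical_bayes_grid(X, y, sigma2=1.0, lams=np.logspace(-3, 3, 25)):
    """Pick the prior precision lam that maximizes the marginal likelihood."""
    scores = [log_marginal_likelihood(X, y, lam, sigma2) for lam in lams]
    return lams[int(np.argmax(scores))]
```

A gradient-based optimizer could replace the grid search, but the grid keeps the sketch self-contained and makes the objective easy to inspect.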