Linear Models

May 02, 2024

Linear Classifiers

Let us assume that a certain outcome $y$ is a linear combination of $X_1, \dots, X_d$ and some unobservable $U$.

$$y = X\beta + U, \qquad y \in \mathbb{R}^{n \times 1},\; X \in \mathbb{R}^{n \times d},\; \beta \in \mathbb{R}^{d \times 1},\; U \in \mathbb{R}^{n \times 1}$$

OLS Estimator

We want to minimize the objective function with respect to $\beta$.

$$f(\beta) = \frac{1}{2}||y-X\beta||^2 = \frac{1}{2}(y-X\beta)^\top (y-X\beta)$$

Minimizing $f(\beta)$ gives us the OLS estimator $\hat{\beta}$. Note that $f(\beta)$ is convex, so any local minimum is a global minimum. However, we must assume that $X$ has full column rank (i.e., no multicollinearity); otherwise the solution is not unique.

$$\begin{aligned} \hat{\beta} &\in \underset{\beta}{\arg \min} \left\{\frac{1}{2}(y-X\beta)^\top (y-X\beta)\right\} \\ &= (X^\top X)^{-1}X^\top y \end{aligned}$$

If we have the matrix $X$ and the vector $y$, then the OLS estimator is very easy to compute (a short NumPy sketch follows the list below). The overall computational cost is $O(nd^2 + d^3)$ because:

  • Computing $X^\top y$ is $O(nd)$.
  • Computing $X^\top X$ is $O(nd^2)$.
  • Solving $\hat{\beta} = (X^\top X)^{-1}X^\top y$ is $O(d^3)$.
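As a rough illustration, here is a minimal NumPy sketch of the closed-form estimator; the simulated data and the true coefficients are made up for the example, and the normal equations are solved directly rather than forming the inverse explicitly.

```python
import numpy as np

# Simulated data for illustration only: n samples, d features, known beta.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve (X^T X) beta = X^T y instead of inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true
```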

Robust Regression

$$f(\beta) = ||y-X\beta||_1 = \sum_{i=1}^n |y_i-X_i\beta|$$

The sum of squared errors is sensitive to outliers, while the L1 norm of the errors is far less affected by them. The idea is that squaring a large residual inflates an outlier's contribution much more than taking its absolute value does.
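A tiny numeric illustration of this point, with residual values made up for the example: the squared penalty lets a single outlier dominate the total, while the absolute penalty keeps its contribution proportional.

```python
import numpy as np

# Made-up residuals with one outlier (the last entry).
residuals = np.array([0.5, -0.3, 0.8, 10.0])
print(np.sum(residuals**2))        # ~101.0 -- dominated by the outlier
print(np.sum(np.abs(residuals)))   # 11.6  -- outlier contributes linearly
```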

Brittle Regression

$$f(\beta) = ||y-X\beta||_\infty = \max_i\{|y_i-X_i\beta|\}$$
$$\max_i\{|y_i-X_i\beta|\} \approx \log\left(\sum_{i=1}^n\exp(|y_i-X_i\beta|)\right)$$

Brittle regression is very sensitive to outliers, since only the largest absolute residual matters. Therefore, if we want the line of best fit to accommodate the outliers, we would minimize this objective function.
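As a quick sanity check on the log-sum-exp approximation (with residual values made up for the example), note that it is a smooth approximation that stays close to the true max:

```python
import numpy as np

# Pretend these are the absolute residuals |y_i - X_i beta|.
r = np.array([0.3, 1.2, 4.0])
print(np.max(r))                    # 4.0
print(np.log(np.sum(np.exp(r))))    # ~4.08, a smooth approximation of the max
```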

Lasso

$$f(\beta) = \frac{1}{2}||y-X\beta||^2 + \alpha||\beta||_1$$

Here $||\beta||_1 = \sum_{j=1}^d|\beta_j|$ (the L1 norm, i.e., the sum of absolute values) and $\alpha$ is the regularization parameter. The objective now combines the squared error with the size of the $\beta_j$'s, so the larger each $|\beta_j|$ is, the larger the penalty on the objective function. To minimize the objective, we therefore want to shrink some of the $\beta_j$'s.

$$\hat{\beta}_{Lasso} = \underset{\beta}{\arg \min}\left\{\frac{1}{2}||y-X\beta||^2 + \alpha||\beta||_1\right\}$$

Note that the L1 norm is not differentiable wherever some $\beta_j = 0$ (i.e., it is convex, but it is not smooth). Unlike OLS, there is no closed-form solution for $\hat{\beta}_{Lasso}$ in general, but the objective is still convex and can be minimized numerically (e.g., by coordinate descent).

Minimizing the objective function shrinks some coefficients all the way to $|\hat{\beta}_j| = 0$ for some $j$'s. This in turn is useful for feature selection. The higher $\alpha$ is, the more coefficients are shrunk to zero and the fewer features are kept.
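A minimal sketch using scikit-learn's `Lasso` on simulated data (the features, true coefficients, and `alpha` are made up for the example). Note that scikit-learn scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\alpha$ above, but the shrinkage behaviour is the same: irrelevant coefficients are driven to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Simulated data where only 3 of the 10 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # coefficients on the irrelevant features are (near) zero
```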

Ridge

$$f(\beta) = \frac{1}{2}||y-X\beta||^2 + \frac{\alpha}{2}||\beta||^2$$

Instead of the L1 norm used in the lasso, we use the (squared) L2 norm of $\beta$ as the penalty term.

$$\begin{aligned} \hat{\beta}_{Ridge} &\in \underset{\beta}{\arg \min}\left\{\frac{1}{2}||y-X\beta||^2 + \frac{\alpha}{2}||\beta||^2\right\} \\ &= (X^\top X + \alpha I)^{-1}X^\top y \end{aligned}$$

Note that $X^\top X + \alpha I$ is always full rank (for $\alpha > 0$), so ridge regression can be a solution for multicollinear data.
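A minimal sketch of the ridge closed form on made-up data with a perfectly collinear (duplicated) feature, which is exactly the case where plain OLS breaks down:

```python
import numpy as np

# Two identical columns make X^T X singular, but X^T X + alpha*I is not.
rng = np.random.default_rng(0)
n, alpha = 100, 1.0
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1, rng.normal(size=n)])
y = 2 * x1 + rng.normal(scale=0.1, size=n)

beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)   # the weight on x1 is split across the two identical columns
```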

Ridge regression makes the model less prone to overfitting. As $\alpha$ increases, the variance of the estimator decreases but the bias increases, so the test error typically falls at first and then rises again once the model starts to underfit.

Binary

Now what if $y\in\{0,1\}$? The linear models above won't do justice, because for particular values of $X$ the prediction $\hat{y} = X\hat{\beta}$ can be far greater than $1$ or far less than $0$ (unless all coefficients in $\hat{\beta}$ are $0$).

Hinge Loss

Let us assume that instead of $y\in\{0,1\}$ we have $y\in\{-1,1\}$: we assign $y_i = 1$ if the label is true and $y_i = -1$ if it is false. We don't want to use OLS for this binary classification problem.

What we want is a 0-1 loss: if the prediction is correct we add a penalty of $0$, and if it's wrong we add a penalty of $1$. But this is hard to minimize because it's not convex, so we use the hinge loss as a convex approximation of the 0-1 loss.

$$f(\beta) = \sum_{i=1}^n\max\{0, 1-y_i\beta^\top X_i\}$$

The idea is that if $y_i > 0$ and $\beta^\top X_i > 0$, then we made the right prediction, and likewise if $y_i < 0$ and $\beta^\top X_i < 0$. Therefore, if $y_i\beta^\top X_i > 0$, we have the right prediction (i.e., the loss should be $0$).

$$\text{error}_i = \begin{cases} 0 & \text{if } y_i\beta^\top X_i > 0 \\ -y_i\beta^\top X_i & \text{if } y_i\beta^\top X_i < 0 \end{cases}$$

However, the corresponding loss function $f = \sum_{i=1}^n\max\{0, -y_i\beta^\top X_i\}$ has the degenerate minimizer $\beta = 0$. Therefore, instead of letting $y_i\beta^\top X_i > 0$ count as correctly classifying $i$, we require $y_i\beta^\top X_i \geq 1$. This is the non-degenerate form, and the new error becomes

$$\text{error}_i = \begin{cases} 0 & \text{if } y_i\beta^\top X_i - 1 \geq 0 \\ 1-y_i\beta^\top X_i & \text{if } y_i\beta^\top X_i - 1 < 0 \end{cases}$$

Summing the new error terms over all $i$ gives the hinge loss function.
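A minimal sketch of the hinge loss on a few made-up samples; points classified correctly with a margin of at least $1$ contribute nothing.

```python
import numpy as np

def hinge_loss(beta, X, y):
    # margins y_i * beta^T X_i; penalty only when the margin falls below 1
    margins = y * (X @ beta)
    return np.sum(np.maximum(0.0, 1.0 - margins))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])
beta = np.array([0.5, 0.5])
print(hinge_loss(beta, X, y))   # only the misclassified second point is penalized
```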

SVM

SVM is the hinge loss with L2 regularization:

$$f(\beta) = \sum_{i=1}^n\max\{0, 1-y_i\beta^\top X_i\} + \frac{\alpha}{2}||\beta||^2$$

Logistic Loss

Notice that $\max\{0, -y_i\beta^\top X_i\} \approx \log(\exp(0)+\exp(-y_i\beta^\top X_i)) = \log(1+\exp(-y_i\beta^\top X_i))$.

$$f(\beta) = \sum_{i=1}^n \log(1+\exp(-y_i\beta^\top X_i))$$
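A minimal sketch of the logistic loss on the same kind of made-up data; `np.logaddexp(0, z)` computes $\log(1+\exp(z))$ in a numerically stable way.

```python
import numpy as np

def logistic_loss(beta, X, y):
    # log(1 + exp(-y_i * beta^T X_i)) summed over samples, computed stably
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])
beta = np.array([0.5, 0.5])
print(logistic_loss(beta, X, y))
```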

Sigmoid

Let $z_i = \beta^\top X_i$. Then to get the probability that $y_i = 1$, we can use the sigmoid function.

$$h(z_i) = \frac{1}{1+\exp(-z_i)}$$
$$P(y_i = 1 \mid \beta, X_i) = \frac{1}{1+\exp(-\beta^\top X_i)}$$
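A minimal sketch of the sigmoid mapping scores to probabilities; the scores below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])          # pretend these are beta^T X_i
print(sigmoid(z))                        # [~0.05, 0.5, ~0.95]: P(y_i = 1)
```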

Multi-Class Linear Classifiers (Multi-Class SVM)

Now $\beta$ is $d \times k$, where $k$ is the number of classes we want to classify. Let $\beta_c$ be the $c$'th column of the $\beta$ matrix (a $d$-vector); then $\beta_c^\top X_i$ gives the prediction score for class $c$.

$$\hat{y}_i = \underset{c}{\arg\max}\{\beta_c^\top X_i\}$$

Going back to the hinge loss, we avoided the trivial solution by aiming for $y_i\beta^\top X_i \geq 1$, the non-degenerate form.

Let $y_i$ be the true class for sample $i$. We want $\beta_{y_i}^\top X_i > \beta_c^\top X_i$ for all $c\neq y_i$, but we use $\beta_{y_i}^\top X_i \geq \beta_c^\top X_i + 1$.

There are two possible losses for multi-class linear classifiers.

$$\sum_{c\neq y_i} \max\{0, 1-\beta_{y_i}^\top X_i+\beta_c^\top X_i\}$$
$$\max_{c\neq y_i}\left\{\max\{0, 1-\beta_{y_i}^\top X_i+\beta_c^\top X_i\}\right\}$$

The sum penalizes every class $c$ that violates the constraint for sample $i$, while the max penalizes only the single class $c$ that violates the constraint the most.

To get Multi-Class SVM, add the L2-regularization.

$$f(\beta) = \sum_{i=1}^n\sum_{c\neq y_i} \max\{0, 1-\beta_{y_i}^\top X_i+\beta_c^\top X_i\} + \frac{\alpha}{2}\sum_{c=1}^k||\beta_c||^2$$

We can write $\sum_{c=1}^k||\beta_c||^2$ using the Frobenius norm.

$$||\beta||_F = \sqrt{\sum_{c=1}^k\sum_{j=1}^d\beta_{jc}^2}$$
$$||\beta||_F^2 = \sum_{c=1}^k\sum_{j=1}^d\beta_{jc}^2 = \sum_{c=1}^k||\beta_c||^2$$

So the Multi-Class SVM written using the Frobenius norm becomes

$$f(\beta) = \sum_{i=1}^n\sum_{c\neq y_i} \max\{0, 1-\beta_{y_i}^\top X_i+\beta_c^\top X_i\} + \frac{\alpha}{2}||\beta||_F^2$$
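A minimal sketch of the (sum-version) multi-class SVM objective; the data, the number of classes, and `alpha` are made up for the example.

```python
import numpy as np

def multiclass_svm_loss(beta, X, y, alpha):
    scores = X @ beta                              # n x k matrix of beta_c^T X_i
    correct = scores[np.arange(len(y)), y]         # score of each sample's true class
    margins = np.maximum(0.0, 1.0 - correct[:, None] + scores)
    margins[np.arange(len(y)), y] = 0.0            # drop the c == y_i terms
    return np.sum(margins) + 0.5 * alpha * np.sum(beta ** 2)   # alpha/2 * ||beta||_F^2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # n = 5 samples, d = 3 features
y = np.array([0, 2, 1, 0, 2])      # true classes among k = 3
beta = rng.normal(size=(3, 3))     # d x k
print(multiclass_svm_loss(beta, X, y, alpha=0.1))
```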

Multi-Class Logistic Regression and Softmax Loss

The degenerate constraint in the multi-class case and its approximation can be written as

$$\beta_{y_i}^\top X_i\geq \underset{c}{\max}\{\beta_c^\top X_i\}$$
$$\beta_{y_i}^\top X_i\geq \log\left(\sum_{c=1}^k \exp(\beta_c^\top X_i)\right)$$

The loss obtained from the smoothed approximation of the degenerate max constraint in the multi-class case is the softmax loss. When $k = 2$, it is equivalent to the binary logistic loss.

$$f(\beta) = \sum_{i=1}^n\left[-\beta_{y_i}^\top X_i + \log\left(\sum_{c=1}^k \exp(\beta_c^\top X_i)\right)\right] + \frac{\alpha}{2}\sum_{c=1}^k||\beta_c||^2$$

Also, the softmax function gives us the class probabilities

$$P(y_i = c \mid \beta, X_i) = \frac{\exp(\beta_{c}^\top X_i)}{\sum_{c'=1}^k \exp(\beta_{c'}^\top X_i)}$$
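A minimal sketch of the softmax probabilities for a single sample (with made-up $\beta$ and $X_i$); subtracting the max score before exponentiating avoids overflow without changing the result.

```python
import numpy as np

def softmax_probs(beta, x_i):
    scores = beta.T @ x_i              # beta_c^T x_i for each class c
    scores = scores - np.max(scores)   # numerical stabilization
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

rng = np.random.default_rng(0)
beta = rng.normal(size=(3, 4))         # d = 3 features, k = 4 classes
x_i = rng.normal(size=3)
p = softmax_probs(beta, x_i)
print(p, p.sum())                      # class probabilities, summing to 1
```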