Let us assume that a certain outcome $y$ is a linear combination of the features $X_1, \dots, X_d$ and some unobservable error $U$.
$$y = X\beta + U, \qquad y \in \mathbb{R}^{n \times 1},\ X \in \mathbb{R}^{n \times d},\ \beta \in \mathbb{R}^{d \times 1},\ U \in \mathbb{R}^{n \times 1}$$
OLS Estimator
We want to minimize the objective function with respect to β.
$$f(\beta) = \frac{1}{2}\|y - X\beta\|^2 = \frac{1}{2}(y - X\beta)^\top(y - X\beta)$$
Minimizing $f(\beta)$ gives the OLS estimator $\hat\beta$. Note that $f(\beta)$ is convex, so a global minimum exists. However, we must assume that $X$ has full column rank (i.e., no multicollinearity); otherwise the solution is not unique.
$$\hat\beta \in \arg\min_\beta\left\{\frac{1}{2}(y - X\beta)^\top(y - X\beta)\right\} = (X^\top X)^{-1}X^\top y$$
If we have the matrix $X$ and the vector $y$, the OLS estimator is easy to compute. The overall computational cost is $O(nd^2 + d^3)$ because:
Computing $X^\top y$ is $O(nd)$.
Computing $X^\top X$ is $O(nd^2)$.
Solving $\hat\beta = (X^\top X)^{-1}X^\top y$ is $O(d^3)$.
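As a minimal numpy sketch (not part of the original notes), the closed-form estimator can be computed by solving the normal equations rather than explicitly inverting $X^\top X$:

```python
import numpy as np

def ols_estimate(X, y):
    """Closed-form OLS via the normal equations (X^T X) beta = X^T y."""
    XtX = X.T @ X                      # O(n d^2)
    Xty = X.T @ y                      # O(n d)
    return np.linalg.solve(XtX, Xty)   # O(d^3), avoids forming the explicit inverse
```

Using `np.linalg.solve` instead of `np.linalg.inv` is numerically safer but has the same $O(d^3)$ cost.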
Robust Regression
$$f(\beta) = \|y - X\beta\|_1 = \sum_{i=1}^n |y_i - X_i\beta|$$
where $X_i$ denotes the $i$-th row of $X$.
The sum of squared errors is sensitive to outliers, while the L1 norm of the errors is much less influenced by them. The idea is that squaring a large residual inflates its contribution to the objective far more than taking its absolute value does.
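The notes do not prescribe a solver, but as a hedged sketch the L1 objective can be handed to a general-purpose optimizer (the function names here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def l1_objective(beta, X, y):
    # f(beta) = ||y - X beta||_1, the sum of absolute residuals
    return np.sum(np.abs(y - X @ beta))

def robust_fit(X, y):
    beta0 = np.zeros(X.shape[1])
    # Powell is derivative-free, so it tolerates the non-smooth L1 objective
    return minimize(l1_objective, beta0, args=(X, y), method="Powell").x
```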
Brittle Regression
$$f(\beta) = \|y - X\beta\|_\infty = \max_i\{|y_i - X_i\beta|\}$$
$$\max_i\{|y_i - X_i\beta|\} \approx \log\left(\sum_{i=1}^n \exp(y_i - X_i\beta)\right)$$
Brittle regression is very sensitive to outliers, since the objective is dominated by the single largest error. Therefore, this is the objective we would minimize if we want the fitted line to cater to the outliers.
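A small sketch of the smooth approximation, using `scipy.special.logsumexp` for numerical stability. Note that including both $+r_i$ and $-r_i$ in the log-sum-exp (an addition beyond the formula above) is what accounts for the absolute value in the max:

```python
import numpy as np
from scipy.special import logsumexp

def brittle_objective(beta, X, y):
    r = y - X @ beta
    # Smooth approximation of max_i |r_i|: log of the sum of exp(r_i) and exp(-r_i)
    return logsumexp(np.concatenate([r, -r]))
```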
Lasso
$$f(\beta) = \frac{1}{2}\|y - X\beta\|^2 + \alpha\|\beta\|_1$$
Here $\|\beta\|_1 = \sum_{j=1}^d |\beta_j|$ is the L1 norm (the sum of absolute values) and $\alpha$ is the regularization parameter. The objective now combines the squared error with the size of the $\beta_j$'s: the larger each $|\beta_j|$ is, the larger the penalty on the objective. So to minimize the objective, we want to shrink some of the $\beta_j$'s.
$$\hat\beta_{\text{Lasso}} = \arg\min_\beta\left\{\frac{1}{2}\|y - X\beta\|^2 + \alpha\|\beta\|_1\right\}$$
Note that the L1 norm is not differentiable wherever some $\beta_j = 0$ (it is convex but not smooth), so unlike OLS there is no closed-form solution for $\hat\beta_{\text{Lasso}}$ in general; it is typically computed with methods such as coordinate descent or proximal gradient.
Minimizing the objective function shrinks some coefficients exactly to zero, i.e., $\hat\beta_j = 0$ for some $j$'s. This in turn is useful for feature selection. The higher $\alpha$ is, the more coefficients are driven to zero and the more features are dropped.
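As an illustrative scikit-learn sketch (note that sklearn's `Lasso` scales the squared-error term as $\frac{1}{2n}\|y - X\beta\|^2 + \alpha\|\beta\|_1$, slightly differently from the objective above), with made-up data in which only two features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)  # only features 0 and 1 matter

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # most coefficients come out exactly 0 -> feature selection
```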
Ridge
$$f(\beta) = \frac{1}{2}\|y - X\beta\|^2 + \frac{\alpha}{2}\|\beta\|^2$$
Instead of the L1 norm used in lasso, the penalty term uses the (squared) L2 norm of $\beta$.
Note that $X^\top X + \alpha I$ is always invertible for $\alpha > 0$, so the closed-form solution $\hat\beta_{\text{Ridge}} = (X^\top X + \alpha I)^{-1}X^\top y$ always exists and ridge regression can be a solution for multicollinear data.
Ridge regression makes the model less prone to overfitting: increasing $\alpha$ reduces variance at the cost of extra bias, so a well-chosen $\alpha$ can lower the test error (though an overly large $\alpha$ will underfit).
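A minimal numpy sketch of the closed-form ridge estimator implied above (assuming $\alpha > 0$):

```python
import numpy as np

def ridge_estimate(X, y, alpha):
    """beta = (X^T X + alpha I)^{-1} X^T y; the system is always solvable for alpha > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
```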
Binary Classification
Now what if $y \in \{0, 1\}$? The linear models above are a poor fit because, for certain values of $X$, the prediction $\hat y = X\hat\beta$ can be far greater than 1 or far less than 0 (unless all coefficients in $\hat\beta$ are 0).
Hinge Loss
Let us assume that instead of $y \in \{0, 1\}$ we have $y \in \{-1, 1\}$: we assign $y_i = 1$ if the example is in the positive class and $y_i = -1$ if it is not. As discussed above, we do not want to use OLS for this binary classification problem.
What we want is the 0-1 loss: if the prediction is correct we add 0 penalty, and if it is wrong we add 1. However, this is hard to minimize because it is not convex (and is piecewise constant in $\beta$), so we use the hinge loss as a convex approximation of the 0-1 loss.
$$f(\beta) = \sum_{i=1}^n \max\{0, 1 - y_i\beta^\top X_i\}$$
The idea is that if $y_i > 0$ and $\beta^\top X_i > 0$, then we made the right prediction. If $y_i < 0$ and $\beta^\top X_i < 0$, we also made the right prediction. Therefore, if $y_i\beta^\top X_i > 0$, we have the right prediction (i.e., the loss is 0).
$$\text{error}_i = \begin{cases} 0 & \text{if } y_i\beta^\top X_i > 0 \\ -y_i\beta^\top X_i & \text{if } y_i\beta^\top X_i < 0 \end{cases}$$
However, the loss function would then be $f(\beta) = \sum_{i=1}^n \max\{0, -y_i\beta^\top X_i\}$, and its minimizer is simply the degenerate solution $\beta = 0$. Therefore, instead of letting $y_i\beta^\top X_i > 0$ imply that $i$ is correctly classified, consider $y_i\beta^\top X_i \ge 1$. This is the non-degenerate form, and the new error becomes
$$\text{error}_i = \begin{cases} 0 & \text{if } y_i\beta^\top X_i \ge 1 \\ 1 - y_i\beta^\top X_i & \text{otherwise} \end{cases} = \max\{0, 1 - y_i\beta^\top X_i\}$$
Summing these new error terms over all samples gives the hinge loss function above.
SVM
SVM is the hinge loss with L2 regularization:
$$f(\beta) = \sum_{i=1}^n \max\{0, 1 - y_i\beta^\top X_i\} + \frac{\alpha}{2}\|\beta\|^2$$
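A hedged numpy sketch of this objective (labels assumed to be in $\{-1, +1\}$); in practice one would minimize it with (sub)gradient methods or use an off-the-shelf SVM solver:

```python
import numpy as np

def svm_objective(beta, X, y, alpha):
    # Hinge loss with L2 regularization; y_i in {-1, +1}
    margins = y * (X @ beta)
    hinge = np.maximum(0.0, 1.0 - margins).sum()
    return hinge + 0.5 * alpha * (beta @ beta)
```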
Logistic Loss
Notice that $\max\{0, -y_i\beta^\top X_i\} \approx \log(\exp(0) + \exp(-y_i\beta^\top X_i)) = \log(1 + \exp(-y_i\beta^\top X_i))$.
$$f(\beta) = \sum_{i=1}^n \log(1 + \exp(-y_i\beta^\top X_i))$$
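A small sketch of the logistic loss; `np.logaddexp(0, z)` computes $\log(1 + e^{z})$ in a numerically stable way:

```python
import numpy as np

def logistic_loss(beta, X, y):
    # y_i in {-1, +1}; sum over samples of log(1 + exp(-y_i * beta^T x_i))
    return np.sum(np.logaddexp(0.0, -y * (X @ beta)))
```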
Sigmoid
Let $z_i = \beta^\top X_i$. Then to get the probability that $y_i = 1$, we can use the sigmoid function.
$$h(z_i) = \frac{1}{1 + \exp(-z_i)}$$
$$P(y_i = 1 \mid \beta, X_i) = \frac{1}{1 + \exp(-\beta^\top X_i)}$$
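A one-line sketch of turning scores into probabilities with the sigmoid (assuming a fitted `beta`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# p[i] is the modelled probability that y_i = 1 given x_i and a fitted beta:
# p = sigmoid(X @ beta)
```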
Multi-Class Linear Classifiers (Multi-Class SVM)
Now $\beta$ is $d \times k$, where $k$ is the number of classes we want to classify. $\beta_c$ is the $c$-th column of the $\beta$ matrix, and $\beta_c^\top X_i$ gives the prediction score for class $c$.
$$\hat y_i = \arg\max_c\{\beta_c^\top X_i\}$$
Going back to the hinge loss, we avoided the trivial solution by aiming for $y_i\beta^\top X_i \ge 1$, the non-degenerate form.
Let $y_i$ be the true class for sample $i$. We want $\beta_{y_i}^\top X_i > \beta_c^\top X_i$ for all $c \ne y_i$, but to avoid the degenerate solution we use $\beta_{y_i}^\top X_i \ge \beta_c^\top X_i + 1$.
There are two possible losses for multi-class linear classifiers:
$$\sum_{c \ne y_i} \max\{0, 1 - \beta_{y_i}^\top X_i + \beta_c^\top X_i\}$$
$$\max_{c \ne y_i}\left\{\max\{0, 1 - \beta_{y_i}^\top X_i + \beta_c^\top X_i\}\right\}$$
The sum penalizes every class $c$ that violates the constraint for sample $i$, while the max penalizes only the single class $c$ that violates the constraint the most.
To get the multi-class SVM, sum the chosen loss over all samples and add L2 regularization.
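A sketch of the sum version of the multi-class SVM objective (the max version would take a row-wise max instead of summing over classes); the integer label encoding $y_i \in \{0, \dots, k-1\}$ is an assumption of this sketch:

```python
import numpy as np

def multiclass_svm_objective(B, X, y, alpha):
    """Sum-over-classes multi-class hinge loss with L2 regularization.
    B: (d, k) weights, X: (n, d) features, y: (n,) integer labels in {0, ..., k-1}."""
    scores = X @ B                                   # (n, k) class scores
    n = X.shape[0]
    true_scores = scores[np.arange(n), y][:, None]   # score of each sample's true class
    margins = np.maximum(0.0, 1.0 - true_scores + scores)
    margins[np.arange(n), y] = 0.0                   # exclude c = y_i from the penalty
    return margins.sum() + 0.5 * alpha * np.sum(B ** 2)
```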
The degenerate constraint in the multi-class case and its approximation can be written as
$$\beta_{y_i}^\top X_i \ge \max_c\{\beta_c^\top X_i\}$$
$$\beta_{y_i}^\top X_i \ge \log\left(\sum_{c=1}^k \exp(\beta_c^\top X_i)\right)$$
The loss function obtained from this smoothed approximation of the degenerate max constraint is the softmax loss,
$$f(\beta) = \sum_{i=1}^n\left[\log\left(\sum_{c=1}^k \exp(\beta_c^\top X_i)\right) - \beta_{y_i}^\top X_i\right].$$
When $k = 2$, it is equivalent to the binary logistic loss.
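A small sketch of the softmax loss written above, using `scipy.special.logsumexp` for numerical stability (same assumed label encoding as the multi-class SVM sketch):

```python
import numpy as np
from scipy.special import logsumexp

def softmax_loss(B, X, y):
    """Softmax (multinomial logistic) loss for weights B of shape (d, k)."""
    scores = X @ B                                    # (n, k) class scores
    n = X.shape[0]
    # log-sum-exp of all class scores minus the score of the true class, summed over samples
    return np.sum(logsumexp(scores, axis=1) - scores[np.arange(n), y])
```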