Optimization

May 07, 2024

Gradient Descent

Pseudo Code

  • Start with an initial point for $\beta$. Let this be $\beta^0$.
  • Compute the gradient at $\beta^0$ to generate a better guess $\beta^1 = \beta^0 - \alpha\nabla_\beta f(\beta^0)$, where $\alpha$ is the step size.
  • Repeat until we reach a $\beta^t$ where $\nabla_\beta f(\beta^t) = 0$, or until $\|\nabla_\beta f(\beta^t)\| \leq \epsilon$, where $\epsilon$ is some small scalar.
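A minimal sketch of this loop in Python with NumPy, assuming a gradient function `grad_f` is available (the names here are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, beta0, alpha=0.01, eps=1e-6, max_iter=10_000):
    """Run gradient descent from beta0 until the gradient norm falls below eps."""
    beta = beta0
    for _ in range(max_iter):
        g = grad_f(beta)
        if np.linalg.norm(g) <= eps:   # stopping condition ||grad|| <= eps
            break
        beta = beta - alpha * g        # beta^{t+1} = beta^t - alpha * grad
    return beta
```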

To get the gradient at each iteration for least squares, $f(\beta) = \frac{1}{2}\|X\beta - y\|^2$, we have to compute

$$\nabla_\beta f(\beta^t) = X^\top (X\beta^t - y)$$

  • Computing $X\beta^t$ is $O(nd)$
  • Computing $X\beta^t - y$ is $O(n)$
  • Computing $X^\top (X\beta^t - y)$ is $O(nd)$

Therefore, the cost of gradient descent at each iteration is $O(nd)$. So if we run $t$ total iterations, the total cost of gradient descent is $O(ndt)$.
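The per-iteration gradient can be written in NumPy as below, with the cost of each step marked in the comments; this is the `grad_f` that would be plugged into the sketch above:

```python
import numpy as np

def ls_gradient(X, beta, y):
    """Gradient of f(beta) = 0.5 * ||X beta - y||^2."""
    r = X @ beta - y        # X beta is O(nd); subtracting y is O(n)
    return X.T @ r          # X^T r is O(nd), so the whole gradient is O(nd)
```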

The time complexity of solving for the OLS estimator in closed form is $O(nd^2 + d^3)$. So if $d$ is big, gradient descent may be a better alternative. Note that as $n$ or $d$ grows, $\alpha$ must be smaller for gradient descent to work.
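For comparison, a sketch of the closed-form solve: forming $X^\top X$ costs $O(nd^2)$ and solving the resulting $d \times d$ system costs $O(d^3)$.

```python
import numpy as np

def ols_closed_form(X, y):
    """Solve the normal equations X^T X beta = X^T y directly."""
    return np.linalg.solve(X.T @ X, X.T @ y)  # O(nd^2) to form, O(d^3) to solve
```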

Stochastic Gradient Descent (SGD)

Since the cost of gradient descent is $O(nd)$ per iteration, each iteration becomes expensive as $n$ grows.

Write the objective as a sum over the $n$ samples; rescaling by $\frac{1}{n}$ does not change the minimizer, so we can equivalently minimize the average:

$$f(\beta) = \sum_{i=1}^n f_i(\beta) \quad\text{or, equivalently,}\quad f(\beta) = \frac{1}{n}\sum_{i=1}^n f_i(\beta)$$

If at each iteration we draw an index $i$ uniformly at random and use $\nabla_\beta f_i(\beta^t)$ in place of the full gradient, the estimate is unbiased:

$$\begin{align*} E\left(\nabla_\beta f_i(\beta^t)\right) & = \sum_{i=1}^n P(i)\,\nabla_\beta f_i(\beta^t) \\ & = \sum_{i=1}^n \frac{1}{n}\nabla_\beta f_i(\beta^t) \\ & = \frac{1}{n}\sum_{i=1}^n \nabla_\beta f_i(\beta^t) \\ & = \nabla_\beta f(\beta^t) \end{align*}$$

Therefore, if we draw sub-samples randomly at each iteration we should, on average, move in the same direction as gradient descent: the stochastic gradient is an unbiased estimate of the actual gradient.
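A quick numerical sanity check of this unbiasedness under the least-squares setup (an illustrative sketch; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
beta = rng.normal(size=d)

# Full gradient of f(beta) = (1/n) * sum_i 0.5 * (x_i^T beta - y_i)^2
full_grad = X.T @ (X @ beta - y) / n

# Average many single-sample gradients grad f_i(beta) = x_i (x_i^T beta - y_i)
idx = rng.integers(0, n, size=200_000)
single_grads = X[idx] * (X[idx] @ beta - y[idx])[:, None]
print(np.abs(single_grads.mean(axis=0) - full_grad).max())  # close to 0
```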

$$\begin{align*} Var\left(\nabla_\beta f_i(\beta^t)\right) & = E\left[\left\|\nabla_\beta f_i(\beta^t) - E\left(\nabla_\beta f_i(\beta^t)\right)\right\|^2\right] \\ & = \frac{1}{n}\sum_{i=1}^n\left\|\nabla_\beta f_i(\beta^t) - \nabla_\beta f(\beta^t)\right\|^2 \end{align*}$$
  • If the variance is $0$, every step goes in the right direction.
  • If the variance is small, most steps point in the right direction.
  • If the variance is large, many steps will point in the wrong direction. This happens when $\beta^t$ is close to the global minimizer: the full gradient $\nabla_\beta f(\beta^t)$ shrinks toward zero while the individual $\nabla_\beta f_i(\beta^t)$ generally do not.

As the step size $\alpha$ increases, the effect of this variance on the iterates increases as well. So stochastic gradient descent can be improved by:

  • having $\alpha$ be big when $\beta^t$ is far from the global minimizer, and
  • having $\alpha$ be small when $\beta^t$ is close to the global minimizer.

Let $\alpha^t$ be the step size at the $t$th iteration.

Stochastic gradient descent converges to a stationary point if

$$\frac{\sum_{t=1}^\infty (\alpha^t)^2}{\sum_{t=1}^\infty \alpha^t} = 0$$

One decreasing schedule that satisfies this condition is $\alpha^t = O\left(\frac{1}{\sqrt{t}}\right)$. If we don't really care about getting the exact solution, we can simply use a constant step size $\alpha^t = \alpha$ for all $t$.
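A sketch of SGD with this decaying schedule, assuming a per-sample gradient function `grad_fi(beta, i)` is available (an illustrative name):

```python
import numpy as np

def sgd(grad_fi, beta0, n, alpha0=0.1, iters=10_000, seed=0):
    """SGD with step size alpha^t = alpha0 / sqrt(t)."""
    rng = np.random.default_rng(seed)
    beta = beta0
    for t in range(1, iters + 1):
        i = rng.integers(n)                            # draw one sample uniformly
        beta = beta - (alpha0 / np.sqrt(t)) * grad_fi(beta, i)
    return beta
```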

In stochastic gradient descent we are not guaranteed to ever reach a gradient that is exactly $0$. So in practice we can measure the validation set error every few iterations and stop the algorithm when the validation error is no longer improving.
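A sketch of this early-stopping heuristic; `sgd_epoch` and `validation_error` are hypothetical helpers standing in for a pass of SGD updates and a held-out error measurement:

```python
def train_with_early_stopping(beta, patience=5, max_epochs=100):
    """Stop when validation error hasn't improved for `patience` checks."""
    best_err, bad_checks = float("inf"), 0
    for epoch in range(max_epochs):
        beta = sgd_epoch(beta)          # hypothetical: one pass of SGD updates
        err = validation_error(beta)    # hypothetical: error on a held-out set
        if err < best_err:
            best_err, bad_checks = err, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:  # no improvement for `patience` checks
                break
    return beta
```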

There are also many practical refinements, such as adaptive step-size methods like Adam, mini-batching, and normalization techniques like batch norm.

Automatic Differentiation (AD)

Consider a procedure whose input is a function and whose output is that function's derivative. This is Automatic Differentiation (AD), which is what is used in practice when running gradient descent.

In deeper neural networks, computing the derivatives by hand gets quite tedious, so instead of deriving the gradient manually we use AD.

The idea behind AD is to apply the chain rule from calculus, decomposing a complicated function into elementary operations whose derivatives are known.
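To see the chain rule doing the work, here is a tiny forward-mode AD in pure Python using dual numbers (a pedagogical sketch, not how production AD systems are built):

```python
import math

class Dual:
    """Dual number val + dot*eps with eps^2 = 0; dot carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def sin(x):
    # chain rule: d sin(u) = cos(u) * du
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def derivative(f, x):
    """d/dx f at x: seed the dual part with 1 and read it off the output."""
    return f(Dual(x, 1.0)).dot

# f(x) = x * sin(x), so f'(x) = sin(x) + x * cos(x)
print(derivative(lambda x: x * sin(x), 2.0))
print(math.sin(2.0) + 2.0 * math.cos(2.0))  # matches
```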

Back Propagation

A dynamic programming approach to evaluating the chain rule is back propagation. There are 2 steps: a forward pass and a backward pass.

  • Forward pass: for each hidden layer, compute and store the intermediate quantities $W_1x,\ W_2h(W_1x),\ \dots,\ v^\top h(\dots)$.
  • Backward pass: compute the gradients using the chain rule, storing all intermediate gradients, starting from $\nabla_{v} f$ and working backwards until $\nabla_{W_1} f$.

Backpropagation requires significant storage to hold all intermediate results during the computation. Additionally, the computational cost of calculating the gradient is of the same order as the cost of computing the function itself.
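A manual sketch of the two passes for a one-hidden-layer network $f(x) = v^\top h(W_1 x)$ with $h = \tanh$ (illustrative; a deeper network with $W_2$ etc. repeats the same pattern layer by layer):

```python
import numpy as np

def forward_backward(W1, v, x):
    """Forward pass stores intermediates; backward pass reuses them via the chain rule."""
    # Forward pass: store z and h, they are needed again on the way back.
    z = W1 @ x              # pre-activation
    h = np.tanh(z)          # hidden activation
    f = v @ h               # scalar output

    # Backward pass: chain rule from the output back to W1.
    grad_v = h                          # df/dv = h
    grad_z = v * (1 - h**2)             # df/dz = v * h'(z), tanh' = 1 - tanh^2
    grad_W1 = np.outer(grad_z, x)       # df/dW1 = grad_z x^T
    return f, grad_v, grad_W1
```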

Evolutionary Strategy

The CMA-ES (Covariance Matrix Adaptation Evolution Strategy) algorithm works well for minimizing the following types of functions:

  • non-linear, non-quadratic, non-convex
  • non-smooth, discontinuous, multi-modal
  • high dimensional

Stochastic Search

The idea of evolutionary strategy algorithms for minimizing a function $f: \mathbb{R}^n \to \mathbb{R}$, starting from an initial parameter $\theta$:

  1. Draw $\lambda$ samples from the distribution $P(x\mid \theta)$: $x_1, \dots, x_\lambda$.
  2. Evaluate each of the $\lambda$ samples on $f$.
  3. Update the parameter: $\theta \leftarrow F_\theta(\theta, x_1,\dots, x_\lambda, f(x_1),\dots,f(x_\lambda))$.
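A minimal instance of this template with a Gaussian search distribution $P(x \mid \theta) = N(\theta, \sigma^2 I)$, updating $\theta$ to the mean of the best samples (a toy sketch, far simpler than CMA-ES, which also adapts the covariance):

```python
import numpy as np

def stochastic_search(f, theta, sigma=0.5, lam=50, mu=10, iters=200, seed=0):
    """Evolutionary-strategy-style loop: sample, evaluate, move theta toward the best."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        xs = theta + sigma * rng.normal(size=(lam, theta.size))  # 1. draw lambda samples
        scores = np.array([f(x) for x in xs])                    # 2. evaluate each on f
        elite = xs[np.argsort(scores)[:mu]]                      # keep the mu best
        theta = elite.mean(axis=0)                               # 3. update theta
    return theta

# Example: minimize a shifted quadratic; theta ends up near [1, -2].
print(stochastic_search(lambda x: np.sum((x - np.array([1.0, -2.0]))**2),
                        theta=np.zeros(2)))
```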