Optimization
May 07, 2024
Gradient Descent
Pseudo Code
- Start with an initial point for $w$. Let this be $w^0$.
- Compute the gradient at $w^t$ to generate a better guess $w^{t+1} = w^t - \alpha \nabla f(w^t)$, where $\alpha$ is the step size.
- Repeat this until $\nabla f(w^{t+1}) = 0$, where $w^{t+1}$ is a stationary point, or until $\|\nabla f(w^{t+1})\| \le \epsilon$, where $\epsilon$ is some small scalar.
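As a concrete sketch, the steps above can be written out for the least-squares objective $f(w) = \frac{1}{2}\|Xw - y\|^2$ (the data, step size, and tolerance below are illustrative choices):

```python
import numpy as np

# Illustrative data: y is generated from a known w so we can sanity-check the result.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def gradient(w):
    # Gradient of f(w) = 0.5 * ||Xw - y||^2
    return X.T @ (X @ w - y)

w = np.zeros(3)      # initial guess w^0
alpha = 1e-3         # step size (assumed small enough for convergence)
epsilon = 1e-6       # stopping tolerance on the gradient norm

while np.linalg.norm(gradient(w)) > epsilon:
    w = w - alpha * gradient(w)
```

After the loop, `w` is (approximately) a stationary point of $f$.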
To get the gradient at each iteration we have to solve $\nabla f(w) = X^\top (Xw - y)$.
- Solving $Xw$ is $O(nd)$
- Solving $Xw - y$ is $O(n)$
- Solving $X^\top (Xw - y)$ is $O(nd)$
Therefore, the cost of gradient descent at each iteration is $O(nd)$. So if we have $t$ total iterations, the total cost of gradient descent is $O(ndt)$.
The time complexity of solving for the OLS estimator directly is $O(nd^2 + d^3)$. So if $d$ is big, gradient descent may be a better alternative. Note that as $n$ or $d$ grows, the per-iteration cost $O(nd)$ grows, so $t$ must stay smaller for gradient descent to remain the cheaper option.
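As a rough illustration of this trade-off, the two complexity estimates can be compared directly (the values of $n$ and $d$ below are made up, constants are ignored, and these are operation counts, not timings):

```python
# Illustrative sizes, not from the notes.
n, d = 1_000_000, 500

ols_cost = n * d**2 + d**3   # direct OLS solve: O(nd^2 + d^3)
gd_cost_per_iter = n * d     # one gradient evaluation: O(nd)

# Gradient descent is cheaper as long as t stays below this many iterations.
t_break_even = ols_cost // gd_cost_per_iter
```

For these sizes the break-even point is roughly $t \approx d$ iterations.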
Stochastic Gradient Descent (SGD)
Since the cost of gradient descent at each iteration is $O(nd)$, the computation at each iteration becomes expensive as $n$ grows.
Calculating the gradient using all $n$ samples gives the same result as averaging the per-sample gradients: if $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, then $\nabla f(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w)$.
Therefore, if we draw a sub-sample $i$ randomly at each iteration, we should get the same result as gradient descent in expectation; the stochastic gradient at each iteration points in the same direction as the actual gradient on average.
- If the variance is $0$, every step goes in the right direction.
- If the variance is small, most steps point in the right direction.
- If the variance is large, many steps will point in the wrong direction; this is the case when $w^t$ is close to the global minimizer.
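The claim that per-sample gradients average to the full gradient can be checked numerically for the least-squares loss, where $f_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$ (the data below is random and purely illustrative):

```python
import numpy as np

# Random illustrative data; w is an arbitrary point, not a minimizer.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
w = rng.standard_normal(4)

# Gradient of the average loss f(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2
full_grad = X.T @ (X @ w - y) / len(y)

# Gradient of each individual f_i, then averaged
per_sample = [(x @ w - yi) * x for x, yi in zip(X, y)]
avg_grad = np.mean(per_sample, axis=0)
```

The two quantities agree, which is why a uniformly drawn per-sample gradient is an unbiased estimate of the full gradient.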
As the step size $\alpha$ increases, the variance increases as well. So stochastic gradient descent can be improved by
- having $\alpha$ be big when $w^t$ is far from the global minimizer
- having $\alpha$ be small when $w^t$ is close to the global minimizer
Let $\alpha^t$ be the step size at the $t$-th iteration.
Stochastic gradient descent converges to a stationary point if $\sum_{t=1}^{\infty} \alpha^t = \infty$ and $\sum_{t=1}^{\infty} (\alpha^t)^2 < \infty$.
A solution for decreasing step sizes would be $\alpha^t = O(1/t)$. If we don't really care about getting the exact solution, we can just set $\alpha^t = \alpha$, a small constant, for all $t$.
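A minimal sketch of stochastic gradient descent with a decreasing schedule $\alpha^t = O(1/t)$ follows (the problem, the constant in the schedule, and the iteration count are all illustrative choices):

```python
import numpy as np

# Noiseless least-squares problem with a known solution, for illustration.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
y = X @ np.array([2.0, 0.0, -1.0])
w = np.zeros(3)

for t in range(1, 20_001):
    i = rng.integers(len(y))            # draw one sample uniformly at random
    grad_i = (X[i] @ w - y[i]) * X[i]   # stochastic gradient from that sample
    alpha_t = 0.5 / t                   # satisfies sum alpha = inf, sum alpha^2 < inf
    w = w - alpha_t * grad_i
```

With this schedule the iterates drift toward the minimizer even though individual steps are noisy.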
In stochastic gradient descent we are not guaranteed to reach a gradient that exactly equals $0$. So in practice we can measure the validation-set error every few iterations and stop the algorithm when the validation-set error stops improving.
There are also multiple variants of choosing $\alpha^t$ and forming the gradient estimate, such as Adam and mini-batch methods.
Automatic Differentiation (AD)
Consider a function whose input is a function and whose output is its derivative. This is Automatic Differentiation (AD), which is used when running gradient descent.
In deeper neural networks, computing the derivatives by hand can get quite tedious, so instead of computing the gradient by hand we use AD.
The idea behind AD is the chain rule from calculus.
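As a toy illustration of this idea, forward-mode AD can be built from dual numbers: each value carries its derivative along with it, and the arithmetic operators apply the chain rule automatically (this is a sketch of the principle, not how production AD systems are implemented):

```python
import math

class Dual:
    """A value paired with its derivative with respect to the input."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

def sin(x):
    # chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)

# Derivative of f(x) = x * sin(x) + x at x = 2
x = Dual(2.0, 1.0)    # seed dx/dx = 1
f = x * sin(x) + x
```

After running this, `f.der` holds $f'(2) = \sin 2 + 2\cos 2 + 1$ without any symbolic manipulation.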
Back Propagation
A dynamic programming approach to evaluating the chain rule is back propagation. There are 2 steps: a forward pass and a backward pass.
- For each hidden layer, compute and store all forward-pass values (the intermediate activations).
- Compute the backward pass using the chain rule, storing all the intermediate gradients starting from the output layer and working all the way back to the first layer.
Backpropagation requires significant storage capacity to hold all intermediate results during the computation. In exchange, the computational cost of calculating the gradient is of the same order as the cost of computing the function itself.
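The two passes can be sketched for a two-layer network with loss $\frac{1}{2}\|W_2\,\sigma(W_1 x) - y\|^2$ (the shapes and the choice $\sigma = \tanh$ are illustrative):

```python
import numpy as np

# Illustrative two-layer network: 4 inputs, 5 hidden units, 2 outputs.
rng = np.random.default_rng(3)
x = rng.standard_normal(4)
y = rng.standard_normal(2)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((2, 5))
sigma = np.tanh

# Forward pass: compute and store every intermediate value.
z1 = W1 @ x       # pre-activation
a1 = sigma(z1)    # hidden activation (stored for the backward pass)
out = W2 @ a1
r = out - y       # residual; loss = 0.5 * ||r||^2

# Backward pass: apply the chain rule from the output back toward the input,
# reusing the stored intermediates instead of recomputing them.
grad_W2 = np.outer(r, a1)
delta1 = (W2.T @ r) * (1 - a1**2)   # tanh'(z) = 1 - tanh(z)^2
grad_W1 = np.outer(delta1, x)
```

Note that the backward pass reuses `a1`, which is exactly the storage cost mentioned above.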
Evolutionary Strategy
The CMA-ES algorithm works well for minimizing the following types of functions:
- non-linear, non-quadratic, non-convex
- non-smooth, discontinuous, multi-modal
- high dimensional
Stochastic Search
The idea of evolutionary-strategy algorithms for minimizing a function $f$, starting from an initial parameter $\theta^0$:
- Draw $n$ samples $z_1, \dots, z_n$ from the search distribution $p(z \mid \theta^t)$.
- Evaluate each sample $z_i$ on $f$.
- Update the parameter $\theta$ depending on $f(z_1), \dots, f(z_n)$.
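These three steps can be sketched with an isotropic Gaussian search distribution whose mean is $\theta$ (full CMA-ES also adapts the covariance and step size; the objective, sample count, and annealing below are illustrative simplifications):

```python
import numpy as np

def f(z):
    # Toy objective with minimizer at (3, 3); no gradients are used anywhere.
    return np.sum((z - 3.0) ** 2)

rng = np.random.default_rng(4)
theta = np.zeros(2)   # initial parameter theta^0 (the distribution mean)
sigma = 1.0           # spread of the search distribution

for _ in range(100):
    samples = theta + sigma * rng.standard_normal((20, 2))  # draw n samples z_i
    scores = np.array([f(z) for z in samples])              # evaluate each z_i on f
    elite = samples[np.argsort(scores)[:5]]                 # keep the best few
    theta = elite.mean(axis=0)   # update theta from the evaluations
    sigma *= 0.95                # shrink the spread (CMA-ES adapts this automatically)
```

Because only function values are used, this works even when $f$ is non-smooth or discontinuous, which is exactly the setting listed above.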