Gradient Descent
Topic
Tags
Gradient Descent (GD) is an algorithm that is often used in optimization applications. While performing gradient descent a minimum is seeked of a function that describes the cost of a process, the so-called loss/cost function.
A loss function is a function that describes the so-called loss of a process such as the cost of computation time of distance. So, by finding a minimum of these functions a minimum amount of loss is realized at that point. Therefore, gradient descent leads to the desired optimality.
Figure 1. Visualization of the gradient descent algorithm
In words, this algorithm is applied by taking derivatives of the loss function. At each current position on the graph, the gradient is computed and a step is taken in the negative direction of the largest gradient. Since the derivative describes the rate of change of a function. A large gradient equals a large difference between two adjacent points.
After the first step is taken, the gradient is computed again and the step is similarly taken towards the negative direction of the largest gradient. This is an iterative process where eventually a minimum is reached as can be seen in the figure above this section
Math
In formulas, the gradient descent algorithm can be described as
with being the value at the current time-step , the learning rate, and the gradient of the function at .
Since, this algorithm is an iterative process it needs to make iterations, which is specified by the user. A chosen can be considered right if the algorithm converges to the minimum in a considerable amount of time (with an appropriate learning rate ).
Learning Rate
Important to note is that the learning rate is the value that can put more or less 'weight' on the gradient. By changing the magnitude of the learning rate different results for the gradient descent are obtained. A qualitative visualization is shown below
Figure 2. Visualization of the gradient descent algorithm
As can be seen in the above figure a small learning rate leads to small steps per iteration but will make each update step in the right direction. While the steps are small, it takes a large number of iterations, which is undesirable as the time until convergence is desired to be small.
Whereas the higher learning rate leads to unpredictable behavior as the update steps occur abruptly. Divergence can be a result of choosing a too-large learning rate.
Criteria
Now there is a basic understanding of how the gradient descent algorithm works it is time to elaborate on which criteria need to be met in order to successfully deploy this algorithm.
Differentiability
An important criterion for gradient descent is that the cost function needs to be continuously differentiable. As the name already shows the gradient needs to be computed, which is done by computing the derivatives. Now if a function appears not to be continuously differentiable, this would mean that at certain parts of the gradient, no solution exists. When this is the case steps can not be taken for updating the position. In Figure 3 a visualization is given on how to see if a function is continuously differentiable
Figure 3. Visualization of the gradient descent algorithm
In the figure, there can be seen that if a random line is drawn between points on the curve and the line is inside the corresponding curve it is differentiable. If this is not the case then the function is considered to be non-differentiable and thus the gradient descent algorithm can not be deployed.
Convexity
Convexity is another criterion for gradient descent, especially if an ultimate minimum needs to be found. If a function is convex then there exists only one minimum, visualized on the left side of Figure 3. On contrary, a non-convex function has multiple minima, with one ultimate minimum called the global minimum and the other being the local minimum.
Figure 4. Visualization of the gradient descent algorithm
If gradient descent is now performed on the non-convex function it can occur that it ends up in the local minimum as it moves to the negative direction of the largest gradient. This happens for example if the initial random starting location is chosen to be at the left part in the non-convex visualization. It moves down towards the local minimum, without knowing that there is an even better minimum near a visualization of this given in Figure 4.
Now there are other gradient descent techniques that can prevent getting stuck in the local minima such as gradient descent with momentum.