
The Art of Optimization: Understanding Gradient Descent

By Danish Hasarat posted Tue March 26, 2024 08:51 PM



In the world of machine learning and AI, optimization algorithms are among the most important tools for training models to make accurate predictions or decisions. Gradient Descent is one of the most widely used of these algorithms, serving to minimize loss functions and optimize model parameters.

In this blog, you will be introduced to Gradient Descent, its variants, and its mathematical foundations.

Understanding Gradient Descent

At its core, gradient descent is an iterative algorithm designed to minimize a specific objective function. Whether you’re trying to minimize the error of a regression model or maximize the accuracy of a classification task, gradient descent offers a systematic way to search the parameter space for the best solution.

The foundation of gradient descent is the gradient itself. In calculus, the gradient of a function points in the direction of steepest ascent; the negative gradient therefore points in the direction of steepest descent. Gradient descent iteratively moves the parameters in the direction opposite to the gradient of the loss function, with the goal of converging to a minimum of that function.

Basic Formulation:

Let’s delve into the basic mathematical formulation of Gradient Descent. Suppose we have a parameter vector θ and a loss function J(θ) that we aim to minimize. The update rule for Gradient Descent can be expressed as follows:

θ_{t+1} = θ_t − α ∇J(θ_t)

Here, θ_t denotes the parameter vector at iteration t, α represents the learning rate (a critical hyperparameter dictating the magnitude of each step), and ∇J(θ_t) signifies the gradient of the loss function J(θ) evaluated at θ_t. This succinct formulation encapsulates the essence of Gradient Descent: iterative parameter updates driven by the negative gradient direction.

The above equation shows how Gradient Descent iteratively updates parameters by moving in the opposite direction of the gradient.
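As a concrete illustration (not from the original post), here is a minimal sketch of the update rule on a toy one-dimensional objective J(θ) = (θ − 3)², whose minimum is at θ = 3; the learning rate and iteration count are arbitrary choices for the example:

```python
# Gradient descent on J(theta) = (theta - 3)^2.
# The gradient is dJ/dtheta = 2 * (theta - 3).

def gradient_descent(theta0, learning_rate=0.1, n_iters=100):
    theta = theta0
    for _ in range(n_iters):
        grad = 2 * (theta - 3)                # gradient of J at current theta
        theta = theta - learning_rate * grad  # step against the gradient
    return theta

theta_star = gradient_descent(theta0=0.0)
print(round(theta_star, 4))  # converges very close to 3
```

Each iteration shrinks the distance to the minimum by a constant factor, which is the geometric convergence behaviour typical of gradient descent on a well-conditioned quadratic.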

Mathematical Deep Dive

Let’s consider a simple linear regression problem where we aim to minimize the mean squared error (MSE) loss function:

L(w) = (1/N) Σ_{i=1}^{N} (y_i − f(x_i, w))²

where we define:

  • L(w) as the loss function.
  • w as the parameters (weights) of the model.
  • N as the total number of data points.
  • (x_i, y_i) as the ith data point.
  • f(x_i, w) as the predicted output given input x_i and parameters w.

In gradient descent, the parameter w is updated in the direction that decreases the loss function. This is done by computing the gradient of the loss function with respect to the parameters and then moving in the opposite direction.

Gradient Calculation

We can compute the gradient of the loss function with respect to the parameters using the chain rule of calculus:

∇L(w) = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i, w)) ∇f(x_i, w)

Here, ∇L(w) represents the gradient of the loss function, and ∇f(x_i, w) signifies the gradient of the model output with respect to the parameters. This expression sums the gradients of the model outputs with respect to the parameters, each weighted by the difference between the actual output y_i and the predicted output f(x_i, w), across all data points.

Parameter Update

After computing the gradient, we proceed to update the parameters using the following update rule:

w ← w − α ∇L(w)

In this equation, the learning rate α is the hyperparameter that controls the size of the step taken in the negative gradient direction. Adjusting the learning rate controls the rate of convergence during training.
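To make the gradient calculation and update rule concrete, here is a minimal NumPy sketch (not from the original post; the toy data, learning rate, and iteration count are illustrative) that fits a one-parameter linear model f(x, w) = w·x by gradient descent on the MSE loss:

```python
import numpy as np

# Toy dataset: y = 2.5 * x with no noise, so the true weight is 2.5.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.5 * x

w = 0.0      # initial weight
alpha = 0.1  # learning rate (illustrative choice)
for _ in range(200):
    residual = y - w * x
    # Chain-rule gradient of the MSE: -(2/N) * sum((y_i - w*x_i) * x_i)
    grad = -(2 / len(x)) * np.sum(residual * x)
    w = w - alpha * grad  # update rule: step against the gradient

print(round(w, 3))  # recovers a weight close to 2.5
```

Because the data are noiseless, the loop converges to the true weight; with noisy data it would converge to the least-squares estimate instead.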

Variants of Gradient Descent

Gradient Descent exhibits several variants, each tailored to address specific challenges or improve convergence speed. Some notable variants include:

1. Batch Gradient Descent:

This method calculates the gradient using the whole dataset. It yields stable convergence (to the global minimum for convex loss functions), but can be expensive to compute for large datasets.

∇L(w) = −(2/N) Σ_{i=1}^{N} (y_i − f(x_i, w)) ∇f(x_i, w)

This expression is the average gradient of the loss function over all N data points. It gives the direction and size of the adjustment needed for the parameter w to reduce the loss L(w).

2. Stochastic Gradient Descent (SGD):

Computes the gradient using a single randomly chosen sample. Each update is much faster, but the method introduces considerable variance into the parameter changes.

∇L_i(w) = −2 (y_i − f(x_i, w)) ∇f(x_i, w)

In this expression, (x_i, y_i) represents the ith data point, and ∇f(x_i, w) signifies the gradient of the model output with respect to the parameters. This formula describes the instantaneous change in the loss for a single data point and guides the parameter update to reduce that loss.
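A minimal SGD sketch (illustrative, not from the original post) for the same one-parameter linear model shows how each update uses one randomly picked data point:

```python
import numpy as np

# Toy dataset: y = 1.5 * x, so the true weight is 1.5.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.5 * x

w = 0.0
alpha = 0.05  # smaller learning rate to tame the noisy updates
for _ in range(2000):
    i = rng.integers(len(x))              # pick one sample at random
    grad = -2 * (y[i] - w * x[i]) * x[i]  # gradient on that sample only
    w = w - alpha * grad

print(round(w, 2))  # the noisy path still ends near 1.5
```

Each step is cheap (one sample instead of the full dataset), which is why SGD scales to very large datasets even though individual updates are noisy.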

3. Mini-batch Gradient Descent:

Balances Batch GD with SGD by calculating gradients using a mini-batch of the dataset. This provides both speed and consistency.

∇L_B(w) = −(2/B) Σ_{i∈B} (y_i − f(x_i, w)) ∇f(x_i, w)

Here, B stands for the batch size, i.e., how many data points are included in each mini-batch. This formula computes the average gradient across the mini-batch, which gives a more consistent estimate of the true gradient while still being efficient compared with the full-batch computation.
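The following sketch (again illustrative; the batch size of 32 is an arbitrary common choice) shows mini-batch gradient descent on the same style of toy problem:

```python
import numpy as np

# Toy dataset: y = -0.8 * x, so the true weight is -0.8.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = -0.8 * x

w, alpha, B = 0.0, 0.1, 32
for _ in range(500):
    idx = rng.choice(len(x), size=B, replace=False)  # sample a mini-batch
    xb, yb = x[idx], y[idx]
    # Average gradient over the B points in the mini-batch
    grad = -(2 / B) * np.sum((yb - w * xb) * xb)
    w = w - alpha * grad

print(round(w, 2))  # ends near -0.8
```

Averaging over 32 points reduces the variance of each step relative to single-sample SGD, while each iteration still touches only a fraction of the data.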

4. Momentum-based Gradient Descent:

Incorporates momentum to accelerate convergence by accumulating gradients from previous steps:

v_t = β v_{t−1} + ∇L(w_t)

Here, β represents the momentum parameter, which controls how much past gradients contribute to the current step.

The parameters w are then updated using the momentum term:

w_{t+1} = w_t − α v_t

In this equation, the learning rate α denotes the size of the step taken along the momentum-smoothed gradient v_t. This method allows for smoother and often faster convergence to the optimal solution.
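Here is a small sketch of the two momentum equations above on the toy objective J(θ) = (θ − 3)²; the values of α and β are illustrative (β = 0.9 is a common default, not something prescribed by the post):

```python
# Momentum-based gradient descent on J(theta) = (theta - 3)^2.
# v accumulates an exponentially decaying sum of past gradients,
# which smooths the update direction.

def momentum_gd(theta0, alpha=0.05, beta=0.9, n_iters=300):
    theta, v = theta0, 0.0
    for _ in range(n_iters):
        grad = 2 * (theta - 3)     # gradient of J at current theta
        v = beta * v + grad        # momentum accumulation: v_t = beta*v_{t-1} + grad
        theta = theta - alpha * v  # step along the smoothed direction
    return theta

print(round(momentum_gd(0.0), 3))  # converges toward 3
```

With β = 0 this reduces exactly to plain gradient descent; larger β lets the iterate build up "velocity" along consistent gradient directions.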

5. Regularization:

Regularization techniques like L1 or L2 regularization are essential for avoiding overfitting and improving a model’s generalization.

L1 regularization (also called Lasso regularization) modifies the loss function by adding a penalty term proportional to the absolute values of the parameter weights:

L_reg(w) = L(w) + λ Σ_j |w_j|

Here, λ is the regularization parameter that controls the degree of regularization. L1 regularization promotes sparsity in the parameter weights by driving some of them exactly to zero, thus performing a form of feature selection.

L2 regularization (also called Ridge regularization) adds a penalty term proportional to the sum of the squares of the parameter weights:

L_reg(w) = L(w) + λ Σ_j w_j²

As with L1 regularization, λ is the regularization parameter. L2 regularization encourages small weights across all parameters, preventing any of them from becoming too large and thus reducing the model’s sensitivity to individual data points.

Adding L1 or L2 regularization to the training loss reduces overfitting, since models are discouraged from learning overly complicated patterns from the training data, resulting in better generalization performance on unseen data.
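The effect of the L2 penalty on gradient descent is easy to see in code: the penalty λ Σ w² contributes an extra 2λw term to the gradient, which shrinks the learned weight. The following sketch (illustrative; the data and λ value are not from the original post) fits the same toy linear model with and without the penalty:

```python
import numpy as np

# Toy dataset: y = 2.0 * x, so the unregularized solution is w = 2.0.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x

def fit(lam, alpha=0.1, n_iters=500):
    w = 0.0
    for _ in range(n_iters):
        grad_mse = -(2 / len(x)) * np.sum((y - w * x) * x)
        grad = grad_mse + 2 * lam * w  # extra term from the L2 penalty
        w = w - alpha * grad
    return w

w_plain = fit(lam=0.0)
w_ridge = fit(lam=0.5)
print(round(w_plain, 2), round(w_ridge, 2))  # the ridge weight is visibly smaller
```

The regularized weight is pulled toward zero; increasing λ strengthens that pull, trading a little training-set fit for lower sensitivity to individual data points.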


To sum up, Gradient Descent is a key enabler among optimization algorithms, allowing machine learning models to learn from data and make accurate predictions. By understanding Gradient Descent’s mechanics, mathematical formulation, and variants, we gain valuable insight into how to train models efficiently and navigate the intricate world of optimization. As artificial intelligence continues to evolve, Gradient Descent continues to play an important role in our search for discovery and innovation.