
In the world of machine learning and AI, optimization algorithms are among the most important tools for training models to make accurate predictions or decisions. Gradient Descent is one of the most widely used optimization algorithms for minimizing loss functions and optimizing parameters.

In this blog, you will be introduced to Gradient Descent, its variants, and its mathematical foundations.

At its core, gradient descent is an iterative algorithm designed to minimize a specific objective function. Whether you’re trying to minimize the error of a regression model or maximize the accuracy of a classification task, gradient descent offers a systematic way to search the parameter space for the best solution.

The foundation of gradient descent is the gradient itself. In calculus, the gradient of a function points in the direction of steepest ascent; the negative gradient points in the direction of steepest descent. Gradient descent iteratively adjusts parameters in the direction opposite to the gradient of a loss function, with the goal of converging to the function’s minimum.

Let’s delve into the basic mathematical formulation of Gradient Descent. Suppose we have a parameter vector **θ** and a loss function **J(θ)** that we aim to minimize. The update rule for Gradient Descent can be expressed as follows:

$$\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)$$

Here, $\theta_t$ denotes the parameter vector at iteration $t$, $\alpha$ is the learning rate, and $\nabla J(\theta_t)$ is the gradient of the loss function evaluated at $\theta_t$.

The above equation shows how Gradient Descent iteratively updates parameters by moving in the opposite direction of the gradient.
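To make this concrete, here is a minimal Python sketch of the update rule applied to the toy objective $J(\theta) = \theta^2$, whose minimum is at $\theta = 0$. The function, names (`grad_J`, `lr`), and hyperparameter values are illustrative choices, not part of the derivation above:

```python
# Gradient descent on the toy objective J(theta) = theta**2.

def grad_J(theta):
    return 2.0 * theta  # analytic gradient of J(theta) = theta**2

theta = 5.0   # illustrative starting point
lr = 0.1      # learning rate (alpha)
for step in range(50):
    theta = theta - lr * grad_J(theta)  # move against the gradient

print(theta)  # approaches the minimizer theta = 0
```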

Let’s consider a simple linear regression problem where we aim to minimize the mean squared error (MSE) loss function:

$$L(w) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - f(x_i, w)\big)^2$$

with:

- $L(w)$ as the loss function
- $w$ as the parameters (weights) of the model
- $N$ as the total number of data points
- $(x_i, y_i)$ as the $i$-th data point
- $f(x_i, w)$ as the predicted output given input $x_i$ and parameters $w$

In gradient descent, the parameter $w$ is updated in the direction that reduces the loss function. This is done by computing the gradient of the loss function with respect to the parameters and then moving in the opposite direction.

We can compute the gradient of the loss function with respect to the parameters using the chain rule of calculus:

$$\nabla L(w) = -\frac{2}{N} \sum_{i=1}^{N} \big(y_i - f(x_i, w)\big)\, \nabla_w f(x_i, w)$$

Here, $\nabla L(w)$ represents the gradient of the loss function, and $\nabla_w f(x_i, w)$ is the gradient of the model’s prediction with respect to the parameters $w$.
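As a sketch, this is what the chain-rule gradient looks like in NumPy for a linear model $f(x, w) = x \cdot w$, for which $\nabla_w f(x_i, w) = x_i$. The function name and array shapes are assumptions for illustration:

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the MSE loss for a linear model f(x, w) = x @ w.
    Assumed shapes: X is (N, d), y is (N,), w is (d,)."""
    residuals = y - X @ w                        # y_i - f(x_i, w), shape (N,)
    return -(2.0 / len(y)) * (X.T @ residuals)   # -(2/N) * sum_i residual_i * x_i
```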

**Parameter Update**

After computing the gradient, we proceed to update the parameters using the following update rule:

$$w \leftarrow w - \alpha \nabla L(w)$$

In this equation, the learning rate ($\alpha$) is a hyperparameter that controls the size of the step taken in the direction opposite to the gradient. Modifying the learning rate controls the convergence rate during training.
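Putting the gradient and the update rule together, a minimal training loop for linear regression might look like the sketch below. The synthetic data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

# Synthetic linear-regression data (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w = np.zeros(3)
alpha = 0.1  # learning rate
for _ in range(200):
    grad = -(2.0 / len(y)) * X.T @ (y - X @ w)  # MSE gradient from above
    w = w - alpha * grad                        # parameter update rule

print(w)  # should approach true_w
```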

Gradient Descent has several variants, each tailored to address specific challenges or improve convergence speed. Some notable variants include:

**Batch Gradient Descent**

This method calculates the gradient using the whole dataset. It converges stably (to the global minimum for convex loss functions), but can be expensive to compute for large datasets:

$$\nabla L(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(w)$$

This expression averages the loss gradient over all the data points, where $N$ is the total number of data points and $L_i$ is the loss on the $i$-th point. It gives the direction and size of the adjustment needed for the parameter $w$ to reduce the loss $L(w)$.
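The sum over $i$ can be written out explicitly as an average of per-example gradients, as in this sketch (helper names are illustrative, and the linear model from earlier is assumed):

```python
import numpy as np

def per_example_grad(w, x_i, y_i):
    return -2.0 * (y_i - x_i @ w) * x_i   # gradient of (y_i - x_i . w)**2

def batch_grad(w, X, y):
    grads = [per_example_grad(w, x_i, y_i) for x_i, y_i in zip(X, y)]
    return np.mean(grads, axis=0)         # (1/N) * sum of per-example gradients
```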

**Stochastic Gradient Descent (SGD)**

Computes the gradient using a single sample at each step. This method is much faster per update, but it introduces a lot of variance into the parameter changes:

$$w \leftarrow w - \alpha \nabla L_i(w)$$

In this expression, $(x_i, y_i)$ represents the single data point, drawn at random, that is used for the current update.
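A minimal SGD sketch for the same linear-regression loss, sampling one data point per update (data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
alpha = 0.01
for _ in range(2000):
    i = rng.integers(len(y))                  # pick a single random data point
    grad_i = -2.0 * (y[i] - X[i] @ w) * X[i]  # gradient on that one example
    w = w - alpha * grad_i
```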

**Mini-Batch Gradient Descent**

Balances Batch GD with SGD by calculating gradients on a mini-batch of the dataset, providing both speed and consistency:

$$\nabla L(w) \approx \frac{1}{B} \sum_{i \in \text{batch}} \nabla L_i(w)$$

Here, $B$ stands for the batch size, i.e., how many data points are included in each mini-batch. This formula averages the gradient across the mini-batch, which gives a more consistent estimate of the true gradient than SGD while remaining far cheaper than the full-batch computation.
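A mini-batch sketch with an assumed batch size of $B = 16$, reshuffling the data each epoch (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
alpha, B = 0.05, 16
for epoch in range(50):
    order = rng.permutation(len(y))           # reshuffle each epoch
    for start in range(0, len(y), B):
        idx = order[start:start + B]
        Xb, yb = X[idx], y[idx]               # current mini-batch
        grad = -(2.0 / len(idx)) * Xb.T @ (yb - Xb @ w)  # average over the batch
        w = w - alpha * grad
```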

**Gradient Descent with Momentum**

Incorporates momentum to accelerate convergence by accumulating gradients from previous steps:

$$v_t = \beta v_{t-1} + (1 - \beta)\, \nabla L(w_t)$$

Here, $\beta$ represents the momentum parameter, which controls how much past gradients contribute to the current step.

The parameters $w$ are then updated using the momentum term:

$$w_{t+1} = w_t - \alpha v_t$$

In this equation, the learning rate ($\alpha$) denotes the size of the step taken along the momentum-corrected gradient $v_t$.
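A sketch of the momentum update above, using the exponentially weighted form $v_t = \beta v_{t-1} + (1 - \beta)\nabla L(w_t)$ (data and hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
v = np.zeros(3)          # velocity: accumulated past gradients
alpha, beta = 0.1, 0.9   # learning rate and momentum parameter
for _ in range(200):
    grad = -(2.0 / len(y)) * X.T @ (y - X @ w)
    v = beta * v + (1.0 - beta) * grad   # accumulate past gradients
    w = w - alpha * v                    # step along the momentum term
```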

**Regularization in Gradient Descent**

Regularization techniques like L1 or L2 regularization are essential for avoiding overfitting and improving a model’s generalization.

**L1 regularization** (also called Lasso regularization) modifies the loss function by adding a penalty term proportional to the absolute values of the parameter weights:

$$L_{\text{reg}}(w) = L(w) + \lambda \sum_{j} |w_j|$$

Here, $\lambda$ is the regularization parameter that controls the degree of regularization. **L1 regularization** promotes sparsity in the parameter weights by driving some of them to exactly zero, thus performing feature selection.

**L2 regularization** (also called Ridge regularization) adds a penalty term proportional to the sum of the squares of the parameter weights:

$$L_{\text{reg}}(w) = L(w) + \lambda \sum_{j} w_j^2$$

As with L1 regularization, $\lambda$ is the regularization parameter. L2 regularization encourages small weights across all parameters, preventing any of them from becoming too large and thus reducing the model’s sensitivity to individual data points.
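As a sketch, the two penalties change the gradient step as follows. Note that $|w_j|$ is not differentiable at zero, so the L1 version uses a subgradient; the helper names and the $\lambda$ parameter usage are illustrative:

```python
import numpy as np

def l1_gradient(w, base_grad, lam):
    # Subgradient of lam * sum(|w_j|); np.sign(0) = 0 handles the kink at zero.
    return base_grad + lam * np.sign(w)

def l2_gradient(w, base_grad, lam):
    # Gradient of lam * sum(w_j ** 2) is 2 * lam * w, shrinking weights toward 0.
    return base_grad + 2.0 * lam * w
```

Either function can replace the plain gradient in the update loops above; the extra term is what pushes weights toward zero (L1) or keeps them uniformly small (L2).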

Adding **L1 regularization** or **L2 regularization** to the training loss reduces overfitting because the model is discouraged from learning overly complicated patterns from the training data, resulting in better generalization performance on unseen data.

To sum up, Gradient Descent is a key enabler among optimization algorithms, allowing machine learning models to learn from data and make accurate predictions. By understanding Gradient Descent’s mechanics, mathematical formulation, and variants, we gain valuable insight into how to train models efficiently and navigate the intricate world of optimization. As artificial intelligence continues to evolve, Gradient Descent will continue to play an important role in our search for discovery and innovation.
