Linear regression approximates the relationship between two or more variables with a linear model. We can use linear regression to predict a continuous value from the values of other variables. In simple linear regression, there are two variables: a dependent variable and an independent variable. Linear Regression is the easiest and most basic regression to use and understand; it is fast and highly interpretable. It also doesn’t require hyperparameter tuning: tuning the K parameter in K-Nearest Neighbors or the learning rate in Neural Networks isn’t something to worry about in Linear Regression.
For an overview of Linear Regression in Machine Learning, please go through my article titled “Regression in Machine Learning” at
https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/subhasish-sarkar1/2020/03/14/regression-in-machine-learning.
In order to understand linear regression, let us consider a hypothetical example: estimating the approximate CO2 emission of a new car model after its production, with ‘Engine Size of the Car’ as the independent variable and ‘CO2 Emission’ as the target value that we would like to predict. Let us assume that we have the following dataset.
[Figure: Dataset of car engine sizes and their CO2 emissions]
Let us first plot our independent and dependent variables using a scatter plot. With linear regression, we can fit a line through the data. The whole objective of linear regression is to find a line that is a good fit for the data at hand. A good fit, here, means that if we have, for instance, a car with engine size x1 = 3.9 and an actual CO2 Emission of 350, the fit line should predict a value very close to 350 for a new or unknown car model with that engine size.
[Figure: Scatter plot of Engine Size vs. CO2 Emission]
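As a minimal sketch, here is how such a scatter plot could be produced in Python with matplotlib. The engine sizes and emissions below are made-up stand-ins for the dataset above; only the (3.9, 350) pair comes from the example in the text.

```python
import matplotlib.pyplot as plt

# Made-up engine sizes (x1) and CO2 emissions (y) standing in for the dataset;
# only the (3.9, 350) pair is taken from the example in the text.
engine_size = [1.5, 2.0, 2.4, 3.3, 3.5, 3.9]
co2_emission = [136, 196, 221, 233, 255, 350]

plt.scatter(engine_size, co2_emission, color="blue")
plt.xlabel("Engine Size")
plt.ylabel("CO2 Emission")
plt.title("Engine Size vs. CO2 Emission")
plt.show()
```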
The fit line is traditionally written as a polynomial, which, for a simple regression problem with a single independent variable, takes the form
ŷ = θo + θ1 * x1
In this equation, ŷ (usually called y-hat) is the dependent variable, that is, the variable whose value we want to predict, and x1 is the independent variable. θo and θ1 are called the coefficients of the linear equation: θ1 is known as the "slope" or "gradient" of the fit line, and θo is known as the "intercept".
[Figure: The fit line annotated with its slope θ1 and intercept θo]
Linear regression estimates the coefficients θo and θ1 of the line. In linear regression, we calculate θo and θ1 to find the best line to ‘fit’ the data. But, how do we do that? Let us assume for a moment that we have already found the best fit line for our data.
[Figure: The fit line through the data, with actual and predicted values marked for one car]
The green dotted lines represent the actual CO2 Emission values; the orange dotted lines represent the values predicted by the fit line. Now, if we compare the actual emission of this car (y = 250) with what we have predicted (ŷ = 340), we find a (340 − 250) = 90-unit error (also called the residual error), represented by the red bi-directional arrow: Error = y − ŷ = 250 − 340 = −90. This error is only for a single data point in our dataset. A 90-unit error suggests that our prediction line is not very accurate. What we usually do is average the residual errors over all the data points in our dataset, and the goal is to find the line for which this average error is as small as possible. Because positive and negative residuals would cancel each other out, we average the squared residuals instead; this quantity is called the Mean Squared Error (MSE), mathematically represented by the following equation.
MSE = (1/n) * Σ (yi − ŷi)²
Here, the sum runs over all n data points, with yi the actual value and ŷi the predicted value for the i-th point.
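As a quick illustration (not the article's dataset), the MSE can be computed in a few lines of Python with numpy. The first actual/predicted pair reuses the 250 vs. 340 example above; the rest are made up.

```python
import numpy as np

# Actual CO2 emissions and the values a candidate fit line predicted for them.
# The first pair reuses the 250 (actual) vs. 340 (predicted) example from the
# text; the remaining values are made up for illustration.
y_actual = np.array([250, 221, 136, 255])
y_predicted = np.array([340, 230, 150, 240])

# MSE = (1/n) * sum of squared residuals (y - y_hat).
residuals = y_actual - y_predicted
mse = np.mean(residuals ** 2)
print(mse)  # 2150.5
```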
Therefore, the objective of linear regression is to minimize the MSE by finding the best parameters, θo and θ1. For simple linear regression, θo and θ1 can be calculated directly using the following equations.
θ1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
θo = ȳ − θ1 * x̄
Here, x̄ is the mean of the x1 values and ȳ is the mean of the y values in the dataset.
Quite evidently, we first calculate the average of x1 and the average of y. Next, we use those averages to find θ1. Once we have the value of θ1, calculating θo is child’s play. We really don’t need to remember the formulas for calculating the parameters θo and θ1; most of the machine learning libraries in Python, R, and Scala can easily calculate the parameters for us. However, it is always good to understand how everything works.
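To make this concrete, here is a small Python sketch that applies the closed-form equations above to a made-up dataset (not the article's actual values) and then cross-checks the result against scikit-learn's LinearRegression, one of the libraries that computes the parameters for us.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up engine sizes (x1) and CO2 emissions (y); not the article's dataset.
x = np.array([1.5, 2.0, 2.4, 3.3, 3.5, 3.9])
y = np.array([136, 196, 221, 233, 255, 350])

# Closed-form estimates from the equations above.
x_bar, y_bar = x.mean(), y.mean()
theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta0 = y_bar - theta1 * x_bar
print(f"manual:  theta0 = {theta0:.2f}, theta1 = {theta1:.2f}")

# The same fit computed by scikit-learn; both should agree.
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(f"sklearn: theta0 = {model.intercept_:.2f}, theta1 = {model.coef_[0]:.2f}")
```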
Suppose, in our case, θo = 100 and θ1 = 25, so that ŷ = 100 + 25 * x1.
Now, let us imagine that we need to predict the CO2 Emission (y) from the Engine Size (x1) for a car with an engine size of 3.3.
ŷ =100 + 25 * x1
=> CO2 Emission = 100 + 25 * EngineSize = 100 + 25 * 3.3 = 182.5
Thus, we have predicted a CO2 Emission of 182.5 for the specific car under consideration.
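The same worked example in Python, with θo = 100 and θ1 = 25 as the assumed coefficients from above:

```python
# Fit line with the assumed coefficients theta0 = 100 and theta1 = 25.
def predict_co2(engine_size):
    return 100 + 25 * engine_size

print(predict_co2(3.3))  # 182.5
```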