
An Insight into the Simple Linear Regression in Machine Learning

By Subhasish Sarkar (SS) posted Fri May 01, 2020 08:08 AM

  
Linear regression is the approximation of a linear model used to describe the relationship between two or more variables. We can use linear regression to predict a continuous value from other variables. In simple linear regression, there are two variables: a dependent variable and an independent variable. Linear regression is the easiest and most basic regression technique to use and understand; it is fast and highly interpretable. It also doesn't require hyperparameter tuning: unlike the K parameter in K-Nearest Neighbors or the learning rate in neural networks, there is nothing to tune in linear regression.

For an overview of Linear Regression in Machine Learning, please go through my article titled “Regression in Machine Learning” at https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/subhasish-sarkar1/2020/03/14/regression-in-machine-learning.

To understand linear regression, let us consider a hypothetical example: estimating the approximate CO2 emission of a new car model after its production, with 'Engine Size' as the independent variable and 'CO2 Emission' as the target value we would like to predict. Let us assume that we have the following dataset.

[Figure: Dataset]
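The dataset in the original post is an image and is not reproduced here. As a rough stand-in, here is how such a dataset might look in Python; all values below are invented for illustration and are not the post's actual numbers.

# Hypothetical engine sizes (x1) and CO2 emissions (y); these values
# are illustrative only, not the actual dataset from the post.
engine_size = [1.0, 1.6, 2.0, 2.4, 3.0, 3.3, 3.9, 4.4]
co2_emission = [160, 190, 215, 240, 280, 300, 350, 380]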


Let us first plot our independent and dependent variables using a scatter plot. With linear regression, we can fit a line through the data. The whole objective of linear regression is to find a line that is a good fit for the data at hand. A good fit means that if we have, for instance, a car with engine size x1 = 3.9 and an actual CO2 emission of 350, then for a new or unknown car model with that engine size, the predicted CO2 emission should be very close to the actual value of 350.

[Figure: Scatter Plot]
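A minimal matplotlib sketch of such a scatter plot, reusing the illustrative stand-in values from above:

import matplotlib.pyplot as plt

# Illustrative values only (same as the stand-in dataset above)
engine_size = [1.0, 1.6, 2.0, 2.4, 3.0, 3.3, 3.9, 4.4]
co2_emission = [160, 190, 215, 240, 280, 300, 350, 380]

# Independent variable on the x-axis, dependent variable on the y-axis
plt.scatter(engine_size, co2_emission)
plt.xlabel("Engine Size")
plt.ylabel("CO2 Emission")
plt.show()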

The fitted line is traditionally represented as a polynomial which, for a simple regression problem with a single independent variable, takes the following form.

ŷ = θ0 + θ1 * x1


In this equation, ŷ (usually called y-hat) is the dependent variable, i.e., the variable whose value we want to predict, and x1 is the independent variable. θ0 and θ1 are called the coefficients of the linear equation: θ1 is known as the "slope" or "gradient" of the fitted line, and θ0 is known as the "intercept".
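In code, the fitted line is simply a function of x1. A tiny sketch (the function name is mine, and the coefficient values are whatever the estimation described below produces):

# The fitted line as a function: theta0 is the intercept, theta1 the slope.
def predict(x1, theta0, theta1):
    return theta0 + theta1 * x1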

[Figure: Slope-Intercept Form]


Linear regression estimates the coefficients θ0 and θ1 of the line; that is, we calculate θ0 and θ1 so as to find the line that best 'fits' the data. But how do we do that? Let us assume for a moment that we have already found the best-fit line for our data.

[Figure: Fit Line]


The green dotted lines represent the actual CO2 emission values, and the orange dotted lines represent the CO2 emission values predicted by the fit line. Now, if we compare the actual emission value of the car with the predicted one, we find a (340 − 250) = 90-unit error (also called the residual error), represented by the red bi-directional arrow: Error = y − ŷ = 250 − 340 = −90. This error is for a single data point only, and a 90-unit error indicates that our prediction line is not very accurate at that point. What we usually do is compute the average of the residual errors over all the data points in our dataset, and the goal is to find the line for which this average error is as small as possible. Technically speaking, we minimize what is called the Mean Squared Error (MSE), mathematically represented by the following equation.

MSE = (1/n) * Σᵢ (yᵢ − ŷᵢ)²
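To make the formula concrete, here is a minimal Python sketch of the MSE computation (the function name is mine, not from any particular library); y holds the actual values and y_hat the values predicted by the fitted line.

def mean_squared_error(y, y_hat):
    # Mean of the squared residuals (y_i - y_hat_i)^2
    n = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n

# For the single data point discussed above: actual 250, predicted 340
print(mean_squared_error([250], [340]))  # 8100.0, i.e. (-90) squared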


Therefore, the objective of linear regression is to minimize the MSE by finding the best parameters θ0 and θ1, which can be calculated using the following equations.

θ1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
θ0 = ȳ − θ1 * x̄


Quite evidently, we first calculate the average of x1 and the average of y, then use those values to find θ1. Once we have the value of θ1, calculating θ0 is child's play. We really don't need to remember the formulas for calculating the parameters θ0 and θ1; most machine learning libraries in Python, R, and Scala can easily calculate the parameters for us. However, it is always good to understand how everything works.
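For completeness, here is a minimal pure-Python sketch of those two formulas, applied to the illustrative stand-in data from earlier (with made-up data, the estimates will of course not match the θ0 = 100 and θ1 = 25 assumed in the worked example below):

# Illustrative values only (same stand-in data as before)
engine_size = [1.0, 1.6, 2.0, 2.4, 3.0, 3.3, 3.9, 4.4]
co2_emission = [160, 190, 215, 240, 280, 300, 350, 380]

# Averages of x1 and y
x_bar = sum(engine_size) / len(engine_size)
y_bar = sum(co2_emission) / len(co2_emission)

# theta1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
theta1 = sum((x - x_bar) * (y - y_bar)
             for x, y in zip(engine_size, co2_emission)) \
         / sum((x - x_bar) ** 2 for x in engine_size)

# theta0 = y_bar - theta1 * x_bar
theta0 = y_bar - theta1 * x_bar

print("theta0 =", theta0, "theta1 =", theta1)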

Suppose, in our case, θ0 = 100 and θ1 = 25. Then ŷ = 100 + 25 * x1.

Now, let us imagine that we need to predict the CO2 emission (y) from the engine size (x1) of an automobile whose engine size is 3.3.

ŷ = 100 + 25 * x1

=> CO2Emission = 100 + 25 * EngineSize = 100 + 25 * 3.3 = 182.5

Thus, we have predicted that the CO2 emission for the car under consideration is 182.5.
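As noted earlier, machine learning libraries can estimate these parameters for us. A minimal scikit-learn sketch, again assuming the illustrative stand-in data (its fitted coefficients, and therefore its prediction for an engine size of 3.3, will differ from the θ0 = 100 and θ1 = 25 assumed above):

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative values only; scikit-learn expects a 2-D array of features
X = np.array([[1.0], [1.6], [2.0], [2.4], [3.0], [3.3], [3.9], [4.4]])
y = np.array([160, 190, 215, 240, 280, 300, 350, 380])

model = LinearRegression().fit(X, y)

print("theta0 =", model.intercept_, "theta1 =", model.coef_[0])
print("Predicted CO2 emission for engine size 3.3:", model.predict([[3.3]])[0])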
