IBM Z and LinuxONE - Solutions

Implementation of Simple Linear Regression in Machine Learning using Python

This technical article demonstrates, step by step, how to practically implement ‘Simple Linear Regression’ in Machine Learning, using Python as the programming language. Readers are expected to have a working knowledge of Python, Pandas and NumPy, as well as an understanding of the theory and mathematics behind ‘Simple Linear Regression’.

Let us start by importing all the standard packages that we are going to need – Pandas, NumPy and matplotlib.pyplot.

Importing standard Packages
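The original article shows this step as a screenshot; a minimal equivalent, assuming the conventional aliases np, pd and plt, would look like this:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs headless
import matplotlib.pyplot as plt
```

The `Agg` backend line is optional in a Jupyter Notebook; it is included here only so the sketch runs outside an interactive session.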


We will be using the ‘BOSTON Housing’ Dataset – this dataset contains information about the different houses in Boston. We will access the dataset from the scikit-learn library. There are 506 data samples and 13 feature variables in the dataset. Our objective is to predict the value of the prices of the houses using the given features.

Importing the Standard "BOSTON Housing" Dataset from sklearn library
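A note for readers on recent library versions: `load_boston()` was deprecated in scikit-learn 1.0 and removed in 1.2, so the article's original import only works on older releases. The sketch below shows the article's call in a comment and, as a fallback, the replacement recipe that scikit-learn's own deprecation notice points to (it assumes network access to lib.stat.cmu.edu):

```python
# On older scikit-learn versions, the article's approach works directly:
#
#     from sklearn.datasets import load_boston
#     data = load_boston()
#
# On newer versions, the same 506 x 13 arrays can be rebuilt from the
# original source, as recommended by scikit-learn's deprecation notice:
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
y = raw_df.values[1::2, 2]                                       # MEDV target
```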

What does the BOSTON Housing Dataset contain?

Here, we are trying to see what the ‘BOSTON Housing’ Dataset contains.

  • data: contains the information for the various houses
  • target: price of the house
  • feature_names: names of the features
  • DESCR: describes the dataset
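Every scikit-learn "Bunch" dataset exposes these same fields. Since `load_boston()` is absent from recent releases, the sketch below demonstrates them with `load_diabetes()` as a stand-in; with the Boston Bunch the attribute access is identical:

```python
from sklearn.datasets import load_diabetes

data = load_diabetes()           # stand-in Bunch; Boston exposes the same fields
print(data.data.shape)           # the information for the various samples
print(data.target.shape)         # the target variable
print(data.feature_names)        # names of the features
print(data.DESCR[:300])          # opening lines of the dataset description
```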

Print 'data.DESCR'


Print 'data.DESCR' - Output1

Print 'data.DESCR' - Output2


The price of the houses is indicated by the variable MEDV – that is our target variable, and the remaining columns are the feature variables from which we will predict the value of a house. Now, let us load the data into a pandas dataframe. We will print the 13 feature variables first – we will only print the first 5 rows of the data.

Pandas Dataframe Feature Variables


Next, we will print the single target variable MEDV – and, once again, we will only print the first 5 rows of the data.

Pandas Dataframe Target Variable
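The two printing steps above can be sketched as follows. Again `load_diabetes()` serves as a stand-in Bunch, since `load_boston()` is gone from recent scikit-learn releases; with Boston, the DataFrame-building pattern is identical (and the target column would be named MEDV):

```python
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()                                  # stand-in Bunch
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.head())                                        # first 5 rows of the features

target = pd.Series(data.target, name="target")
print(target.head())                                    # first 5 rows of the target
```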


Let us print the histogram of the target house price variable.

Target Variable Histogram
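A sketch of the histogram step, using a synthetic stand-in for the MEDV column (normally distributed prices clipped to the 5–50 range that MEDV actually spans):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
medv = rng.normal(22, 9, 506).clip(5, 50)   # synthetic stand-in for MEDV

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(medv, bins=30)
ax.set_xlabel("MEDV")
ax.set_ylabel("Frequency")
fig.savefig("medv_hist.png")
```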


We can also plot scatter diagrams with each of the 13 feature variables being compared with the target house price variable MEDV.

Scatter Plot Diagrams - Code

Scatter Plot Diagrams - 1

Scatter Plot Diagrams - 2

Scatter Plot Diagram - 3
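The scatter-plot step can be sketched like this, looping over the feature columns and plotting each against the target. Synthetic stand-ins for two of the Boston features are used here, but the same loop covers all 13:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-ins for two Boston feature columns and the MEDV target
rng = np.random.default_rng(0)
features = {
    "RM": rng.normal(6.3, 0.7, 506),
    "LSTAT": rng.uniform(2, 35, 506),
}
medv = 7 * features["RM"] - 0.6 * features["LSTAT"] + rng.normal(0, 3, 506)

fig, axes = plt.subplots(1, len(features), figsize=(10, 4))
for ax, (name, values) in zip(axes, features.items()):
    ax.scatter(values, medv, s=8)
    ax.set_xlabel(name)
    ax.set_ylabel("MEDV")
fig.savefig("scatter_vs_medv.png")
```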


We will now create a correlation matrix because that will greatly help us understand and measure the linear relationships that exist between the variables. The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When the value is close to -1, the variables have a strong negative correlation. A value close to 0 indicates that the two variables have little linear relationship.

Correlation Matrix - Code

Correlation Matrix - Output
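A sketch of the correlation-matrix step, computed with pandas' `DataFrame.corr()` on the same synthetic stand-in columns (the signs of the coefficients mirror what the article observes on the real data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Boston columns of interest
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 506)
lstat = rng.uniform(2, 35, 506)
medv = 7 * rm - 0.6 * lstat + rng.normal(0, 3, 506)
df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "MEDV": medv})

corr = df.corr().round(2)   # pairwise Pearson correlation coefficients
print(corr)
```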


If we take a careful look at the correlation matrix, we can see that the feature variable ‘RM’ has a strong positive correlation (0.7) with the target variable ‘MEDV’, whereas ‘LSTAT’ has a high negative correlation (-0.74) with ‘MEDV’. As a general convention for selecting features to fit a linear regression model, we usually select those feature variables which have a high correlation with the target variable. Based on our observation of the correlation matrix, the feature variables ‘RM’ and ‘LSTAT’ can easily be selected. Let us look at the scatter plot diagrams once again to see how the feature variables ‘RM’ and ‘LSTAT’ vary with the target variable ‘MEDV’.

Scatter Plot Diagrams - 4


It can be inferred from the scatter plot diagrams that
> The price of a house in Boston increases as the value of the feature variable ‘RM’ increases – we can see a positive linear correlation between the variables ‘RM’ and ‘MEDV’.
> The price of a house in Boston tends to decrease with an increase in the value of the feature variable ‘LSTAT’. There is a negative correlation between the variables ‘LSTAT’ and ‘MEDV’; the correlation is not exactly linear, though.


We will now concatenate the feature variables ‘RM’ and ‘LSTAT’.

Concatenate the feature variables ‘RM’ and ‘LSTAT’
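The concatenation step can be sketched with `pd.concat(..., axis=1)`, which joins the two selected columns side by side into the feature matrix (synthetic stand-in columns again):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 506),     # synthetic stand-in columns
    "LSTAT": rng.uniform(2, 35, 506),
})

# Join the two selected feature columns side by side (axis=1)
X = pd.concat([df["LSTAT"], df["RM"]], axis=1)
print(X.shape)  # (506, 2)
```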


The next step is to split the data into TRAINING and TESTING sets. Here, we have decided to train our Machine Learning Simple Linear Regression Model with 80% of the sample data present in the ‘BOSTON Housing’ Dataset and test the model with the remaining 20% sample data.

Split the data into TRAINING and TESTING sets
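The 80/20 split described above is what scikit-learn's `train_test_split` does with `test_size=0.2`; a sketch on stand-in arrays of the same size:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 2))   # stand-in for the LSTAT/RM feature matrix
y = rng.normal(size=506)        # stand-in for MEDV

# 80% for TRAINING, 20% for TESTING; random_state fixes the shuffle
# so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)
print(X_train.shape, X_test.shape)  # (404, 2) (102, 2)
```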


We will train our model now.

Train the Model
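Training boils down to fitting scikit-learn's `LinearRegression` on the training set; a sketch on synthetic data with known coefficients, so we can see the fit recover them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(404, 2))
# Synthetic target with known coefficients (3 and -2) plus small noise
y_train = 3 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(0, 0.1, 404)

model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)  # fitted slopes and intercept
```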


And, now, we are going to evaluate our model – we will use Root Mean Squared Error, popularly referred to as RMSE.

Evaluate the Model
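RMSE is the square root of the mean squared difference between actual and predicted values. A sketch of the evaluation step on synthetic data (taking `np.sqrt` of `mean_squared_error` works on every scikit-learn version; the `squared=False` shortcut is a newer addition):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 2))                          # stand-in features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)
model = LinearRegression().fit(X_train, y_train)

# RMSE = sqrt(mean((y_true - y_pred)^2)) on the held-out TESTING set
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```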


We can now plot the prediction line.

Plot the prediction line
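A sketch of the prediction-line plot: with a single feature, the fitted model is a straight line, which can be overlaid on the scatter of the data. A synthetic stand-in for the LSTAT/MEDV relationship is used here:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
lstat = rng.uniform(2, 35, 506).reshape(-1, 1)          # single stand-in feature
medv = 35 - 0.6 * lstat.ravel() + rng.normal(0, 3, 506)

model = LinearRegression().fit(lstat, medv)

# Scatter the data and overlay the fitted regression line
grid = np.linspace(lstat.min(), lstat.max(), 100).reshape(-1, 1)
fig, ax = plt.subplots()
ax.scatter(lstat, medv, s=8, label="data")
ax.plot(grid, model.predict(grid), color="red", label="prediction line")
ax.set_xlabel("LSTAT")
ax.set_ylabel("MEDV")
ax.legend()
fig.savefig("prediction_line.png")
```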


And, finally, let us look at how good a prediction our Machine Learning Simple Linear Regression Model makes.

Model prediction
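One way to sketch this final check is to place a few actual TESTING-set values next to the model's predictions for them, side by side in a small DataFrame (synthetic stand-in data once more):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 506)
lstat = rng.uniform(2, 35, 506)
medv = 7 * rm - 0.6 * lstat + rng.normal(0, 3, 506)
X = np.column_stack([lstat, rm])

X_train, X_test, y_train, y_test = train_test_split(
    X, medv, test_size=0.2, random_state=5)
model = LinearRegression().fit(X_train, y_train)

# Actual vs predicted prices for the first 5 held-out samples
comparison = pd.DataFrame({
    "actual": y_test[:5],
    "predicted": model.predict(X_test)[:5],
})
print(comparison)
```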


So, we have looked at how you can practically implement ‘Simple Linear Regression’ in Machine Learning, using Python as the programming language. You can find the Jupyter Notebook (saved in .html format) at https://github.com/SubhasishSarkarIndia/JupyterNotebook.git.


************************************************************************************************************************************
The author of the technical article, Subhasish Sarkar (SS), is an IBM Z Champion for 2020.
************************************************************************************************************************************