IBM Z and LinuxONE - Solutions

Implementation of Simple Linear Regression in Machine Learning using Python

This technical article demonstrates, step by step, how to practically implement ‘Simple Linear Regression’ in Machine Learning, using Python as the programming language. Readers are expected to have a working knowledge of Python, Pandas and NumPy, as well as an understanding of the theory and mathematics behind ‘Simple Linear Regression’.

Let us start by importing all the standard packages that we are going to need – Pandas, NumPy and matplotlib.pyplot.

Importing standard Packages
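The original article shows this step as a screenshot; a minimal equivalent, assuming the conventional aliases np, pd and plt, would look like this:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs headless
import matplotlib.pyplot as plt
```

The `Agg` backend line is optional in a Jupyter Notebook; it is included here only so the sketch runs outside an interactive session.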


We will be using the ‘BOSTON Housing’ Dataset – this dataset contains information about the different houses in Boston. We will access the dataset from the scikit-learn library. There are 506 data samples and 13 feature variables in the dataset. Our objective is to predict the value of the prices of the houses using the given features.

Importing the Standard "BOSTON Housing" Dataset from sklearn library
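A note for readers on recent library versions: `load_boston()` was deprecated in scikit-learn 1.0 and removed in 1.2, so the article's original import only works on older releases. The sketch below shows the article's call in a comment and, as a fallback, the replacement recipe that scikit-learn's own deprecation notice points to (it assumes network access to lib.stat.cmu.edu):

```python
# On older scikit-learn versions, the article's approach works directly:
#
#     from sklearn.datasets import load_boston
#     data = load_boston()
#
# On newer versions, the same 506 x 13 arrays can be rebuilt from the
# original source, as recommended by scikit-learn's deprecation notice:
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
y = raw_df.values[1::2, 2]                                       # MEDV target
```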

What does the BOSTON Housing Dataset contain?

Here, we are trying to see what the ‘BOSTON Housing’ Dataset contains.

  • data: contains the information for the various houses
  • target: price of the house
  • feature_names: names of the features
  • DESCR: describes the dataset
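Every scikit-learn "Bunch" dataset exposes these same fields. Since `load_boston()` is absent from recent releases, the sketch below demonstrates them with `load_diabetes()` as a stand-in; with the Boston Bunch the attribute access is identical:

```python
from sklearn.datasets import load_diabetes

data = load_diabetes()           # stand-in Bunch; Boston exposes the same fields
print(data.data.shape)           # the information for the various samples
print(data.target.shape)         # the target variable
print(data.feature_names)        # names of the features
print(data.DESCR[:300])          # opening lines of the dataset description
```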

Print 'data.DESCR'


Print 'data.DESCR' - Output1

Print 'data.DESCR' - Output2


The price of the houses is indicated by the variable MEDV – that is our target variable, and the remaining columns are the feature variables from which we will predict the value of a house. Now, let us load the data into a pandas dataframe. We will print the 13 feature variables first – we will only print the first 5 rows of the data.

Pandas Dataframe Feature Variables


Next, we will print the single target variable MEDV – and, once again, we will only print the first 5 rows of the data.

Pandas Dataframe Target Variable
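The two printing steps above can be sketched as follows. Again `load_diabetes()` serves as a stand-in Bunch, since `load_boston()` is gone from recent scikit-learn releases; with Boston, the DataFrame-building pattern is identical (and the target column would be named MEDV):

```python
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()                                  # stand-in Bunch
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.head())                                        # first 5 rows of the features

target = pd.Series(data.target, name="target")
print(target.head())                                    # first 5 rows of the target
```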


Let us print the histogram of the target house price variable.

Target Variable Histogram
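A sketch of the histogram step, using a synthetic stand-in for the MEDV column (normally distributed prices clipped to the 5–50 range that MEDV actually spans):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
medv = rng.normal(22, 9, 506).clip(5, 50)   # synthetic stand-in for MEDV

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(medv, bins=30)
ax.set_xlabel("MEDV")
ax.set_ylabel("Frequency")
fig.savefig("medv_hist.png")
```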


We can also plot scatter diagrams with each of the 13 feature variables being compared with the target house price variable MEDV.

Scatter Plot Diagrams - Code

Scatter Plot Diagrams - 1

Scatter Plot Diagrams - 2

Scatter Plot Diagram - 3
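The scatter-plot step can be sketched like this, looping over the feature columns and plotting each against the target. Synthetic stand-ins for two of the Boston features are used here, but the same loop covers all 13:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-ins for two Boston feature columns and the MEDV target
rng = np.random.default_rng(0)
features = {
    "RM": rng.normal(6.3, 0.7, 506),
    "LSTAT": rng.uniform(2, 35, 506),
}
medv = 7 * features["RM"] - 0.6 * features["LSTAT"] + rng.normal(0, 3, 506)

fig, axes = plt.subplots(1, len(features), figsize=(10, 4))
for ax, (name, values) in zip(axes, features.items()):
    ax.scatter(values, medv, s=8)
    ax.set_xlabel(name)
    ax.set_ylabel("MEDV")
fig.savefig("scatter_vs_medv.png")
```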


We will now create a correlation matrix because that will greatly help us understand and measure the linear relationships that exist between the variables. The correlation coefficient ranges from -1 to 1. If the value is close to 1, it means that there is a strong positive correlation between the two variables. When the value is close to -1, the variables have a strong negative correlation. A value close to 0 indicates that the two variables have little linear relationship.

Correlation Matrix - Code

Correlation Matrix - Output
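A sketch of the correlation-matrix step, computed with pandas' `DataFrame.corr()` on the same synthetic stand-in columns (the signs of the coefficients mirror what the article observes on the real data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Boston columns of interest
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 506)
lstat = rng.uniform(2, 35, 506)
medv = 7 * rm - 0.6 * lstat + rng.normal(0, 3, 506)
df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "MEDV": medv})

corr = df.corr().round(2)   # pairwise Pearson correlation coefficients
print(corr)
```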


If we take a careful look at the correlation matrix, we can see that the feature variable ‘RM’ has a strong positive correlation (0.7) with the target variable ‘MEDV’, whereas ‘LSTAT’ has a high negative correlation (-0.74) with ‘MEDV’. As a general convention for selecting features to fit a linear regression model, we usually select those feature variables which have a high correlation with the target variable. Based on our observation of the correlation matrix, the feature variables ‘RM’ and ‘LSTAT’ can easily be selected. Let us look at the scatter plot diagrams once again to see how the feature variables ‘RM’ and ‘LSTAT’ vary with the target variable ‘MEDV’.

Scatter Plot Diagrams - 4


It can be inferred from the scatter plot diagrams that
> The price of a house in Boston increases as the value of the feature variable ‘RM’ increases – we can see a positive linear correlation between the variables ‘RM’ and ‘MEDV’.
> The price of a house in Boston tends to decrease with an increase in the value of the feature variable ‘LSTAT’. There is a negative correlation between the variables ‘LSTAT’ and ‘MEDV’; the correlation is not exactly linear, though.


We will now concatenate the feature variables ‘RM’ and ‘LSTAT’.

Concatenate the feature variables ‘RM’ and ‘LSTAT’
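The concatenation step can be sketched with `pd.concat(..., axis=1)`, which joins the two selected columns side by side into the feature matrix (synthetic stand-in columns again):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 506),     # synthetic stand-in columns
    "LSTAT": rng.uniform(2, 35, 506),
})

# Join the two selected feature columns side by side (axis=1)
X = pd.concat([df["LSTAT"], df["RM"]], axis=1)
print(X.shape)  # (506, 2)
```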


The next step is to split the data into TRAINING and TESTING sets. Here, we have decided to train our Machine Learning Simple Linear Regression Model with 80% of the sample data present in the ‘BOSTON Housing’ Dataset and test the model with the remaining 20% sample data.

Split the data into TRAINING and TESTING sets
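The 80/20 split described above is what scikit-learn's `train_test_split` does with `test_size=0.2`; a sketch on stand-in arrays of the same size:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 2))   # stand-in for the LSTAT/RM feature matrix
y = rng.normal(size=506)        # stand-in for MEDV

# 80% for TRAINING, 20% for TESTING; random_state fixes the shuffle
# so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)
print(X_train.shape, X_test.shape)  # (404, 2) (102, 2)
```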


We will train our model now.

Train the Model
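Training boils down to fitting scikit-learn's `LinearRegression` on the training set; a sketch on synthetic data with known coefficients, so we can see the fit recover them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(404, 2))
# Synthetic target with known coefficients (3 and -2) plus small noise
y_train = 3 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(0, 0.1, 404)

model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)  # fitted slopes and intercept
```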


And, now, we are going to evaluate our model – we will use Root Mean Squared Error, popularly referred to as RMSE.

Evaluate the Model
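RMSE is the square root of the mean squared difference between actual and predicted values. A sketch of the evaluation step on synthetic data (taking `np.sqrt` of `mean_squared_error` works on every scikit-learn version; the `squared=False` shortcut is a newer addition):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 2))                          # stand-in features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 506)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)
model = LinearRegression().fit(X_train, y_train)

# RMSE = sqrt(mean((y_true - y_pred)^2)) on the held-out TESTING set
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
```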


We can now plot the prediction line.

Plot the prediction line
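A sketch of the prediction-line plot: with a single feature, the fitted model is a straight line, which can be overlaid on the scatter of the data. A synthetic stand-in for the LSTAT/MEDV relationship is used here:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
lstat = rng.uniform(2, 35, 506).reshape(-1, 1)          # single stand-in feature
medv = 35 - 0.6 * lstat.ravel() + rng.normal(0, 3, 506)

model = LinearRegression().fit(lstat, medv)

# Scatter the data and overlay the fitted regression line
grid = np.linspace(lstat.min(), lstat.max(), 100).reshape(-1, 1)
fig, ax = plt.subplots()
ax.scatter(lstat, medv, s=8, label="data")
ax.plot(grid, model.predict(grid), color="red", label="prediction line")
ax.set_xlabel("LSTAT")
ax.set_ylabel("MEDV")
ax.legend()
fig.savefig("prediction_line.png")
```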


And, finally, let us look at how good a prediction our Machine Learning Simple Linear Regression Model makes.

Model prediction
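One way to sketch this final check is to place a few actual TESTING-set values next to the model's predictions for them, side by side in a small DataFrame (synthetic stand-in data once more):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, 506)
lstat = rng.uniform(2, 35, 506)
medv = 7 * rm - 0.6 * lstat + rng.normal(0, 3, 506)
X = np.column_stack([lstat, rm])

X_train, X_test, y_train, y_test = train_test_split(
    X, medv, test_size=0.2, random_state=5)
model = LinearRegression().fit(X_train, y_train)

# Actual vs predicted prices for the first 5 held-out samples
comparison = pd.DataFrame({
    "actual": y_test[:5],
    "predicted": model.predict(X_test)[:5],
})
print(comparison)
```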


So, we have looked at how you can practically implement ‘Simple Linear Regression’ in Machine Learning, using Python as the programming language. You can find the Jupyter Notebook (saved in .html format) at https://github.com/SubhasishSarkarIndia/JupyterNotebook.git.


************************************************************************************************************************************
The author of the technical article, Subhasish Sarkar (SS), is an IBM Z Champion for 2020.
************************************************************************************************************************************