
Step-by-Step Guide to Building a Linear Regression Model inside IBM Db2

By SHAIKH Quader posted Tue May 07, 2024 05:22 PM

  

Despite being one of the earliest machine learning techniques, linear regression remains a top choice among ML practitioners for regression tasks. For the past three years, over 80% of respondents to Kaggle's annual State of Data Science and Machine Learning survey named linear regression as one of the ML algorithms they use most frequently. IBM Db2 provides an in-database stored procedure (SP) for linear regression as part of its ML library, a collection of over 200 SPs for performing different ML tasks in the database. Using the linear regression SP and other functionality of Db2's ML library, ML practitioners can build and deploy linear regression models in the database when their ML dataset is available in a Db2 database. In this post, I will walk through the steps of building and using a linear regression pipeline with SQL in a Db2 database.

Let’s begin.

The Regression Task

In this exercise, I will use the GoSales dataset, which is available from this link. The dataset has 60,252 synthetic customer profiles and their purchase amounts at an imaginary outdoor equipment store. Its columns are an ID, the customer attributes GENDER, AGE, MARITAL_STATUS, and PROFESSION, and the target column PURCHASE_AMOUNT.

I want to learn a multiple linear regression function of the following equation form that takes as input the four feature columns — AGE, GENDER, MARITAL_STATUS, and PROFESSION — and predicts the PURCHASE_AMOUNT.
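In symbols, the model takes roughly this form (the coefficient names are mine, and the three nominal columns are assumed to be numerically encoded before they enter the products):

\[
\text{PURCHASE\_AMOUNT} \approx \beta_0 + \beta_1 \cdot \text{AGE} + \beta_2 \cdot \text{GENDER} + \beta_3 \cdot \text{MARITAL\_STATUS} + \beta_4 \cdot \text{PROFESSION}
\]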

Using the training examples, the linear regression algorithm will learn the values of five parameters: the intercept and four coefficients, one per input column.

Following these steps, I created a Db2 table with the name GOSALES_FULL under GOSALES schema and loaded the dataset into it.

Train / Test Split

First, I will divide the records from the GOSALES_FULL table into two partitions: a training partition and a test partition. For making these partitions, I will call the SPLIT_DATA stored procedure (SP) as follows.

CALL IDAX.SPLIT_DATA('intable=GOSALES.GOSALES_FULL, id=ID, traintable=GOSALES.GOSALES_TRAIN, testtable=GOSALES.GOSALES_TEST, fraction=0.8, seed=1')

These are the parameters I passed to this SP: (1) intable: the input table from which I want to create partitions. The value is GOSALES_FULL; (2) id: the name of the id column in the input table; (3) traintable: the name I want to give to the table that will have the training records; (4) testtable: the name of the output table where the SP will store the test records; (5) fraction: the portion of the records that I want in the training partition; (6) seed: I can set a value for reproducibility of the same partitions.

The SP will randomly divide records from the GOSALES_FULL table into two output tables: GOSALES_TRAIN and GOSALES_TEST. The following two SQL statements will count the number of records in these two tables.

SELECT count(*) FROM GOSALES.GOSALES_TRAIN
SELECT count(*) FROM GOSALES.GOSALES_TEST

The above counts confirm that the train and test tables hold 80% and 20%, respectively, of the 60,252 records in the original table.
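Conceptually, SPLIT_DATA shuffles the rows with a fixed seed and cuts them at the requested fraction. A rough Python sketch of that idea (the SP's actual partitioning scheme may differ in its details):

```python
import random

def split_data(rows, fraction=0.8, seed=1):
    """Randomly assign each row to a train or test partition,
    reproducibly via a fixed seed -- roughly what SPLIT_DATA does."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

# With 60252 rows and fraction=0.8, the partitions come out 48202 / 12050.
train, test = split_data(list(range(60252)))
print(len(train), len(test))  # 48202 12050
```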

Data Exploration

Now, I will look into some sample records from the training dataset, GOSALES_TRAIN.

SELECT * FROM GOSALES.GOSALES_TRAIN FETCH FIRST 5 ROWS ONLY

From the above sample records, I get a feel for the customer records I will work with. Next, I will generate summary statistics of the entire training set using SUMMARY1000 SP:

CALL IDAX.SUMMARY1000('intable=GOSALES.GOSALES_TRAIN, outtable=GOSALES.GOSALES_TRAIN_SUM1000, incolumn=GENDER;AGE;MARITAL_STATUS;PROFESSION')

In this SP call, the intable parameter names the training table from which I want to gather summary statistics. The outtable parameter names the output table where I want the SP to save the summary statistics of the entire training table. In the incolumn parameter, I list the columns whose statistics I want to collect.

SUMMARY1000 SP saves the collected statistics from the training table in three output tables: (1) GOSALES_TRAIN_SUM1000: has summary statistics of all columns listed in the incolumn parameter; (2) GOSALES_TRAIN_SUM1000_NUM: has summary statistics of only the numeric columns from the incolumn parameter, (3) GOSALES_TRAIN_SUM1000_CHAR: has summary statistics of categorical columns from the incolumn parameter.

For ease of viewing, I will look at the summary statistics of each column type separately: first the numeric type, then the nominal type.

SELECT * FROM GOSALES.GOSALES_TRAIN_SUM1000_NUM

The dataset has one numeric feature column, AGE. This summary table provides a range of statistics, such as the mean, variance, and skewness. It also reports that the AGE column has 1878 missing values.

Similarly, I find the following summary statistics from the GOSALES.GOSALES_TRAIN_SUM1000_CHAR table.

SELECT * FROM GOSALES.GOSALES_TRAIN_SUM1000_CHAR

From this summary, I can see that all three nominal columns have some missing values.

Data Preprocessing

From data exploration, I identified two kinds of data preprocessing tasks: handling missing values and dealing with nominal columns. In this step, I will address both tasks.

Dealing with missing values

I will replace missing values in the training dataset using the IMPUTE_DATA SP. This SP supports replacing missing values using one of four methods: mean, median, most frequent value, or a constant. All four methods work with numeric columns, whereas only the last two apply to nominal columns.
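The imputation strategies themselves are simple. A minimal Python sketch of the idea (the method names here are illustrative, not the SP's exact parameter values):

```python
from statistics import mean, median
from collections import Counter

def impute(values, method="mean", constant=None):
    """Fill None entries with the column mean, median, most frequent
    value, or a supplied constant -- the strategies IMPUTE_DATA offers."""
    present = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(present)
    elif method == "median":
        fill = median(present)
    elif method == "mostfrequent":
        fill = Counter(present).most_common(1)[0][0]
    elif method == "replace":  # constant value, as used for nominal columns
        fill = constant
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

print(impute([20, None, 40], "mean"))                 # [20, 30, 40]
print(impute(["M", None, "M", "F"], "mostfrequent"))  # ['M', 'M', 'M', 'F']
print(impute(["M", None], "replace", constant="M"))   # ['M', 'M']
```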

AGE column: I will replace missing values in the AGE column with the mean age value:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TRAIN, incolumn=AGE, method=mean')

In the above SP call, I passed three parameters: (1) intable: the name of the input table, GOSALES_TRAIN; (2) incolumn: the name of the column whose missing values to impute, AGE in this case; (3) method: a supported imputation strategy. I chose mean.

Similarly, the following SP calls will replace missing values in GENDER, MARITAL_STATUS, and PROFESSION columns with the most frequent value of each column. I looked up the most frequent values of each column from the GOSALES_TRAIN_SUM1000_CHAR table that was created during the data exploration step.

Imputing missing values in the GENDER column:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TRAIN, method=replace, nominalValue=M, incolumn=GENDER')

Imputing missing values in the MARITAL_STATUS column:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TRAIN, method=replace, nominalValue=Married, incolumn=MARITAL_STATUS')

Imputing missing values in the PROFESSION column:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TRAIN, method=replace, nominalValue=Other, incolumn=PROFESSION')

By running the following SQL statement, I can confirm that the AGE column in the training dataset is now free of any missing values:

SELECT count(*) FROM GOSALES.GOSALES_TRAIN WHERE AGE IS NULL

Similarly, the following statements count missing values in the GENDER, MARITAL_STATUS, and PROFESSION columns. None of them has any missing values left.

Count missing values in the GENDER COLUMN:

SELECT count(*) FROM GOSALES.GOSALES_TRAIN WHERE GENDER IS NULL

Count missing values in the MARITAL_STATUS column:

SELECT count(*) FROM GOSALES.GOSALES_TRAIN WHERE MARITAL_STATUS IS NULL

Count missing values in the PROFESSION column:

SELECT count(*) FROM GOSALES.GOSALES_TRAIN WHERE PROFESSION IS NULL

Dealing with nominal columns

The linear regression algorithm requires that all its input columns be numeric. So, before invoking a linear regression algorithm, ML practitioners convert any non-numeric columns to numbers using some encoding scheme, such as 1-hot encoding. The linear regression SP of Db2 can natively handle nominal columns: when the input table has a nominal column, the SP internally converts it into a set of numeric columns using 1-hot encoding. So I can leave the three nominal columns (GENDER, MARITAL_STATUS, and PROFESSION) as-is and let the SP take care of them.
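1-hot encoding replaces a nominal column with one 0/1 indicator column per distinct value. A minimal Python sketch of the transformation (the SP's internal representation may differ):

```python
def one_hot(values):
    """Encode a nominal column as one 0/1 indicator column per
    distinct value, mirroring what the SP does internally."""
    levels = sorted(set(values))
    encoded = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, encoded

levels, encoded = one_hot(["Married", "Single", "Married"])
print(levels)   # ['Married', 'Single']
print(encoded)  # [[1, 0], [0, 1], [1, 0]]
```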

Now, the GOSALES_TRAIN dataset is ready for model training.

Model Training

The following call to the LINEAR_REGRESSION SP will train a linear regression model using training examples from the GOSALES_TRAIN table. The SP will use the columns listed in the incolumn parameter as input features and the column named in the target parameter as the output column. Since the intercept parameter is set to true, the SP will also learn the value of the intercept.

CALL IDAX.LINEAR_REGRESSION('model=GOSALES.GOSALES_LINREG, intable=GOSALES.GOSALES_TRAIN, id=ID, target=PURCHASE_AMOUNT,incolumn=AGE;GENDER;MARITAL_STATUS;PROFESSION, intercept=true')
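Conceptually, training with intercept=true amounts to solving an ordinary least-squares problem on a design matrix with a leading column of ones. A minimal numpy sketch with toy numeric data (not the actual GoSales columns, and not the SP's internals):

```python
import numpy as np

# Toy design matrix (features already numeric) and target values.
X = np.array([[20.0], [30.0], [40.0], [50.0]])
y = np.array([100.0, 150.0, 200.0, 250.0])

# intercept=true: prepend a column of ones so the first learned
# weight plays the role of the intercept.
Xb = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(coef)  # [intercept, slope], here approximately [0.0, 5.0]
```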

After the training completes, the SP will add the new model, GOSALES_LINREG, to Db2’s model catalog. The following SP call will show the list of existing models in the catalog, which now includes my new model.

CALL IDAX.LIST_MODELS('format=short, all=true')

Additionally, the LINEAR_REGRESSION SP saves the learned values of intercept and the coefficients, along with several other learned parameter values, in a table. The table’s name takes this form: MODELNAME_MODEL. For the GOSALES_LINREG model, the name of its metadata table is GOSALES_LINREG_MODEL.

The following SQL will display the values of the learned model parameters.

SELECT VAR_NAME, LEVEL_NAME, VALUE FROM GOSALES.GOSALES_LINREG_MODEL

You may have noticed that the above output has more feature columns than the training set originally had. This is because the LINEAR_REGRESSION SP applied 1-hot encoding to the nominal columns, so the final feature set has one feature for each distinct value in those columns.

Generating Predictions with the Model

In this step, I will use the GOSALES_LINREG model to predict the purchase amount of the customers in the GOSALES_TEST table. Before generating predictions, I will preprocess this dataset with the same preprocessing step I used on the training dataset: imputing missing values. I will use the following SP calls to impute missing values in the AGE, GENDER, MARITAL_STATUS, and PROFESSION columns of the test dataset.

Imputing missing values in the AGE column:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TEST, method=mean, incolumn=AGE')

Imputing missing values in the GENDER column:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TEST, method=replace, nominalValue=M, incolumn=GENDER')

Imputing missing values in the MARITAL_STATUS column:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TEST, method=replace, nominalValue=Married, incolumn=MARITAL_STATUS')

Imputing missing values in the PROFESSION column:

CALL IDAX.IMPUTE_DATA('intable=GOSALES.GOSALES_TEST, method=replace, nominalValue=Other, incolumn=PROFESSION')

Now the test dataset is ready to be passed as input to the model.

PREDICT_LINEAR_REGRESSION SP will use GOSALES_LINREG model to generate predictions for records in the GOSALES_TEST table. The SP will save the predictions in the GOSALES_TEST_PREDICTIONS table, as per the table name specified in the outtable parameter.

CALL IDAX.PREDICT_LINEAR_REGRESSION('model=GOSALES.GOSALES_LINREG, intable=GOSALES.GOSALES_TEST, outtable=GOSALES.GOSALES_TEST_PREDICTIONS, id=ID')

The following SQL displays sample predictions from the GOSALES_TEST_PREDICTIONS table.

SELECT * FROM GOSALES.GOSALES_TEST_PREDICTIONS FETCH FIRST 5 ROWS ONLY

Model Evaluation

For the test customer records, the actual purchase amount is available in the GOSALES_TEST table and the predicted purchase amount is in the GOSALES_TEST_PREDICTIONS table. From these actual and predicted purchase amounts, I can compute the Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percent Error (MAPE) via SQL to evaluate the model's predictive performance.

MSE:

I will use MSE SP to compute the MSE:

CALL IDAX.MSE('intable=GOSALES.GOSALES_TEST, id=ID, target=PURCHASE_AMOUNT, resulttable=GOSALES.GOSALES_TEST_PREDICTIONS, resulttarget=PURCHASE_AMOUNT')

MAE:

I will use MAE SP to calculate MAE:

CALL IDAX.MAE('intable=GOSALES.GOSALES_TEST, id=ID, target=PURCHASE_AMOUNT, resulttable=GOSALES.GOSALES_TEST_PREDICTIONS, resulttarget=PURCHASE_AMOUNT')

MAPE:

I wrote the following SQL to compute MAPE:

SELECT avg(abs(A.PURCHASE_AMOUNT - B.PURCHASE_AMOUNT) / A.PURCHASE_AMOUNT * 100) AS MAPE
FROM GOSALES.GOSALES_TEST AS A, GOSALES.GOSALES_TEST_PREDICTIONS AS B
WHERE A.ID = B.ID
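To make the three metric definitions concrete, here they are in plain Python on toy values (like the SQL above, MAPE assumes no actual amount is zero):

```python
def mse(actual, predicted):
    """Mean Squared Error: average of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean Absolute Error: average of absolute prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean Absolute Percent Error; undefined if any actual value is 0."""
    return sum(abs(a - p) / a * 100 for a, p in zip(actual, predicted)) / len(actual)

actual, predicted = [100.0, 200.0], [110.0, 180.0]
print(mse(actual, predicted))   # 250.0
print(mae(actual, predicted))   # 15.0
print(mape(actual, predicted))  # 10.0
```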

Dropping the Model

If I want to drop this model, I can use the DROP_MODEL SP:

CALL IDAX.DROP_MODEL('model=GOSALES.GOSALES_LINREG')

Conclusion

In this exercise, I built, evaluated, and deployed an end-to-end linear regression pipeline using 29 simple SQL statements: 17 of them calls to SPs provided by Db2, and the rest SELECT statements. In this ML workflow, I didn't need to move any data outside the database, nor did I need separate infrastructure for developing, training, and serving ML models. For many companies, machine learning in the database can be a cost-effective and quicker path to embracing AI.

Related Resources

For creating an in-database ML pipeline using my above steps, you can download the GoSales dataset from this link and follow instructions on this page to set up a Db2 database.

To learn more about Db2's in-database ML library, check out the Db2 product documentation. To find more examples of creating in-database ML models with Db2, check out this GitHub repo.

