Algorithm Introduction
Generalized linear models (GZLMs) have long been widely used analytical tools for many types of data, because their very general formulation covers not only common statistical models, such as linear regression for normally distributed targets, logistic regression for binary data, and log-linear models for count data, but also many other useful statistical models. Since these models assume independent observations, the "Generalized Linear Engine" (GLE) is designed to build them for large and distributed data.
Model formulation
A GZLM of the target y with predictor variables X and offset variable O has the form:

η = Xβ + O,    E(y) = μ = g⁻¹(η),    y ~ F
where η is the linear predictor; O is an offset variable with a constant coefficient of 1 for each observation; g(.) is the monotonic differentiable link function, which specifies how the mean of y is related to the linear predictor η; and F is the target probability distribution. Choosing different combinations of a probability distribution and a link function results in different models. Some combinations are well-known models that are also provided in other SPSS procedures; the following table lists these combinations and the corresponding SPSS procedures.
GLE includes 9 distributions: 3 continuous (normal, inverse Gaussian, gamma), 5 discrete (binomial, Poisson, negative binomial, ordinal multinomial, nominal multinomial), and 1 mixed distribution (Tweedie).
Combinations of probability distribution and link function
Not every combination of a probability distribution and a link function is valid: if an improper combination is specified, an error message is issued.
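For example, three well-known combinations correspond to the classical models mentioned in the introduction (standard GLM facts, not GLE-specific):
- normal distribution + identity link, g(μ) = μ: linear regression
- binomial distribution + logit link, g(μ) = log(μ / (1 − μ)): logistic regression
- Poisson distribution + log link, g(μ) = log(μ): log-linear model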

Note that the available distributions depend on the measurement level of the target; there are 4 different levels in applications (a small code sketch follows the list):
- If a target is continuous, all distributions except nominal multinomial and ordinal multinomial are allowed. Note that binomial is allowed because the target could be an “events” variable, in which case the user must also specify a “trials” variable. The default is the normal distribution.
- If a target is nominal, then nominal multinomial and binomial distributions are allowed. The default is nominal multinomial.
- If a target is ordinal, then the ordinal multinomial, nominal multinomial, and binomial distributions are allowed. The default is ordinal multinomial.
- If a target is a flag, only the binomial distribution is allowed.
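For context, the same distribution-and-link pairings appear in GLM implementations outside SPSS. Below is a minimal sketch in Python using statsmodels (an outside library standing in for illustration, not the GLE API) that fits the binomial distribution with its canonical logit link, the combination GLE uses for a flag target:

```python
import numpy as np
import statsmodels.api as sm

# Toy data: a binary (flag) target and two continuous predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# Binomial family; its default link is the canonical logit
design = sm.add_constant(X)  # prepend an intercept column
result = sm.GLM(y, design, family=sm.families.Binomial()).fit()
print(result.summary())      # coefficients, Wald tests, confidence intervals
```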
Model Selection
GLE supports model selection for generalized linear models, which involves two aspects:
- Distribution and/or link function specification: if either or both are unspecified, they are selected automatically based on the measurement level and storage type of the target.
- Variable selection or regularization: this option can be turned on or off. When it is on, the available methods are forward stepwise, lasso (L1 regularization), elastic net (L1+L2 regularization), and ridge regression (L2 regularization). A sketch of one of these methods appears below.
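GLE performs these selection and regularization methods internally on distributed data; as a small outside illustration of the same idea, here is a sketch of elastic-net regularized logistic regression using scikit-learn (an assumed stand-in, not GLE itself):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Elastic net combines the L1 (lasso) and L2 (ridge) penalties;
# l1_ratio=0.5 weights them equally, and the saga solver supports it
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, max_iter=5000)
clf.fit(X, y)
print(clf.coef_)  # L1 shrinkage drives some coefficients toward zero
```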
Where can you get the SPSS Generalized Linear Engine?
Product Integration with UI
Harvested in the Product
- IBM Watson Studio: all of the SPSS algorithms are available in Watson Studio (formerly DSX), where the user should follow the DSX access policy.
API Documentation for Spark and Python
- The API documentation for Spark and Python is available Here; the documentation is online and free to access.
More Information & Use Cases
There is a video available that introduces the SPSS Generalized Linear Engine and provides more details. You can watch it on YouTube below.
We also present two use cases below. One is the case from the video, using IBM SPSS Modeler, while the other models bike sharing in an IBM Watson Studio notebook.
Use Case 1: Using GLE to assess credit risk with IBM SPSS Modeler
Eric, a bank loan officer, has a dataset of 700 customers who were previously given loans. From the data, he wants to:
- Identify the characteristics that are indicative of people who are likely to default on loans
- Use those characteristics to identify good and bad credit risks
- Receive advice on whether 150 incoming customers will default on a loan
Training data profile

There are 700 records in total in the training data, and each record contains several features describing the profile of one customer. The fields are as follows:
- age: Customer age in years
- ed: Level of education, which is a nominal field:
  - “1” = Did not complete high school
  - “2” = High school degree
  - “3” = Some college
  - “4” = College degree
  - “5” = Post-undergraduate degree
- employ: Years with current employer
- address: Years at current address
- income: Household income in thousands
- debtinc: Debt to income ratio (x100)
- creddebt: Credit card debt in thousands
- othdebt: Other debt in thousands
- default: Previously defaulted (“1” = yes, “0” = no)
Scoring data profile

The value "$NULL" in the field “default” indicates a missing value which needs to be predicted with the GLE model.
Model stream in IBM SPSS Modeler

This is the stream for GLE model building and scoring. The data source bankloan.sav can be found in the demo source folder of the Modeler installation path. There are 850 records in total in bankloan.sav: the first 700 records are the training records, while the last 150 records are the scoring (testing) data. Within the Type node, we can see the detailed metadata information for each field.
Discover the metadata by Type node

The Type node can help the user to detect or specify metadata information for each field, including:
- Usage type, such as range, set, ordered set, or flag, for each field in your dataset.
- Options for handling missing values and system nulls.
- The role of a field for modeling purposes.
- Values for a field as well as options used to automatically read values from the dataset.
- Field and value labels.
Here we set the default field as the Target, with the other fields as Inputs (predictors).
Select the training data with the Select node
The first 700 records are the historical customer data used to train the GLE model, so the default field is populated for these records. We can set a condition rule in the Select node to separate out the training data.

Here the condition is to discard any records for which the default field is equal to "$null$".
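For readers working outside Modeler, the equivalent split can be written in a few lines of pandas (a hypothetical equivalent that assumes the data has been exported from bankloan.sav to a CSV file named bankloan.csv):

```python
import pandas as pd

# Load the full dataset of 850 records (assumed CSV export of bankloan.sav)
df = pd.read_csv("bankloan.csv")

# Training data: records where 'default' is known;
# scoring data: records where 'default' is missing ($null$ in Modeler)
train = df[df["default"].notna()]
score = df[df["default"].isna()]
print(len(train), len(score))  # expected: 700 and 150
```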
Build model with GLE node
With the training data, we can use the GLE node to build the model. Opening the GLE node in the stream, we can see the settings. Since we defined the default field as the target in the Type node, it is already set as the target on the settings page. The target distribution and link function are set to binary logistic regression, since the target default can take only two values, 0 and 1. Binary logistic regression uses a binomial distribution with a logit link, which is the appropriate choice when the target is a binary response.
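Concretely, with the logit link the fitted model for the probability of default, p, takes the standard logistic regression form (the predictor list is abbreviated here, and nominal fields such as ed are dummy-coded):

log(p / (1 − p)) = β₀ + β₁·age + β₂·employ + … + βₖ·othdebt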

Selecting the Model effects item, the user can specify the input predictors: either use the predefined inputs from the Type node, or use custom inputs specified by the user. Here we choose the predefined inputs.

With the combination of settings for the target, model effects, and distribution/link function, a GLE model can be built. An expert user can dive into the Build Options tab to specify detailed model building settings, such as post-estimation features, the parameter estimation method, and whether to use model selection and regularization.

After all of the settings are specified, the user can click the Run button in the lower left corner. A GLE model is then built and kept as the yellow diamond “nugget”.

After creating a GLE model, the following information is available in the output viewer. Double-clicking the nugget to open the output viewer, we can see the model details:
Model Information table

The Model Information table provides key information about the model. The table identifies some high-level model settings, such as:
- The name of the target field selected in either the Type node or the GLE node Fields tab.
- The modeled and reference target category percentages.
- The probability distribution and associated link function.
- The model building method used.
- The number of predictors input and the number in the final model.
- The classification accuracy percentage; here we can see the classification accuracy is around 82%.
- The model type.
- The percentage accuracy of the model, if the target is not continuous.
Records Summary and test of model effects

The records summary table shows how many records were used to fit the model, and how many were excluded. The details shown include the number and percentage of the records included and excluded, as well as the unweighted number if you used frequency weighting.
The test of model effects table shows the Type III Wald chi-square test for each model effect.
Predictor Importance

The Predictor Importance graph shows the importance of the top 10 inputs (predictors) in the model as a bar chart.
If the model contains more than 10 fields, you can change the selection of predictors that are included in the chart by using the slider beneath the chart. The indicator marks on the slider are a fixed width, and each mark on the slider represents 10 fields. You can move the indicator marks along the slider to display the next or previous 10 fields, ordered by predictor importance.
You can double-click the chart to open a separate dialog box in which you can edit the graph settings. For example, you can amend items such as the size of the graph, and the size and color of the fonts used. When you close this separate editing dialog box, the changes are applied to the chart that is displayed in the Output tab.
Residual by Predicted plot

You can use this plot either to identify outliers or to diagnose non-linearity and non-constant error variance. The expected pattern is that the distribution of standardized deviance residuals across the predicted values of the linear predictor has a mean value of zero and a constant range; an ideal plot shows the points randomly scattered about a horizontal line through zero.
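For reference, the deviance residual plotted here follows the conventional GLM definition (a textbook formula, not specific to GLE):

rᵢ = sign(yᵢ − μ̂ᵢ) · √dᵢ

where dᵢ is observation i's contribution to the model deviance; the standardized version additionally divides by a factor involving the observation's leverage.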
Parameter estimation

Parameter estimation shows the values of the model coefficients, along with the following information (the standard formulas appear after the list):
- Standard error.
- Lower and upper 95% Wald confidence interval.
- The Wald Chi-Square hypothesis test.
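For a single coefficient β̂ with standard error SE(β̂), these quantities are related by the standard textbook formulas (not GLE-specific):

Wald χ² = (β̂ / SE(β̂))²,    95% Wald CI = β̂ ± 1.96 · SE(β̂)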
So far, we have helped Eric leverage the GLE model to describe the characteristics that are indicative of people who are likely to default on loans. With the GLE model, we can now score the 150 new, incoming customers, using their customer information to advise whether a loan should be offered to them.
Score with new customer data
As mentioned in the introduction, the data source bankloan.sav includes data to be scored for 150 new, incoming customers. We need another Select node to separate out these records.

Here we use the condition to include the records in which the value of the default field is "$null$", indicating that the value needs to be predicted.
Now we can connect the Select node to the model nugget, and add a Table node downstream of the nugget to show the scoring result.
Right-click the Table node and select "Run" to execute the scoring process.

After execution completes, double-clicking the Table node shows the prediction results, as below:

Compared with the original scoring data, there are two newly generated fields, "$L-default" and "$LC-default".
- "$L-default" shows the prediction of default, “0” means "No" while "1" means "Yes".
- "$LC-default" shows the confidence for the prediction.
With the scoring results, the predicted value and its confidence, Eric can decide whether to offer a loan to each new, incoming customer, and have confidence in his judgement of the risk.
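For example, a simple post-processing rule over the scored output could be written as follows in pandas (a purely hypothetical sketch; the scored_customers.csv file name and the 0.8 confidence threshold are illustrative assumptions, not part of the stream):

```python
import pandas as pd

# Hypothetical export of the Table node's scored records
scored = pd.read_csv("scored_customers.csv")

# Offer a loan when the model predicts no default ($L-default == 0)
# with prediction confidence ($LC-default) of at least 0.8
offer = (scored["$L-default"] == 0) & (scored["$LC-default"] >= 0.8)
print(scored[offer].head())
```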
Use Case 2: Using GLE to model bike sharing by IBM Watson Studio Notebook
This use case shows you how to create a predictive model of bike sharing trends by using GLE on Apache Spark.
The bike sharing model will:
- Identify what affects the amount of bike rentals.
- Predict future daily bike rental amounts based on date, weather, and season.
For detailed information and the notebook code, please refer to the use case published in Watson Studio: Model bike sharing data with SPSS.
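GLE itself runs on Apache Spark through the SPSS algorithms in Watson Studio; as a rough outside analogue, Spark MLlib's own GeneralizedLinearRegression (a different implementation, not GLE) can fit a comparable Poisson log-linear model for daily rental counts. A minimal sketch, assuming a CSV file named bike_sharing_daily.csv with the column names shown:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression

spark = SparkSession.builder.appName("bike-sharing-glm").getOrCreate()

# Assumed schema: season, weekday, temp, hum, windspeed, cnt (daily rentals)
df = spark.read.csv("bike_sharing_daily.csv", header=True, inferSchema=True)

# Assemble predictor columns into a single features vector
assembler = VectorAssembler(
    inputCols=["season", "weekday", "temp", "hum", "windspeed"],
    outputCol="features")
data = assembler.transform(df)

# Poisson distribution with a log link -- the classical choice for counts
glr = GeneralizedLinearRegression(family="poisson", link="log",
                                  labelCol="cnt", featuresCol="features")
model = glr.fit(data)
print(model.coefficients, model.intercept)
```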