Global AI and Data Science


Analyzing Customer Segmentation Using the SPSS TwoStep Cluster Method

By Bo Song posted Thu December 06, 2018 10:32 PM

  

Introduction

Clustering is a machine learning technique that groups data points into clusters so that points within a cluster are similar to each other, while the dissimilarity between clusters is as high as possible. Because data points can represent anything, clustering is a common technique for statistical data analysis that is used in many fields. In marketing, customer segmentation allows marketers to better tailor their promotional, marketing, and product development strategies to various audience subsets.

Various clustering algorithms have been developed to group data into clusters in diverse domains. However, most traditional clustering algorithms, such as k-means and hierarchical clustering, are effective and accurate only on small data sets and do not usually scale up to large ones. They also work effectively either on pure numeric data or on pure categorical data, but perform poorly on mixed categorical and numeric data. Finally, none of these methods directly addresses the question of how many clusters to build, because determining the number of clusters is itself a difficult problem.
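To make the contrast concrete, here is a minimal k-means sketch in plain Python (a toy illustration, not production code). Every step depends on a numeric distance and a numeric mean, which is exactly why such methods cannot directly handle categorical fields:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on 1-D numeric data (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.3]  # two obvious numeric groups
print(kmeans(data, 2))
```

The two returned centers land near 1.0 and 10.07, the means of the two groups; there is no analogous "mean" for a field such as Marital Status.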

SPSS weighed the advantages and disadvantages of these methods and applied that knowledge to produce a new method, called SPSS TwoStep Cluster Analysis, that has several desirable features that differentiate it from traditional clustering techniques. It supports:

  1. Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data sets efficiently. IBM SPSS Modeler has two different versions of TwoStep Cluster: TwoStep Cluster and TwoStep-AS Cluster. TwoStep Cluster is the traditional node that runs on the IBM SPSS Modeler Server. TwoStep-AS Cluster runs when it's connected to IBM SPSS Analytic Server. TwoStep-AS is designed for big data and achieves better performance through distributed computation.

  2. Handling of mixed categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.

  3. Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.

  4. Outlier handling. The procedure can be set to automatically exclude outliers, or unusual cases that can contaminate your results.
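The scalability point can be sketched in a few lines of plain Python. This is not the SPSS implementation; it is a toy illustration of the same two-step principle: a single pass condenses the records into compact pre-cluster summaries (analogous in spirit to CF-tree leaves), and only those summaries, not the raw records, are then merged into the final clusters:

```python
def precluster(points, threshold):
    """Pass 1: sequential leader clustering -- assign each point to the
    nearest existing pre-cluster if within threshold, else start a new one.
    Each pre-cluster is summarized by [count, sum] so its mean is cheap."""
    summaries = []
    for p in points:
        best, best_d = None, threshold
        for s in summaries:
            d = abs(p - s[1] / s[0])
            if d < best_d:
                best, best_d = s, d
        if best is None:
            summaries.append([1, p])
        else:
            best[0] += 1
            best[1] += p
    return summaries

def merge(summaries, k):
    """Pass 2: greedily merge the two closest pre-cluster means until
    k clusters remain (a tiny stand-in for hierarchical clustering)."""
    clusters = [list(s) for s in summaries]
    while len(clusters) > k:
        pairs = [(abs(a[1] / a[0] - b[1] / b[0]), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)
        clusters[i] = [clusters[i][0] + clusters[j][0],
                       clusters[i][1] + clusters[j][1]]
        del clusters[j]
    return sorted(s[1] / s[0] for s in clusters)

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
print(merge(precluster(data, threshold=0.5), k=3))
```

The second pass only ever touches the small list of summaries, so the expensive part of the work scales with the number of pre-clusters, not the number of records.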

Using TwoStep Cluster Analysis for Customer Segmentation

In today’s competitive world, it is crucial to understand customer behavior and categorize customers based on their characteristics. This article demonstrates segmentation of a demonstration data set of bank customers in both IBM SPSS Modeler and Watson Studio. We will use the TwoStep-AS algorithm to derive the optimum number of clusters and understand the underlying customer segments based on the data provided.

Input

The data set has 2000 records, and it consists of the following fields about each customer:

  • Age
  • Months as a Customer
  • Number of Products
  • RFM Score
  • Average Balance Feed Index
  • Number of Transactions
  • Personal Debt to Equity Ratio
  • Months Current Account
  • Number of Loan Accounts
  • Customer ID
  • Has Bad Payment Record
  • Members Within Household
  • Number of Call Center Contacts
  • Gender
  • Marital Status
  • Age Youngest Child
  • Number of Workers in Household
  • Percentage White Collar Workers
  • Household Debt to Equity Ratio
  • Income
  • Weeks Since Last Offer
  • Homeowner
  • Accepted Personal Loan
  • Accepted Retention
  • Accepted Home Equity Loan
  • Accepted Credit Card
  • Annual value
  • Interested in Personal Loan
  • Interested in Retention
  • Interested in Home Equity Loan
  • Interested in Credit Card

Figure 1 shows the sample data in table format.

Figure 1: Sample data

The data set contains 12 categorical variables and 17 continuous variables. RFM Score and Customer ID are excluded because RFM Score is itself a segmentation criterion and Customer ID carries no meaning for this task. Continuous variables are standardized by default. Because we use mixed data, log-likelihood is the only available distance measure; the Euclidean option works only for continuous fields (Figure 2 #1).
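To see what the log-likelihood distance looks like, here is a plain-Python sketch based on the formula in the SPSS algorithms documentation, simplified to one continuous and one categorical field and applied to hypothetical mini-clusters. The overall variance is added inside the log, as in the documented algorithm, so that single-member clusters stay well defined:

```python
import math
from collections import Counter

def xi(cont_values, cat_values, overall_var):
    """Cluster 'cost' term for the log-likelihood distance: one continuous
    field (normal) and one categorical field (multinomial entropy),
    assumed independent of each other."""
    n = len(cont_values)
    mean = sum(cont_values) / n
    var = sum((v - mean) ** 2 for v in cont_values) / n
    cont_term = 0.5 * math.log(overall_var + var)
    counts = Counter(cat_values)
    cat_term = -sum((c / n) * math.log(c / n) for c in counts.values())
    return -n * (cont_term + cat_term)

def distance(a, b, overall_var):
    """d(A, B) = xi(A) + xi(B) - xi(A merged with B): the drop in
    log-likelihood that merging the two clusters would cause."""
    return (xi(a[0], a[1], overall_var) + xi(b[0], b[1], overall_var)
            - xi(a[0] + b[0], a[1] + b[1], overall_var))

# hypothetical mini-clusters: (continuous values, categorical values)
A = ([1.0, 1.2], ["M", "M"])
B = ([1.1, 0.9], ["M", "F"])
C = ([9.0, 9.4], ["F", "F"])
all_cont = A[0] + B[0] + C[0]
m = sum(all_cont) / len(all_cont)
overall_var = sum((v - m) ** 2 for v in all_cont) / len(all_cont)
print(distance(A, B, overall_var), distance(A, C, overall_var))
# the similar clusters A and B come out much closer than A and C
```

Because the measure is built from a likelihood, it combines the continuous and categorical contributions on a single scale, which is what makes mixed data workable.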

Regarding the outliers in our data set, we do not select the Include outlier clusters handling option (Figure 2 #2). Outliers are defined as leaves of the cluster feature tree whose number of cases is less than a specified value.

Figure 2: Measurement and outlier options

To determine the number of clusters automatically (Figure 3 #1), choose one of the following clustering methods (Figure 3 #2):

  • Use Clustering Criterion setting. Information criteria convergence is the ratio of information criteria corresponding to two current cluster solutions and the first cluster solution.
  • Distance jump. Distance jump is the ratio of distances corresponding to two consecutive cluster solutions.
  • Maximum. Combine results from the information criteria convergence method and the distance jump method to produce the number of clusters corresponding to the second jump.
  • Minimum. This is the default method. Combine results from the information criteria convergence method and the distance jump method to produce the number of clusters corresponding to the first jump. We keep the default, Minimum, because it favors the smaller number of clusters suggested by the first jump.

The information criteria convergence method uses either Schwarz's Bayesian Criterion (BIC) or the Akaike Information Criterion (AIC). For both criteria, smaller values indicate better models. We use the default option, BIC (Figure 3 #3).
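A simplified sketch of this two-stage decision, with hypothetical BIC values and an illustrative threshold (not the exact SPSS rule), looks like this in Python:

```python
def choose_k(bic, ratio_threshold=0.04):
    """Sketch of a two-stage cluster-number rule. bic[i] is the
    criterion value for k = i + 1 clusters; smaller is better.
    Stage 1: keep candidate k while the BIC improvement, relative to
    the first improvement, stays above a threshold.
    Stage 2: among the survivors, return the k with the largest 'jump',
    i.e. where going one cluster further buys much less improvement."""
    d = [bic[i] - bic[i - 1] for i in range(1, len(bic))]
    # stage 1: largest k whose relative improvement is still material
    kmax = max(k for k in range(2, len(bic) + 1)
               if abs(d[k - 2]) / abs(d[0]) >= ratio_threshold)
    # stage 2: ratio of consecutive improvements
    jumps = {k: d[k - 2] / d[k - 1] for k in range(2, kmax)}
    return max(jumps, key=jumps.get)

# hypothetical BIC values for k = 1..6; smaller is better
bic = [5000.0, 4200.0, 3700.0, 3450.0, 3390.0, 3360.0]
print(choose_k(bic))  # picks k = 4: the improvement drops sharply after 4
```

With these illustrative numbers the improvement from 4 to 5 clusters is small relative to the improvement from 3 to 4, so 4 clusters is chosen.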

TwoStep also includes a Feature Importance method, which determines how important the features (fields) are in the cluster solution. There are two options (Figure 3 #4):

  • Use Clustering Criterion setting. This is the default method and is based on the criterion that is selected in the Clustering Criterion group.
  • Effect size. Feature importance is based on effect size instead of significance values.

Figure 3: Feature Importance methods options
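As an illustration of the effect-size idea, the following Python sketch scores a continuous field by eta squared, the between-cluster share of its total variance, on hypothetical data. This illustrates the concept only; it is not the exact SPSS computation:

```python
def eta_squared(values, labels):
    """Effect size of a continuous field across clusters:
    between-cluster sum of squares / total sum of squares.
    Values near 1 mean the field separates the clusters well."""
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(l, []).append(v)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    return ss_between / ss_total

# hypothetical field values and cluster assignments
age    = [25, 27, 26, 61, 63, 60]
noise  = [5, 9, 2, 7, 1, 8]
labels = [1, 1, 1, 2, 2, 2]
print(eta_squared(age, labels), eta_squared(noise, labels))
# age separates the clusters far better than the noisy field
```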

TwoStep-AS Cluster also supports feature selection, as shown in Figure 4. You can set rules that determine when fields are excluded; for example, you can exclude fields that have numerous missing values. In addition to rules based on summary statistics, the Adaptive feature selection option runs an extra data pass to find and remove the least important fields from the final clustering solution. We select this option because there are many inputs and we want to keep only the important ones.

Figure 4: Clustering field options
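The rule-based part of feature selection can be sketched in plain Python. The rules and the 25% threshold below are illustrative, not the actual SPSS defaults:

```python
def screen_fields(records, max_missing_pct=25.0):
    """Rule-based field screening (illustrative): exclude a field if too
    many values are missing (None) or if it is constant, since a
    constant field cannot separate clusters."""
    excluded = {}
    for f in records[0].keys():
        col = [r[f] for r in records]
        missing_pct = 100.0 * sum(v is None for v in col) / len(col)
        non_missing = {v for v in col if v is not None}
        if missing_pct > max_missing_pct:
            excluded[f] = "too many missing values"
        elif len(non_missing) <= 1:
            excluded[f] = "constant field"
    return excluded

# hypothetical records echoing the field names in the data set
records = [
    {"Age": 34, "Income": None, "Homeowner": "Y"},
    {"Age": 51, "Income": None, "Homeowner": "Y"},
    {"Age": 29, "Income": None, "Homeowner": "Y"},
    {"Age": 47, "Income": 52000, "Homeowner": "Y"},
]
print(screen_fields(records))
```

Here Income is dropped for missing values and Homeowner for being constant, while Age survives to the clustering step.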

Output

The Model Specifications output (Figure 5) shows the critical input settings that were described previously. It also includes information about the final model: the number of regular and outlier clusters, and the chosen features. Four regular clusters are generated, with no outlier clusters because no outliers are detected. Seven continuous inputs and one categorical input are used by the final clusters.

Figure 5: Model Specifications

The following table (Figure 6) lists all the excluded fields and why they are dropped by the final model. Most of the inputs are excluded with the reason Failed adaptive feature selection criteria, which means that those fields have little or no potential for improving the goodness of the final model.

Figure 6: List of the excluded fields

Modeler provides the Feature Importance view (Figure 7) that shows the relative importance of each field in estimating the model.

Figure 7: Relative importance of each field

The centroids table (Figure 8) displays the mean of each continuous feature and the mode of each categorical feature in each cluster. It shows that the clusters are well separated by the continuous variables. The four clusters provide a meaningful customer segmentation:

  1. Cluster 1: Low number of transactions, number of products, average balance feed index, annual value, age of youngest child, age, and personal debt to equity ratio.
  2. Cluster 2: Medium number of transactions, number of products, average balance feed index, annual value, age of youngest child, age, and personal debt to equity ratio.
  3. Cluster 3: Low number of transactions, number of products, average balance feed index, and annual value. High age of youngest child, age, and personal debt to equity ratio.
  4. Cluster 4: High number of transactions, number of products, average balance feed index, and annual value. Medium age of youngest child, age, and personal debt to equity ratio.

Figure 8: Centroids table
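The construction of such a centroids table is easy to sketch in plain Python (hypothetical records and labels; the field names merely echo those in the data set):

```python
from collections import Counter

def centroids(records, labels, continuous, categorical):
    """Centroids table sketch: per cluster, the mean of each continuous
    field and the mode (most frequent value) of each categorical field."""
    table = {}
    for c in sorted(set(labels)):
        rows = [r for r, l in zip(records, labels) if l == c]
        row = {}
        for f in continuous:
            row[f] = sum(r[f] for r in rows) / len(rows)
        for f in categorical:
            row[f] = Counter(r[f] for r in rows).most_common(1)[0][0]
        table[c] = row
    return table

records = [
    {"Number of Transactions": 2,  "Gender": "F"},
    {"Number of Transactions": 4,  "Gender": "F"},
    {"Number of Transactions": 30, "Gender": "M"},
    {"Number of Transactions": 34, "Gender": "M"},
]
labels = [1, 1, 2, 2]
print(centroids(records, labels, ["Number of Transactions"], ["Gender"]))
```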

Using TwoStep Cluster Analysis in Watson Studio

Besides SPSS Modeler, you can also use a Watson Studio notebook to run TwoStep Cluster Analysis in Python or Scala. The following scripts are examples for a Scala notebook; the TwoStep settings are the ones described previously for Modeler. See IBM SPSS Algorithms on SPARK for details about the Scala and Python APIs (http://spss-algo.mybluemix.net/).

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.types.NumericType
import com.ibm.spss.ml.clustering.{TwoStep, TwoStepModel}
import com.ibm.spss.ml.datapreparation.Descriptives

// exclude both fields "RFM Score" and "Customer ID"
val features = dfData1.columns.filter(x => x != "RFM Score" && x != "Customer ID")
val continousFeatures = features.filter(x => dfData1.schema(x).dataType.isInstanceOf[NumericType])

// calculate summary statistics
val de = Descriptives().
    setInputFieldList(features).
    setCalcBasicSummaryStats(true)

val twoStep = TwoStep().
    setInputFieldList(features).
    setDistMeasure("LOGLIKELIHOOD").
    setMaxEntNonLeaf(8).
    setMaxEntLeaf(8).
    setMaxTreeHeight(3).
    setAutoClustering(true).
    setMaxClusterNum(15).
    setMinClusterNum(2).
    setAutoClusteringMethod("MINIMUM").
    setInformationCriterion("BIC").
    setFeatureImportanceMethod("CRITERION").
    setStandardizeFieldList(continousFeatures).
    setFeatureFiltering(true).
    setFeatureSelection(true).
    setOutlierHandling(false)

val pipeline = new Pipeline().setStages(Array(de, twoStep))
val model = pipeline.fit(dfData1)

// make predictions against the training data using the built model
val dfScored = model.transform(dfData1)
dfScored.show

val ceModel = model.asInstanceOf[PipelineModel].stages(1).asInstanceOf[TwoStepModel]


Besides making predictions for a data set by using the built model, the SPSS model can also export PMML and StatXML. PMML is the leading standard for statistical and data mining models; it contains not only the clustering information necessary for scoring, but also statistics such as feature importance. StatXML is SPSS's own XML format, which contains extra statistics. Both PMML and StatXML hold all the information needed to produce the figures shown above.

import scala.xml.XML

// export a PMML
val PMML = ceModel.toPMML
val xml = XML.loadString(PMML)

// print the PMML in a pretty format.
val p = new scala.xml.PrettyPrinter(80, 4)
println(p.format(xml))

// output features importance.
val fields = xml \\ "MiningField"
fields.foreach(x => println(x.attribute("name").get + ": " + x.attribute("importance").get))

// output final clusters
val clusters = xml \\ "Cluster"
clusters.foreach(x => println(x.attribute("name").get + ": " + x.attribute("size").get))

// export a StatXML
val statXML = ceModel.statXML
println(p.format(statXML))

Conclusion

Clustering is widely applied in various domains to explore hidden, useful patterns inside data. As data volumes grow, most data collected in the real world contains both categorical and numeric attributes, which traditional clustering algorithms cannot handle effectively. In this article, we showed that the IBM SPSS TwoStep method solves this problem easily and also determines the optimal number of clusters automatically.
#GlobalAIandDataScience
#GlobalDataScience