Using Scikit-Learn to train a customer churn model in Cloud Pak for Data 3.5

By Harris Yang posted Wed May 26, 2021 05:53 AM

Scikit-Learn (https://scikit-learn.org) is one of the most popular Python machine learning frameworks. Modern data scientists use it to train classification, regression, and clustering models, and it offers these features:

Simple and efficient tools for predictive data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license

IBM Cloud Pak for Data 3.5 comes with many open source frameworks preinstalled, including Scikit-Learn. Users can train machine learning models with Scikit-Learn in a Python notebook in a Cloud Pak for Data project. By default, IBM Cloud Pak for Data 3.5 provides Scikit-Learn 0.23 with Python 3.7, and users can also install a specific version in a customized runtime environment.
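Before training, it can help to confirm which versions the notebook runtime actually provides. The following is a minimal sketch that uses only the standard `sys` module and the `__version__` attribute of scikit-learn; the variable name `python_version` is just for illustration.

```python
import sys
import sklearn

# Report the interpreter and scikit-learn versions of the active runtime.
python_version = "%d.%d" % (sys.version_info.major, sys.version_info.minor)
print("Python:", python_version)
print("scikit-learn:", sklearn.__version__)
```

In the default runtime described above, this would be expected to report Python 3.7 and a 0.23.x release of scikit-learn; a customized environment may report something else.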

This blog shows the basic steps to train a Scikit-Learn customer churn model in a Python notebook in IBM Cloud Pak for Data 3.5.

1. Create an analytics project in IBM Cloud Pak for Data (this blog uses churn-analysis as the project name)

2. Create a notebook with the Default Python 3.7 runtime

3. Import the data into the notebook

# Import CUST_SUM.csv dataset

import pandas as pd
df_data_2 = pd.read_csv('/project_data/data_asset/CUST_SUM.csv')
df_data_2.head()
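After loading the data, a quick look at column types, missing values, and the share of churners can catch problems before training. The sketch below is an illustration only: the helper `summarize` and the `demo` rows are hypothetical and not part of the original notebook, and only a few of the CUST_SUM.csv columns are mimicked.

```python
import pandas as pd

def summarize(df, target="CHURN"):
    """Return dtypes, missing-value counts, and target class shares."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }

# Hypothetical stand-in rows; in the notebook, pass df_data_2 instead.
demo = pd.DataFrame({
    "AGE": [34, 51, 27, 45],
    "SEX": ["F", "M", "F", None],
    "STATE": ["NY", "CA", "NY", "TX"],
    "CHURN": [0, 1, 0, 0],
})

summary = summarize(demo)
print(summary["missing"])      # missing values per column
print(summary["class_share"])  # fraction of rows per CHURN value
```

In the notebook, `summarize(df_data_2)` would give the same overview for the real dataset, assuming the target column is named CHURN as in the next step.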

4. Select features and split the data into training and test sets

from sklearn.model_selection import train_test_split

features = ['AGE', 'ACTIVITY', 'EDUCATION', 'NEGTWEETS', 'INCOME', 'SEX', 'STATE']
X, y = df_data_2.loc[:, features], df_data_2.loc[:, 'CHURN']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("The number of training data is ", X_train.shape[0])
print("The number of test data is ", X_test.shape[0])
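Churn data is often imbalanced, so a purely random split can leave the test set with a different churn rate than the training set. One possible variation, not used in the original notebook, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical labels (`X_demo`, `y_demo`) rather than the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 non-churners (0) and 20 churners (1).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("churn share in train:", y_tr.mean())
print("churn share in test:", y_te.mean())
```

Applied to the notebook, this would amount to adding `stratify=y` to the existing call, assuming every CHURN class has enough rows to appear in both splits.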

5. Train a scikit-learn logistic regression model

# Train a logistic regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics

# One-hot encode SEX and STATE; pass the remaining columns through unchanged
ctf = ColumnTransformer([("sex_encoder", OneHotEncoder(), ["SEX"]),
                         ("state_encoder", OneHotEncoder(), ["STATE"])],
                       remainder='passthrough')

# Try 10 values of the regularization strength C with 3-fold cross-validation
logreg_pipe = Pipeline([('ctf', ctf), ('logreg_cv', LogisticRegressionCV(Cs=10, cv=3))])
logreg_pipe.fit(X_train, y_train)
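With its default settings, `OneHotEncoder` raises an error when it is asked to transform a category it did not see during fitting, for example a STATE value that appears only in the test data. A possible safeguard, which is an assumption here and not part of the original notebook, is `handle_unknown="ignore"`, which encodes unseen categories as all zeros. The frames `train` and `unseen` below are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "TX" never appears in the rows used for fitting.
train = pd.DataFrame({"AGE": [30, 40, 50], "STATE": ["NY", "CA", "NY"]})
unseen = pd.DataFrame({"AGE": [35], "STATE": ["TX"]})

ctf_demo = ColumnTransformer(
    [("state_encoder", OneHotEncoder(handle_unknown="ignore"), ["STATE"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
ctf_demo.fit(train)

# The unseen state becomes zeros in both state columns; AGE passes through.
encoded = ctf_demo.transform(unseen)
print(encoded)
```

Whether silently ignoring unknown categories is appropriate depends on the data; for a fixed list of states it mainly protects against rare values that land only in the test split.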

6. Evaluate the model with the test data and plot the ROC curve

import matplotlib.pyplot as plt
%matplotlib inline
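# Draw an ROC curve to evaluate the model
# Predicted class labels (used by the metrics in step 7)
dtest_predictions = logreg_pipe.predict(X_test)
# Predicted probability of the positive class (CHURN = 1)
dtest_predprob = logreg_pipe.predict_proba(X_test)[:,1]

fpr, tpr, thresholds = metrics.roc_curve(y_test, dtest_predprob, pos_label=1)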

plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC', fontsize=18)
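The `thresholds` array returned by `roc_curve` is not used in the plot, but it can be used to choose a decision threshold. The sketch below picks the threshold that maximizes Youden's J statistic (TPR minus FPR); this is one common heuristic, not something the original notebook does, and the labels and scores are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted churn probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.9])

fpr, tpr, thresholds = metrics.roc_curve(y_true, scores, pos_label=1)

# Youden's J = TPR - FPR; the largest J marks the point farthest above
# the diagonal of the ROC plot.
j_scores = tpr - fpr
best_index = int(np.argmax(j_scores))
best_threshold = thresholds[best_index]

auc_value = metrics.roc_auc_score(y_true, scores)
print("best threshold:", best_threshold)
print("AUC:", auc_value)
```

In the notebook, the same lines could be applied to `y_test` and `dtest_predprob`; the chosen threshold would then replace the default 0.5 cut-off that `predict` uses.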

7. Other scikit-learn model evaluation metrics

from sklearn import metrics

print("about this model--------------------")
print("accuracy_score: %.4g" % metrics.accuracy_score(y_test.values, dtest_predictions))
print("precision_score:%f" % metrics.precision_score(y_test.values, dtest_predictions))
print("recall_score:%f" % metrics.recall_score(y_test.values, dtest_predictions))
print("AUC score(test data): %f" % metrics.roc_auc_score(y_test, dtest_predprob))
# Compute the confusion matrix once and unpack its four counts
cm = metrics.confusion_matrix(y_test.values, dtest_predictions)
print("confusion matrix: \n", cm)
tn, fp, fn, tp = cm.ravel()
print("TP:%d,FP:%d,FN:%d,TN:%d" % (tp, fp, fn, tn))
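The scores printed above all follow from the four confusion matrix counts. As a worked check on hypothetical labels (`y_true` and `y_pred` below are made up for illustration), the sketch recomputes accuracy, precision, and recall by hand and compares them with the scikit-learn functions.

```python
from sklearn import metrics

# Hypothetical true and predicted labels (1 = churn, 0 = no churn).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

# Accuracy: correct predictions over all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision: of the customers predicted to churn, how many did.
precision = tp / (tp + fp)
# Recall: of the customers who churned, how many were found.
recall = tp / (tp + fn)

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

For a churn model, recall is often the more important of the two, since a missed churner usually costs more than an unnecessary retention offer; that trade-off depends on the business case.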


Permalink

https://community.ibm.com/community/user/blogs/harris-yang1/2021/05/26/scikit-learn-churn-model-cpd35
