Cloud Pak for Data

Using Scikit-Learn to train a customer churn model in Cloud Pak for Data 3.5

By Harris Yang posted Wed May 26, 2021 05:53 AM

  
Scikit-Learn (https://scikit-learn.org) is one of the most popular Python machine learning frameworks. Data scientists use it to train classification, regression, and clustering models. Its main features are:
  • Simple and efficient tools for predictive data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license

IBM Cloud Pak for Data 3.5 comes with many open source frameworks preinstalled, including Scikit-Learn, so users can train machine learning models with Scikit-Learn in a Python notebook inside a Cloud Pak for Data project. By default, Cloud Pak for Data 3.5 provides Scikit-Learn 0.23 with Python 3.7. Users can also install a specific version in a customized runtime environment.
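
Because the preinstalled versions depend on the runtime environment, it can help to confirm them from inside the notebook before training. The following is a minimal sketch that uses only standard version attributes; the variable names are illustrative and the values printed depend on your environment.

```python
import sys
import sklearn

# Report the interpreter and scikit-learn versions of the current runtime.
python_version = "%d.%d" % (sys.version_info.major, sys.version_info.minor)
sklearn_version = sklearn.__version__

print("Python:", python_version)
print("scikit-learn:", sklearn_version)
```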

This blog shows the basic steps to train a Scikit-Learn customer churn model in a Python notebook in IBM Cloud Pak for Data 3.5.
cpd35-scikit-learn-1.jpg

1. Create an analytics project in IBM Cloud Pak for Data (this blog uses churn-analysis as the example)

2. Create a notebook with the Default Python 3.7 runtime
cpd35-scikit-learn-2.png

3. Import the data into the notebook
# Import the CUST_SUM.csv dataset from the project's data assets
import pandas as pd

df_data_2 = pd.read_csv('/project_data/data_asset/CUST_SUM.csv')
df_data_2.head()

4. Select features and split the data into training and testing sets
from sklearn.model_selection import train_test_split

features = ['AGE', 'ACTIVITY', 'EDUCATION', 'NEGTWEETS', 'INCOME', 'SEX', 'STATE']
X, y = df_data_2.loc[:, features], df_data_2.loc[:, 'CHURN']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("The number of training data is ", X_train.shape[0])
print("The number of test data is ", X_test.shape[0])
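
Churn datasets are often imbalanced, and a plain random split can leave the test set with a different churn rate than the training set. As a hedged variation on the split above, the sketch below passes `stratify` to `train_test_split` so both sets keep the same class ratio. The toy DataFrame is hypothetical and only stands in for CUST_SUM.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for CUST_SUM.csv: 20 rows, 25% churners.
toy = pd.DataFrame({
    "AGE": list(range(20, 40)),
    "CHURN": [1] * 5 + [0] * 15,
})
X_toy, y_toy = toy[["AGE"]], toy["CHURN"]

# stratify=y_toy preserves the churn ratio in both the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("train churn rate:", y_tr.mean())
print("test churn rate:", y_te.mean())
```

In the notebook, the same idea would amount to adding `stratify=y` to the existing call.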

5. Train a scikit-learn logistic regression model
# Train a logistic regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics

# One-hot encode the categorical columns; pass the remaining columns through unchanged
ctf = ColumnTransformer([("sex_encoder", OneHotEncoder(), ["SEX"]),
                         ("state_encoder", OneHotEncoder(), ["STATE"])],
                        remainder='passthrough')

# LogisticRegressionCV tries 10 regularization strengths (Cs=10) with 3-fold cross-validation
logreg_pipe = Pipeline([('ctf', ctf), ('logreg_cv', LogisticRegressionCV(Cs=10, cv=3))])
logreg_pipe.fit(X_train, y_train)
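
One thing to watch with one-hot encoding after a random split: by default, `OneHotEncoder` raises an error at prediction time if it meets a category it did not see during `fit` (for example, a STATE value that appears only in the test data). The sketch below shows one way to guard against that with `handle_unknown="ignore"`. It is an illustration rather than the blog's exact pipeline: the rows are made up, and it uses plain `LogisticRegression` instead of `LogisticRegressionCV` so it can run on a handful of samples.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; column names follow the blog, values are made up.
train = pd.DataFrame({
    "AGE": [25, 40, 33, 58, 47, 29],
    "SEX": ["F", "M", "F", "M", "F", "M"],
    "STATE": ["CA", "NY", "CA", "TX", "NY", "TX"],
})
labels = [0, 1, 0, 1, 1, 0]

# "WA" never appears in the training rows.
unseen = pd.DataFrame({"AGE": [36], "SEX": ["F"], "STATE": ["WA"]})

# handle_unknown="ignore" encodes unseen categories as all zeros.
safe_ctf = ColumnTransformer(
    [("cat_encoder", OneHotEncoder(handle_unknown="ignore"), ["SEX", "STATE"])],
    remainder="passthrough",
)
safe_pipe = Pipeline([("ctf", safe_ctf), ("logreg", LogisticRegression(max_iter=1000))])
safe_pipe.fit(train, labels)

# Prediction succeeds even though the state was not seen during training.
proba = safe_pipe.predict_proba(unseen)
print(proba.shape)
```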

6. Evaluate the model with testing data and plot ROC curve
import matplotlib.pyplot as plt
%matplotlib inline

# Predict class labels and the probability of the second class (column 1 of predict_proba)
dtest_predictions = logreg_pipe.predict(X_test)
dtest_predprob = logreg_pipe.predict_proba(X_test)[:, 1]

# Draw an ROC curve to evaluate the model
fpr, tpr, thresholds = metrics.roc_curve(y_test, dtest_predprob, pos_label=1)

plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC', fontsize=18)

cpd35-scikit-learn-3.png
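
To make the ROC curve easier to read, it can help to run `roc_curve` on a tiny hand-checkable example. The sketch below uses four made-up samples (not the churn data) and computes the area under the curve with `metrics.auc`.

```python
from sklearn import metrics

# Four made-up samples: true labels and predicted probabilities of class 1.
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the decision threshold and records one (FPR, TPR) point per step.
fpr_toy, tpr_toy, thr_toy = metrics.roc_curve(y_true_toy, y_score_toy, pos_label=1)
auc_toy = metrics.auc(fpr_toy, tpr_toy)

for f, t in zip(fpr_toy, tpr_toy):
    print("FPR=%.2f  TPR=%.2f" % (f, t))
print("AUC = %.2f" % auc_toy)
```

Here three of the four (positive, negative) pairs are ranked correctly by the scores, which is why the AUC comes out as 0.75.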


7. Other scikit-learn model evaluation metrics
from sklearn import metrics

print("about this model--------------------")
print("accuracy_score: %.4g" % metrics.accuracy_score(y_test.values, dtest_predictions))
print("precision_score: %f" % metrics.precision_score(y_test.values, dtest_predictions))
print("recall_score: %f" % metrics.recall_score(y_test.values, dtest_predictions))
print("AUC score (test data): %f" % metrics.roc_auc_score(y_test, dtest_predprob))

# For a binary problem, confusion_matrix(...).ravel() returns the counts in the order TN, FP, FN, TP
cm = metrics.confusion_matrix(y_test.values, dtest_predictions)
print("confusion matrix: \n", cm)
tn, fp, fn, tp = cm.ravel()
print("TP:%d,FP:%d,FN:%d,TN:%d" % (tp, fp, fn, tn))
cpd35-scikit-learn-4.png
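
The precision and recall scores follow directly from the confusion matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below checks this on four made-up predictions (not the churn data); the `_toy` names are illustrative only.

```python
from sklearn import metrics

# Made-up labels and predictions for a binary problem.
y_true_toy = [0, 1, 0, 1]
y_pred_toy = [1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn_toy, fp_toy, fn_toy, tp_toy = metrics.confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Recompute precision and recall by hand from the counts.
precision_toy = tp_toy / (tp_toy + fp_toy)
recall_toy = tp_toy / (tp_toy + fn_toy)

print("precision: %f (sklearn: %f)" % (precision_toy, metrics.precision_score(y_true_toy, y_pred_toy)))
print("recall:    %f (sklearn: %f)" % (recall_toy, metrics.recall_score(y_true_toy, y_pred_toy)))
```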

