Using Scikit-Learn to train a customer churn model in Cloud Pak for Data 3.5

Scikit-Learn (https://scikit-learn.org) is one of the most popular Python machine learning frameworks. Data scientists use it to train classification, regression, and clustering models, and it offers these features:
- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
IBM Cloud Pak for Data 3.5 comes with many open source frameworks preinstalled, including Scikit-Learn, so users can train machine learning models with Scikit-Learn in a Python notebook inside a Cloud Pak for Data project. By default, it provides Scikit-Learn 0.23 with Python 3.7; users can also install a specific version in a customized runtime environment.

This blog shows the basic steps to train a Scikit-Learn customer churn model in a Python notebook in IBM Cloud Pak for Data 3.5.
1. Create an analytics project in IBM Cloud Pak for Data (this blog uses churn-analysis as the project name)
2. Create a notebook with Default Python 3.7 runtime
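Once the notebook is open, it can be worth confirming which versions the kernel actually provides, especially when a customized environment is in use. A minimal sketch; the helper name `runtime_versions` is hypothetical and only wraps standard attributes:

```python
import sys
import sklearn

def runtime_versions():
    """Return the Python (major.minor) and scikit-learn versions of the current kernel."""
    python_version = "%d.%d" % (sys.version_info.major, sys.version_info.minor)
    return python_version, sklearn.__version__

python_version, sklearn_version = runtime_versions()
print("Python:", python_version)
print("scikit-learn:", sklearn_version)
```

With the default runtime described above, this would be expected to report Python 3.7 and a 0.23.x release of scikit-learn.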
3. Import the data into the notebook
# Import CUST_SUM.csv dataset
import pandas as pd
df_data_2 = pd.read_csv('/project_data/data_asset/CUST_SUM.csv')
df_data_2.head()
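Before modeling, a quick look at missing values and the share of churned customers is often useful. The sketch below runs on a few hypothetical rows standing in for CUST_SUM.csv; only the column names come from this blog, the values are made up, and it assumes CHURN is coded as 0/1:

```python
import pandas as pd

# Hypothetical rows; the real data would come from df_data_2 in the notebook.
sample = pd.DataFrame({
    "AGE": [34, 51, 27, 45, 38, 62],
    "INCOME": [52000, 87000, None, 61000, 45000, 73000],
    "SEX": ["F", "M", "F", "M", "F", "M"],
    "STATE": ["CA", "NY", "TX", "CA", "NY", "TX"],
    "CHURN": [0, 1, 0, 0, 1, 0],
})

# Count missing values per column.
missing_per_column = sample.isna().sum()

# With 0/1 labels, the mean of CHURN is the fraction of churned customers.
churn_rate = sample["CHURN"].mean()

print(missing_per_column)
print("Churn rate: %.2f" % churn_rate)
```

In the notebook, the same two calls could be applied to `df_data_2` directly.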
4. Select features and split the data into training and test sets
from sklearn.model_selection import train_test_split
features = ['AGE', 'ACTIVITY', 'EDUCATION', 'NEGTWEETS', 'INCOME', 'SEX', 'STATE']
X, y = df_data_2.loc[:, features], df_data_2.loc[:, 'CHURN']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("The number of training data is ", X_train.shape[0])
print("The number of test data is ", X_test.shape[0])
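The split above is purely random, so the churn ratio in the test set can differ from the ratio in the full data. A minimal sketch of a stratified split follows, using synthetic data with an assumed 20% churn rate; the `_demo` names are hypothetical and not part of the blog's notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 customers, 20 of them churned (label 1).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("Churn share in training set:", y_tr_demo.mean())
print("Churn share in test set:", y_te_demo.mean())
```

In the notebook, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.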
5. Train a scikit-learn logistic regression model
# Train a logistic regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn import metrics
from sklearn.model_selection import cross_validate
ctf = ColumnTransformer([("sex_encoder", OneHotEncoder(), ["SEX"]),
                         ("state_encoder", OneHotEncoder(), ["STATE"])],
                        remainder='passthrough')
logreg_pipe = Pipeline([('ctf', ctf), ('logreg_cv', LogisticRegressionCV(Cs=10, cv=3))])
logreg_pipe.fit(X_train, y_train)
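One thing to watch with the pipeline above: by default, `OneHotEncoder()` raises an error when it meets a category at prediction time that it did not see during fitting, for example a STATE value that appears only in the test set. The sketch below shows one way to guard against that with `handle_unknown="ignore"`. It uses made-up rows and, to keep the toy example small, a plain `LogisticRegression` rather than the `LogisticRegressionCV` used in this blog:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows; only CA and NY are seen during fit.
train_demo = pd.DataFrame({
    "AGE": [25, 40, 33, 58, 47, 29],
    "STATE": ["CA", "NY", "CA", "NY", "CA", "NY"],
})
labels_demo = [0, 1, 0, 1, 1, 0]

# A new customer from a state the encoder has never seen.
unseen_demo = pd.DataFrame({"AGE": [36], "STATE": ["TX"]})

encoder_demo = ColumnTransformer(
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    [("state_encoder", OneHotEncoder(handle_unknown="ignore"), ["STATE"])],
    remainder="passthrough",
)
demo_pipe = Pipeline([("ctf", encoder_demo),
                      ("logreg", LogisticRegression(max_iter=1000))])
demo_pipe.fit(train_demo, labels_demo)

# Two one-hot columns (CA, NY) plus the passthrough AGE column.
encoded_unseen = demo_pipe.named_steps["ctf"].transform(unseen_demo)
prediction_demo = demo_pipe.predict(unseen_demo)
print("Encoded shape:", encoded_unseen.shape)
print("Prediction:", prediction_demo[0])
```

Whether ignoring unknown categories is appropriate depends on the data; it avoids the error but gives the model no information about the new category.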
6. Evaluate the model on the test data and plot the ROC curve
import matplotlib.pyplot as plt
%matplotlib inline
# Draw an ROC Curve to evaluate the model
dtest_predictions = logreg_pipe.predict(X_test)
dtest_predprob = logreg_pipe.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, dtest_predprob, pos_label=1)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC', fontsize=18)
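The ROC curve can be summarized by the area under it (AUC): 0.5 corresponds to random guessing and 1.0 to a perfect ranking. As a small illustration on made-up labels and scores (not the churn data), the area computed from the curve points with `metrics.auc` matches `metrics.roc_auc_score`:

```python
from sklearn import metrics

# Hypothetical labels and predicted probabilities for four customers.
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve, then the trapezoidal area under them.
fpr_demo, tpr_demo, _ = metrics.roc_curve(y_true_demo, scores_demo, pos_label=1)
auc_from_curve = metrics.auc(fpr_demo, tpr_demo)

# The same quantity computed directly from labels and scores.
auc_direct = metrics.roc_auc_score(y_true_demo, scores_demo)

print("AUC from curve: %.2f" % auc_from_curve)  # 0.75
print("AUC direct:     %.2f" % auc_direct)      # 0.75
```

Here three of the four positive/negative pairs are ranked correctly, which gives the AUC of 0.75.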

7. Other scikit-learn model evaluation metrics
from sklearn import metrics
print("about this model--------------------")
print("accuracy_score: %.4g" % metrics.accuracy_score(y_test.values, dtest_predictions))
print("precision_score:%f" % metrics.precision_score(y_test.values, dtest_predictions))
print("recall_score:%f" % metrics.recall_score(y_test.values, dtest_predictions))
print("AUC score(test data): %f" % metrics.roc_auc_score(y_test, dtest_predprob))
print("confusion matrix: \n",metrics.confusion_matrix(y_test.values, dtest_predictions))
tn, fp, fn, tp = metrics.confusion_matrix(y_test.values, dtest_predictions).ravel()
print("TP:%d,FP:%d,FN:%d,TN:%d" % (tp, fp, fn, tn))
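The scores printed above all follow from the four confusion-matrix counts. A short sketch on hypothetical labels and predictions (not the churn data) shows the relationship; the `_demo` names are illustrative only:

```python
from sklearn import metrics

# Hypothetical true labels and predictions for eight customers.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn_demo, fp_demo, fn_demo, tp_demo = metrics.confusion_matrix(
    y_true_demo, y_pred_demo
).ravel()

# Precision: of the customers predicted to churn, how many actually churned.
precision_demo = tp_demo / (tp_demo + fp_demo)
# Recall: of the customers who churned, how many the model caught.
recall_demo = tp_demo / (tp_demo + fn_demo)
# Accuracy: share of all predictions that were correct.
accuracy_demo = (tp_demo + tn_demo) / (tp_demo + tn_demo + fp_demo + fn_demo)

print("precision: %.2f, recall: %.2f, accuracy: %.2f"
      % (precision_demo, recall_demo, accuracy_demo))
```

For churn use cases, recall is often the score to watch, since a missed churner (a false negative) usually costs more than an unnecessary retention offer.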
