
Credit Card Fraud Detection

  
Introduction
In this blog, we discuss the challenges credit card companies face with credit card fraud and how machine learning and deep learning frameworks help detect it.

Business Use Case
Fraudulent credit card transactions are one of the biggest concerns for credit card companies. Detecting potentially fraudulent transactions in real time helps ensure customers are not charged for transactions they did not make. The goal is to build a model that can detect whether a transaction is fraudulent in real time.

TensorFlow Docker Image
To minimise the effort of setting up the environment on the IBM LinuxONE platform, a pre-built TensorFlow Docker image is available for a quick try at the location here.

Deep Learning Neural Network (DNN) Model Building
Importing Data Science Libraries
For this use case, several libraries such as pandas, scikit-learn, matplotlib, seaborn and imblearn are used for loading, processing, visualising and oversampling the data. The pandas library is used to load the data into a DataFrame object. The matplotlib and seaborn libraries are used for plotting. The sklearn library is used for data processing, model building, and model evaluation. Lastly, the imblearn library is used to apply the SMOTE oversampling technique to balance the data.
import pandas as pd
import numpy as np 
import tensorflow.compat.v1 as tf
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from imblearn.over_sampling import SMOTE
import seaborn as sns
import keras
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import StandardScaler
from collections import Counter

Exploratory Data Analysis

The dataset used for credit card fraud detection with a neural network is available here: Credit Card Fraud Detection Data. It contains transactions made with credit cards in September 2013 by European cardholders. The dataset covers transactions that occurred over two days, in which 492 frauds were detected out of 284,807 transactions.

The dataset contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'.
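Assuming the CSV from the link above has been downloaded locally (the file name creditcard.csv is the Kaggle default and is used here only as an assumption), the data can be loaded into a pandas DataFrame and the class distribution inspected:
# Load the Kaggle CSV into a DataFrame (file name assumed)
df = pd.read_csv('creditcard.csv')
# Quick look at the data size and the class distribution
print(df.shape)
print(df['Class'].value_counts())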

Before using the data to train the model, it is better to understand the data we are dealing with. This step is known as exploratory data analysis and helps determine, for example, whether the data is balanced or imbalanced.
import matplotlib.pyplot as plt
labels = 'Fraud', 'Normal'
sizes = [len(df[df.Class == 1]), len(df[df.Class==0])]
explode = (0.1, 0)
colors = ['r','c']
fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, autopct='%3.1f%%', colors=colors)
ax.axis('equal')
plt.show()


The dataset is highly imbalanced, with fraudulent transactions representing less than 0.2% of all transactions. Imbalanced datasets can bias learning models and hence should be handled accordingly.
Figure: Percentage of fraudulent vs. normal transactions


Data Visualization
Data visualization is an important step before building a model: a visual summary makes it easier to understand the data and to see relationships within it. Let’s see how time compares across fraudulent and normal transactions.
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 50

ax1.hist(df.Time[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(df.Time[df.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Number of Transactions')
plt.show()
Figure: Number of transactions over time for fraud and normal classes

The 'Time' feature looks similar across both types of transactions. You could argue that fraudulent transactions are more uniformly distributed, while normal transactions have a cyclical distribution. This could make it easier to detect a fraudulent transaction during an 'off-peak' time.

Let's see how the ‘Amount’ compares across fraudulent and normal transactions.
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 30

ax1.hist(df.Amount[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(df.Amount[df.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show()

Figure: Number of transactions by amount for fraud and normal classes

Most transactions are small amounts, less than $100.

Let's compare Time with Amount and see if we can learn anything new.
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,6))

ax1.scatter(df.Time[df.Class == 1], df.Amount[df.Class == 1])
ax1.set_title('Fraud')

ax2.scatter(df.Time[df.Class == 0], df.Amount[df.Class == 0])
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount')
plt.show()

Nothing too useful here.

Normalize Features
We have a matrix where each row is a sample and each column is a feature. The 'Amount' column is not on the same scale as the anonymized features. After applying StandardScaler(), the 'Amount' feature is rescaled to zero mean and unit variance, which makes it easier to learn the weights. The original 'Amount' and 'Time' columns are then dropped, as they are not needed for model building.
from sklearn.preprocessing import StandardScaler
df['normalizedAmt'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1,1))
df = df.drop(['Amount'],axis=1)
df = df.drop(['Time'],axis=1)
df.head()

Data Pre-Processing
Data pre-processing involves preparing the dataset to train the model. The data pre-processing step is crucial and should transform the data in a way that can be processed by the selected algorithm. As this dataset does not contain any missing values or categorical data, most data pre-processing steps are not needed.

The train-test split divides the dataset into a training set and testing set. The training set is used to train the model while the testing set is used to evaluate the model. The test size of 0.3 indicates that 30% of the dataset is chosen to be the testing set. Hence, the training set contains 199364 records while the testing set contains 85443 records.
y = df.iloc[:, df.columns == 'Class']
X = df.iloc[:, df.columns != 'Class']
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=17)
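A quick check of the resulting shapes confirms the split sizes mentioned above:
# Confirm the split sizes (expected: 199364 training rows and 85443 testing rows)
print(X_train.shape, X_test.shape)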

Imbalanced Data Technique
Models generally do not perform well on imbalanced datasets. Imbalanced classes are a common problem in classification tasks, where only a few samples of the minority class are available for the model to learn from. Various techniques are available to balance the data.
sm = SMOTE(random_state=2)
# Generate the oversampled training data (fit_resample replaces the older fit_sample API)
res_x, res_y = sm.fit_resample(X_train, y_train)

We first tried balancing the classes by passing per-class weights to Keras through the class_weight parameter, so that the model pays more attention to minority-class samples, but the results were not good.
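For reference, a minimal sketch of that first attempt, using the class_weight utility imported earlier to compute per-class weights and passing them to Keras through the class_weight parameter of fit(); the call shown here is illustrative and is not the final training step used below:
# Compute weights inversely proportional to the class frequencies in the training set
weights = class_weight.compute_class_weight(class_weight='balanced',
                                            classes=np.unique(y_train.values.ravel()),
                                            y=y_train.values.ravel())
class_weights = dict(enumerate(weights))
print(class_weights)
# These weights would then be passed to the model via
# classifier.fit(X_train, y_train, class_weight=class_weights, ...)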

We then chose the SMOTE technique to balance the data. SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and creating a new sample at a point along that line. In other words, a randomly selected neighbour is chosen, and a synthetic example is created at a randomly selected point between the two examples in feature space. We applied SMOTE to the training set to obtain a balanced training set.
After applying the SMOTE oversampling technique, the training set contains 398,036 records.
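A quick sanity check with the Counter imported earlier shows the class distribution before and after oversampling:
# Class distribution before and after SMOTE (both classes should be equal afterwards)
print('Before SMOTE:', Counter(np.ravel(y_train)))
print('After SMOTE :', Counter(np.ravel(res_y)))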

Training and Evaluation of the Model

With our training and test data set up, we are now ready to build our model using Keras to create an artificial neural network. We created a Sequential model and added its layers one by one. We started by defining the input and output layers: the input layer takes the 29 features that will be fed into the model, and the output is a single value, i.e. the prediction of a normal or fraudulent transaction.

Dense is used to specify a fully connected layer, and model.add() adds a layer to the neural network. The network contains three Dense layers with sigmoid activation functions, ending in a sigmoid output unit that outputs the probability that a given transaction is fraudulent. The model is trained for a maximum of 100 epochs using the Adam optimizer and a binary cross-entropy loss function, with accuracy as the evaluation metric. The model is saved after training, and the saved model can be loaded from the same environment or from another environment to make predictions.
# Initialising the ANN
classifier = Sequential()
classifier.add(Dense(units = 20, kernel_initializer = 'random_uniform', activation = 'sigmoid', input_dim = 29))
classifier.add(Dense(units = 20, kernel_initializer = 'random_uniform', activation = 'sigmoid'))
classifier.add(Dense(units = 1, kernel_initializer = 'random_uniform', activation = 'sigmoid'))
# Compiling and fitting the ANN to the training set
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
history = classifier.fit(res_x, res_y, batch_size = 32, validation_data=(X_test, y_test), epochs = 100)
# Save the model
classifier.save('my_model.pb')
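As a sketch of reloading the saved model in the same or a different environment (the path must match whatever format classifier.save() produced above):
# Reload the saved model and use it for prediction
from keras.models import load_model
restored_model = load_model('my_model.pb')
print(restored_model.predict(X_test[:5]))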

The evaluate function predicts the output for the given input and computes the metric specified when compiling the model, i.e. accuracy.
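A minimal sketch of that step, assuming a 0.5 decision threshold on the sigmoid output to obtain the y_pred labels used in the metrics below:
# Evaluate loss and accuracy on the held-out test set
score = classifier.evaluate(X_test, y_test, batch_size=32)
print('Test loss: {}, Test accuracy: {}'.format(score[0], score[1]))
# Convert predicted probabilities to class labels (0.5 threshold assumed)
y_pred = (classifier.predict(X_test) > 0.5).astype(int)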

Confusion Matrix
A confusion matrix allows visualizing the performance of a trained model. Let's look at the performance of our classifier, whose output falls into two classes.
# Compute classification metrics on the test set
from sklearn.metrics import classification_report, accuracy_score  
from sklearn.metrics import precision_score, recall_score 
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix 
fraud = df[df.Class == 1]
n_outliers = len(fraud) 
n_errors = (y_pred != y_test).sum()
acc = accuracy_score(y_test, y_pred) 
print("The accuracy is {}".format(acc)) 
  
prec = precision_score(y_test, y_pred) 
print("The precision is {}".format(prec)) 
  
rec = recall_score(y_test, y_pred) 
print("The recall is {}".format(rec)) 
  
f1 = f1_score(y_test, y_pred) 
print("The F1-Score is {}".format(f1)) 

LABELS = ['Normal', 'Fraud'] 
conf_matrix = confusion_matrix(y_test, y_pred) 
plt.figure(figsize =(12, 12)) 
sns.heatmap(conf_matrix, xticklabels = LABELS,  
            yticklabels = LABELS, annot = True, fmt ="d"); 
plt.title("Confusion matrix") 
plt.ylabel('True class') 
plt.xlabel('Predicted class') 
plt.show()
Figure: Confusion matrix heatmap (true vs. predicted class)

As seen from the confusion matrix, the model was correctly able to classify 85218 records as valid and 119 records as fraudulent. However, it incorrectly identified a valid transaction as a fraudulent transaction 79 times and incorrectly identified a fraudulent transaction as a valid transaction 27 times. 
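As a quick cross-check, the fraud-class precision and recall follow directly from these counts:
# Precision and recall derived from the confusion matrix counts above
tp, fp, fn = 119, 79, 27
print('Precision:', tp / (tp + fp))  # ~0.60
print('Recall:', tp / (tp + fn))     # ~0.82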

Conclusion
In this blog, we presented the “Credit Card Fraud Detection” use case and walked through the steps used for basic exploratory data analysis and visualization to understand the data, as well as normalization and techniques for handling imbalanced data. Finally, we discussed the training and evaluation of the model and the confusion matrix used to measure its performance.

References
Data Source: Credit card transactional data (https://www.kaggle.com/agpickersgill/credit-card-fraud-detection/data)

Blog Authors
Pavani Vemuri 
Ajay Victor
Pradipta Ghosh
Gummadi Ravi
Chandra Shekhar Reddy Potula