Credit Card Fraud Prediction

A walkthrough of the modeling process

Exploratory Data Analysis (EDA)

The first step is to explore our data set to see if we can gain any insight about the data. The obvious first choice (and, since the remaining features are anonymized, essentially the only interpretable one) is to look at "Amount".

transaction amount histogram

From the histograms above, we see essentially no association between transaction amount and fraudulent credit card transactions. Repeating this analysis for all 28 remaining features, however, would be time-consuming and impractical.
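The per-class amount comparison above can be sketched as follows. This is a minimal sketch: the synthetic dataframe (column names from the Kaggle credit card fraud dataset, with `Class` = 1 for fraud) stands in for the real data, and we compute histogram counts directly rather than rendering the plot.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: a heavy-tailed "Amount" column
# and a rare binary "Class" label (1 = fraud). Illustrative only.
rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({
    "Amount": rng.exponential(scale=88.0, size=n),
    "Class": rng.binomial(1, 0.002, size=n),
})

# Compare the amount distribution for fraud vs. non-fraud on shared bins,
# so the two histograms are directly comparable.
bins = np.histogram_bin_edges(df["Amount"], bins=30)
fraud_counts, _ = np.histogram(df.loc[df["Class"] == 1, "Amount"], bins=bins)
legit_counts, _ = np.histogram(df.loc[df["Class"] == 0, "Amount"], bins=bins)

print("fraud bin counts:", fraud_counts[:5])
print("legit bin counts:", legit_counts[:5])
```

Using shared bin edges for both classes is the design choice that matters here; per-class automatic binning would make the two histograms hard to compare.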

Principal Component Analysis (PCA)

PCA provides us with multiple benefits:

  • Dimensionality reduction (reduced model complexity)
  • Decreased overfitting
  • Handles multicollinearity

The Scree Plot and Cumulative Explained Variance plots below show how much of the explained variance each principal component accounts for.

Scree Plot and Cumulative Explained Variance

The plots show that the first principal component ("PC1") accounts for 33% of the explained variance; each subsequent component accounts for less than 11%. Because "PC1" carries most of the explained variance, we next want to find which features correlate most strongly with "PC1".
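The values behind the scree and cumulative-variance plots can be computed with scikit-learn. A minimal sketch, assuming standardized inputs; the correlated synthetic matrix below stands in for the real 28 features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated feature matrix standing in for the 28 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 28)) @ rng.normal(size=(28, 28))

# PCA assumes centered (and usually scaled) inputs, hence the scaler.
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

evr = pca.explained_variance_ratio_  # per-component values (scree plot)
cum = np.cumsum(evr)                 # cumulative explained variance
print(f"PC1 explains {evr[0]:.1%} of the variance")
print(f"First 10 PCs explain {cum[9]:.1%}")
```

Plotting `evr` against the component index gives the scree plot, and `cum` gives the cumulative explained variance curve.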

PCA Biplot

The features that correlate most strongly with "PC1" are those whose vector arrows are longest along the x-axis. We can also plot "PC1" vs. "PC2" for all of the data.

PCA1 vs PCA2 with Fraud Highlighted

It is interesting to see that "PC1" and "PC2" alone almost perfectly separate the fraud and non-fraud transactions. Based on the PCA biplot, we select our features and define our feature matrix and target vector.

From this exploration we have selected the following features for our model:

'V1','V2','V3','V4','V5','V6','V7','V9','V10','V11','V12','V14','V16','V17','V18','V19'
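Building the feature matrix and target vector from that list might look like the following sketch. The dataframe here is a hypothetical stand-in; the real dataset has columns `V1`..`V28` plus `Amount`, `Time`, and the `Class` label:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in dataframe with the anonymized V1..V28 columns
# and a binary "Class" label, mirroring the real dataset's schema.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 28)),
                  columns=[f"V{i}" for i in range(1, 29)])
df["Class"] = rng.binomial(1, 0.05, size=100)

# The features chosen from the PCA biplot.
selected = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V9', 'V10', 'V11',
            'V12', 'V14', 'V16', 'V17', 'V18', 'V19']

X = df[selected].to_numpy()   # feature matrix, one column per feature
y = df["Class"].to_numpy()    # target vector (1 = fraud)
print(X.shape, y.shape)
```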

Modeling - Logistic Regression

Since we are dealing with a binary classification problem (fraud or not fraud), logistic regression is a natural starting approach. For the purposes of this project, we will stick to a logistic regression model.

We used "Scikit-Learn" as our modeling package. The hyperparameters we selected were as follows:

  • loss function = 'log_loss' (logistic)
  • max iterations = 10,000
  • learning rate = 'optimal'

This is where things get interesting. We want to avoid false negatives at all costs while preserving a streamlined payment experience for customers. In this context, a false negative occurs when our model classifies a transaction as "no fraud" when in reality it was fraudulent. Think of the model as the text-message fraud alert system your bank uses: when it detects an unusual transaction, it asks you for further confirmation, or outright blocks the transaction.

To this end, we want to select a threshold that dictates our decision on whether to classify a transaction as fraudulent. This threshold (τ) should minimize the false negative rate while preserving a low false positive rate. By default, τ = 0.5, which produces the highest overall accuracy. So how high a false positive rate are we willing to accept in order to reduce the false negative rate?

We can illustrate this tradeoff with the following ROC-AUC graph:

ROC-AUC Curve

Since FNR = 1 - TPR, maximizing TPR minimizes FNR. We can therefore see that by only subtly decreasing τ, we greatly increase TPR, and thus decrease FNR, at only a minor increase in FPR. The specific choice of τ depends on how large an FPR we are comfortable with. If we want an FPR no greater than 2%, this corresponds to a threshold of τ = 0.484.
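The threshold search under an FPR cap can be sketched with scikit-learn's `roc_curve`. This assumes a fitted probabilistic model; here a quick `LogisticRegression` on synthetic data stands in for the real one, and the 2% cap matches the text above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic scores; in the real project these come from the fitted
# model's predict_proba on held-out data.
rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + rng.normal(size=5000) > 1.5).astype(int)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# roc_curve returns one (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y, proba)

# Pick the largest TPR (i.e. smallest FNR) subject to FPR <= 2%.
mask = fpr <= 0.02
best = np.argmax(tpr[mask])
tau = thresholds[mask][best]
print(f"tau={tau:.3f}  FPR={fpr[mask][best]:.3%}  FNR={1 - tpr[mask][best]:.3%}")
```

Because `roc_curve` sorts thresholds from high to low, the masked points are exactly the low-FPR end of the curve, and taking the maximum TPR within the mask implements the tradeoff described above.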

By only slightly decreasing the threshold, we were able to decrease the FNR by 0.16%. While this is a minor optimization, a larger tolerance for false positives would yield a larger decrease, though FNR falls more slowly than FPR rises as the threshold decreases.