Random Forest for Fraud Detection¶
Random Forest is a versatile machine learning algorithm; like CART, it is capable of performing both regression and classification tasks. It is an ensemble learning method, built on the principle that a group of weak learners can combine to form a strong learner.
In this case, the Random Forest algorithm builds multiple decision trees and merges their predictions to obtain a more accurate and stable prediction than any single tree would give.
Building Blocks of Random Forest¶
Decision Trees¶
As we've seen earlier, a decision tree is a flowchart-like structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome.
Bootstrapping¶
Random Forest uses the technique of bootstrapping to create different subsets of the original dataset. Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, sampled with replacement. This is particularly useful for capturing the variance of the data when the dataset is large. The subsets are known as bootstrapped datasets.
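As a quick illustration (not part of the fraud pipeline; the array and seed are arbitrary), here is how a single bootstrapped dataset can be drawn with NumPy. Because sampling is with replacement, some rows repeat and others are left out; on average only about 63% of the unique rows appear in a given bootstrap sample.

import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)  # stand-in for the original dataset
# Draw 10 observations *with replacement*: some repeat, some are omitted
bootstrap = rng.choice(original, size=original.size, replace=True)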
Algorithm idea¶
1. Draw a random bootstrap sample of size $n$ (with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
   - Randomly select $d$ features without replacement.
   - Split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain.
3. Repeat steps 1-2 $k$ times.
4. Aggregate the predictions of the $k$ trees to assign the class label by majority vote (for classification) or to compute the mean prediction (for regression), as sketched in the code below.
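A minimal sketch of this procedure, using scikit-learn's DecisionTreeClassifier as the base learner and NumPy arrays for X and y. The function names, k, d, and seed are illustrative assumptions, not part of the original notebook:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=100, d='sqrt', seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(k):                                        # step 3: repeat k times
        idx = rng.choice(len(X), size=len(X), replace=True)   # step 1: bootstrap sample
        tree = DecisionTreeClassifier(max_features=d)         # step 2: d features per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # shape (k, n_samples)
    # step 4: majority vote per sample (assumes integer class labels)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)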
Hyperparameters of Random Forest¶
Some of the important hyperparameters in Random Forest include:
n_estimators
: The number of trees in the forest.

max_features
: The number of features to consider when looking for the best split.

max_depth
: The maximum depth of the tree.

min_samples_split
: The minimum number of samples required to split an internal node.

min_samples_leaf
: The minimum number of samples required to be at a leaf node.
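These names are scikit-learn's. A short illustration of setting them (the values below are arbitrary choices for demonstration, not tuned for this problem):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features='sqrt',   # features considered when looking for the best split
    max_depth=10,          # maximum depth of each tree
    min_samples_split=2,   # minimum samples required to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
)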
Advantages and Disadvantages of using Random Forest¶
Advantages¶
- Can be used for both regression and classification problems.
- Handles large datasets with high dimensionality.
- Can handle missing values and tends to maintain accuracy when a portion of the data is missing.
Disadvantages¶
- Random Forests are slow at generating predictions because every one of the many decision trees must be evaluated.
- They are more complex and require more computational resources than a single decision tree.
- They are often not easily interpretable.
Random Forest for Fraud Detection with H2O and scikit-learn¶
import warnings
warnings.simplefilter('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve, classification_report
import h2o
from h2o.frame import H2OFrame
from h2o.estimators.random_forest import H2ORandomForestEstimator
%matplotlib inline
Load the Two Datasets¶
data = pd.read_csv('./data/Fraud/Fraud_Data.csv', parse_dates=['signup_time', 'purchase_time'])
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 |
1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 |
2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 |
3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 |
4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 |
address2country = pd.read_csv('./data/Fraud/IpAddress_to_Country.csv')
address2country.head()
| | lower_bound_ip_address | upper_bound_ip_address | country |
---|---|---|---|
0 | 16777216.0 | 16777471 | Australia |
1 | 16777472.0 | 16777727 | China |
2 | 16777728.0 | 16778239 | China |
3 | 16778240.0 | 16779263 | Australia |
4 | 16779264.0 | 16781311 | China |
Add Country to Fraud Data¶
As we saw earlier, the country information is missing from the fraud data. We can use the IP address to look it up: for each row, we mask the `ip_address` value against the `lower_bound_ip_address` and `upper_bound_ip_address` columns of the country table to find the matching country, as you can see in the output below.
# Merge the two datasets and print the first 5 rows
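The body of this cell was not preserved in the export; the following is a minimal sketch that produces a table like the one below, assuming a linear range lookup per row (slow but clear; unmatched addresses fall back to 'NA'):

def ip_to_country(ip):
    # Find the row whose [lower, upper] IP range contains this address
    match = address2country[
        (address2country['lower_bound_ip_address'] <= ip) &
        (address2country['upper_bound_ip_address'] >= ip)
    ]['country']
    return match.iloc[0] if len(match) else 'NA'

data['country'] = data['ip_address'].apply(ip_to_country)
data.head()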
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan |
1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States |
2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States |
3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA |
4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States |
Feature Engineering¶
In this part we do some feature engineering, creating new features from the existing ones:
- The time difference between sign-up time and purchase time.
- Whether the device id is unique or shared by several users (many different user ids on the same device could indicate fake accounts).
- The same for the IP address: many different users sharing one IP address could indicate fake accounts.
- The day of the week and week of the year from the two time variables.
# Get the time difference between purchase time and signup time
# Check user number for unique devices
# Check user number for unique ip_address
# Signup day and week
# Purchase day and week
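The cell bodies were not preserved; the sketch below reproduces features consistent with the output that follows. Note that the time_diff values in the output match the timedelta's seconds component (.dt.seconds), which ignores whole days; .dt.total_seconds() would capture the full gap. On older pandas, .dt.week replaces .dt.isocalendar().week.

data['time_diff'] = (data['purchase_time'] - data['signup_time']).dt.seconds
data['device_num'] = data.groupby('device_id')['user_id'].transform('count')
data['ip_num'] = data.groupby('ip_address')['user_id'].transform('count')
data['signup_day'] = data['signup_time'].dt.dayofweek            # 0 = Monday
data['signup_week'] = data['signup_time'].dt.isocalendar().week  # ISO week of year
data['purchase_day'] = data['purchase_time'].dt.dayofweek
data['purchase_week'] = data['purchase_time'].dt.isocalendar().week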
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | time_diff | device_num | ip_num | signup_day | signup_week | purchase_day | purchase_week |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan | 13882 | 1 | 1 | 1 | 9 | 5 | 16 |
1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States | 17944 | 1 | 1 | 6 | 23 | 0 | 24 |
2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States | 1 | 12 | 12 | 3 | 1 | 3 | 1 |
3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA | 60085 | 1 | 1 | 1 | 18 | 0 | 19 |
4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States | 41461 | 1 | 1 | 1 | 30 | 2 | 37 |
# Define features and target to be used
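A sketch consistent with the columns shown below; here we simply keep the engineered features plus the class label:

columns = ['signup_day', 'signup_week', 'purchase_day', 'purchase_week',
           'purchase_value', 'source', 'browser', 'sex', 'age', 'country',
           'time_diff', 'device_num', 'ip_num', 'class']
data = data[columns]
data.head()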
| | signup_day | signup_week | purchase_day | purchase_week | purchase_value | source | browser | sex | age | country | time_diff | device_num | ip_num | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 9 | 5 | 16 | 34 | SEO | Chrome | M | 39 | Japan | 13882 | 1 | 1 | 0 |
1 | 6 | 23 | 0 | 24 | 16 | Ads | Chrome | F | 53 | United States | 17944 | 1 | 1 | 0 |
2 | 3 | 1 | 3 | 1 | 15 | SEO | Opera | M | 53 | United States | 1 | 12 | 12 | 1 |
3 | 1 | 18 | 0 | 19 | 44 | SEO | Safari | M | 41 | NA | 60085 | 1 | 1 | 0 |
4 | 1 | 30 | 2 | 37 | 39 | Ads | Safari | M | 45 | United States | 41461 | 1 | 1 | 0 |
# Split into 70% training and 30% test dataset
# Define features and target
# Build random forest model
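A sketch of these three steps with the H2O API imported above; the split ratio matches the comment, but the seed and tree count are assumptions, not values recovered from the notebook:

h2o.init()
hf = H2OFrame(data)
hf['class'] = hf['class'].asfactor()  # classification target must be a factor
train, test = hf.split_frame(ratios=[0.7], seed=42)  # 70% train / 30% test
features = [c for c in hf.columns if c != 'class']
rf = H2ORandomForestEstimator(ntrees=50, seed=42)
rf.train(x=features, y='class', training_frame=train)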
drf Model Build progress: |███████████████████████████████████████████████| 100%
# Feature importance
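H2O models expose variable importances directly; a minimal sketch:

rf.varimp_plot()            # bar chart of scaled importances
rf.varimp(use_pandas=True)  # importances as a pandas DataFrame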
# Classification report
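A sketch that produces the report below: H2O's predict() returns a frame with a predict column plus per-class probabilities, which we pull back into pandas for scikit-learn's classification_report:

pred_df = rf.predict(test).as_data_frame()
y_pred = pred_df['predict'].astype(int)
y_true = test['class'].as_data_frame()['class'].astype(int)
print(classification_report(y_true, y_pred))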
             precision    recall  f1-score   support

          0       0.95      1.00      0.98     41088
          1       1.00      0.53      0.69      4245

avg / total       0.96      0.96      0.95     45333
# Plot ROC curve and calculate AUC
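A sketch using the p1 column (the predicted probability of class 1) from the prediction frame above:

fpr, tpr, thresholds = roc_curve(y_true, pred_df['p1'])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='ROC curve (AUC = %.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()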
Based on the ROC curve, if we care about minimizing false positives, we would choose a cut-off that gives a true positive rate of ~0.5 at a false positive rate of almost zero (essentially the random forest's default output). However, if we care about maximizing the true positive rate, we have to lower the cut-off. That way we classify more events as "1": some will be true ones (so the true positive rate goes up), but many, unfortunately, will be false ones (so the false positive rate goes up as well).
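To make this trade-off concrete, one can scan the thresholds returned by roc_curve; the 1% false-positive budget below is an arbitrary example for illustration, not a recommendation from the original analysis:

budget = 0.01                # tolerate at most 1% false positives
ok = fpr <= budget
best = np.argmax(tpr * ok)   # highest TPR among points within the FPR budget
print('cut-off %.3f -> TPR %.3f at FPR %.4f'
      % (thresholds[best], tpr[best], fpr[best]))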