Random Forest for Fraud Detection¶
Random Forest is a versatile machine learning algorithm; like CART, it is capable of performing both regression and classification tasks. It is an ensemble learning method, built on the principle that a group of weak learners can combine to form a strong learner.
In this case, the Random Forest algorithm builds multiple decision trees and merges their predictions to obtain a more accurate and stable prediction than any single tree would give.
Building Blocks of Random Forest¶
Decision Trees¶
As we've seen earlier, a decision tree is a flowchart-like structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome.
Bootstrapping¶
Random Forest uses the technique of bootstrapping to create different subsets of the original dataset. Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, sampled with replacement. This is particularly useful for capturing the variance of the data when the dataset is large. The subsets are known as bootstrapped datasets.
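As a quick illustration (not part of the fraud pipeline; the array and seed are arbitrary), here is how a single bootstrapped dataset can be drawn with NumPy. Because sampling is with replacement, some rows repeat and others are left out; on average only about 63% of the unique rows appear in a given bootstrap sample.

import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)  # stand-in for the original dataset
# Draw 10 observations *with replacement*: some repeat, some are omitted
bootstrap = rng.choice(original, size=original.size, replace=True)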
Algorithm idea¶
1. Draw a random bootstrap sample of size $n$ (with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
   - Randomly select $d$ features without replacement.
   - Split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain.
3. Repeat steps 1-2 $k$ times.
4. Aggregate the predictions of the $k$ trees to assign the class label by majority vote (for classification) or to compute the mean prediction (for regression), as sketched in the code below.
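A minimal sketch of this procedure, using scikit-learn's DecisionTreeClassifier as the base learner and NumPy arrays for X and y. The function names, k, d, and seed are illustrative assumptions, not part of the original notebook:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=100, d='sqrt', seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(k):                                        # step 3: repeat k times
        idx = rng.choice(len(X), size=len(X), replace=True)   # step 1: bootstrap sample
        tree = DecisionTreeClassifier(max_features=d)         # step 2: d features per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # shape (k, n_samples)
    # step 4: majority vote per sample (assumes integer class labels)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)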
Hyperparameters of Random Forest¶
Some of the important hyperparameters in Random Forest include:
n_estimators
: The number of trees in the forest.

max_features
: The number of features to consider when looking for the best split.

max_depth
: The maximum depth of the tree.

min_samples_split
: The minimum number of samples required to split an internal node.

min_samples_leaf
: The minimum number of samples required to be at a leaf node.
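These names are scikit-learn's. A short illustration of setting them (the values below are arbitrary choices for demonstration, not tuned for this problem):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features='sqrt',   # features considered when looking for the best split
    max_depth=10,          # maximum depth of each tree
    min_samples_split=2,   # minimum samples required to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
)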
Advantages and Disadvantages of using Random Forest¶
Advantages¶
- Can be used for both regression and classification problems.
- Handles large datasets with high dimensionality.
- Can handle missing values and tends to maintain accuracy when a portion of the data is missing.
Disadvantages¶
- Random Forests are slow at generating predictions because every one of the many decision trees must be evaluated.
- They are more complex and require more computational resources than a single decision tree.
- They are often not easily interpretable.
Random Forest for Fraud Detection with H2O and scikit-learn¶
import warnings
warnings.simplefilter('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve, classification_report
import h2o
from h2o.frame import H2OFrame
from h2o.estimators.random_forest import H2ORandomForestEstimator
%matplotlib inline
Load the Two Datasets¶
data = pd.read_csv('./data/Fraud/Fraud_Data.csv', parse_dates=['signup_time', 'purchase_time'])
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 |
1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 |
2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 |
3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 |
4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 |
address2country = pd.read_csv('./data/Fraud/IpAddress_to_Country.csv')
address2country.head()
| | lower_bound_ip_address | upper_bound_ip_address | country |
---|---|---|---|
0 | 16777216.0 | 16777471 | Australia |
1 | 16777472.0 | 16777727 | China |
2 | 16777728.0 | 16778239 | China |
3 | 16778240.0 | 16779263 | Australia |
4 | 16779264.0 | 16781311 | China |
Add Country to Fraud Data¶
As we saw earlier, the country information is missing from the fraud data. We can use the IP address to look it up: for each row, we mask the `ip_address` value against the `lower_bound_ip_address` and `upper_bound_ip_address` columns of the country table to find the matching country, as you can see in the output below.
# Merge the two datasets and print the first 5 rows
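The body of this cell was not preserved in the export; the following is a minimal sketch that produces a table like the one below, assuming a linear range lookup per row (slow but clear; unmatched addresses fall back to 'NA'):

def ip_to_country(ip):
    # Find the row whose [lower, upper] IP range contains this address
    match = address2country[
        (address2country['lower_bound_ip_address'] <= ip) &
        (address2country['upper_bound_ip_address'] >= ip)
    ]['country']
    return match.iloc[0] if len(match) else 'NA'

data['country'] = data['ip_address'].apply(ip_to_country)
data.head()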
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan |
1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States |
2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States |
3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA |
4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States |
Feature Engineering¶
In this part we do some feature engineering, creating new features from the existing ones:
- The time difference between sign-up time and purchase time.
- Whether the device id is unique or shared by several users (many different user ids on the same device could indicate fake accounts).
- The same for the IP address: many different users sharing one IP address could indicate fake accounts.
- The day of the week and week of the year from the two time variables.
# Get the time difference between purchase time and signup time
# Check user number for unique devices
# Check user number for unique ip_address
# Signup day and week
# Purchase day and week
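The cell bodies were not preserved; the sketch below reproduces features consistent with the output that follows. Note that the time_diff values in the output match the timedelta's seconds component (.dt.seconds), which ignores whole days; .dt.total_seconds() would capture the full gap. On older pandas, .dt.week replaces .dt.isocalendar().week.

data['time_diff'] = (data['purchase_time'] - data['signup_time']).dt.seconds
data['device_num'] = data.groupby('device_id')['user_id'].transform('count')
data['ip_num'] = data.groupby('ip_address')['user_id'].transform('count')
data['signup_day'] = data['signup_time'].dt.dayofweek            # 0 = Monday
data['signup_week'] = data['signup_time'].dt.isocalendar().week  # ISO week of year
data['purchase_day'] = data['purchase_time'].dt.dayofweek
data['purchase_week'] = data['purchase_time'].dt.isocalendar().week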
data.head()
| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | time_diff | device_num | ip_num | signup_day | signup_week | purchase_day | purchase_week |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 7.327584e+08 | 0 | Japan | 13882 | 1 | 1 | 1 | 9 | 5 | 16 |
1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 3.503114e+08 | 0 | United States | 17944 | 1 | 1 | 6 | 23 | 0 | 24 |
2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2.621474e+09 | 1 | United States | 1 | 12 | 12 | 3 | 1 | 3 | 1 |
3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3.840542e+09 | 0 | NA | 60085 | 1 | 1 | 1 | 18 | 0 | 19 |
4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 4.155831e+08 | 0 | United States | 41461 | 1 | 1 | 1 | 30 | 2 | 37 |
# Define features and target to be used
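A sketch consistent with the columns shown below; here we simply keep the engineered features plus the class label:

columns = ['signup_day', 'signup_week', 'purchase_day', 'purchase_week',
           'purchase_value', 'source', 'browser', 'sex', 'age', 'country',
           'time_diff', 'device_num', 'ip_num', 'class']
data = data[columns]
data.head()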
| | signup_day | signup_week | purchase_day | purchase_week | purchase_value | source | browser | sex | age | country | time_diff | device_num | ip_num | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 9 | 5 | 16 | 34 | SEO | Chrome | M | 39 | Japan | 13882 | 1 | 1 | 0 |
1 | 6 | 23 | 0 | 24 | 16 | Ads | Chrome | F | 53 | United States | 17944 | 1 | 1 | 0 |
2 | 3 | 1 | 3 | 1 | 15 | SEO | Opera | M | 53 | United States | 1 | 12 | 12 | 1 |
3 | 1 | 18 | 0 | 19 | 44 | SEO | Safari | M | 41 | NA | 60085 | 1 | 1 | 0 |
4 | 1 | 30 | 2 | 37 | 39 | Ads | Safari | M | 45 | United States | 41461 | 1 | 1 | 0 |
# Split into 70% training and 30% test dataset
# Define features and target
# Build random forest model
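A sketch of these three steps with the H2O API imported above; the split ratio matches the comment, but the seed and tree count are assumptions, not values recovered from the notebook:

h2o.init()
hf = H2OFrame(data)
hf['class'] = hf['class'].asfactor()  # classification target must be a factor
train, test = hf.split_frame(ratios=[0.7], seed=42)  # 70% train / 30% test
features = [c for c in hf.columns if c != 'class']
rf = H2ORandomForestEstimator(ntrees=50, seed=42)
rf.train(x=features, y='class', training_frame=train)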
drf Model Build progress: |███████████████████████████████████████████████| 100%
# Feature importance
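H2O models expose variable importances directly; a minimal sketch:

rf.varimp_plot()            # bar chart of scaled importances
rf.varimp(use_pandas=True)  # importances as a pandas DataFrame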
# Classification report
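A sketch that produces the report below: H2O's predict() returns a frame with a predict column plus per-class probabilities, which we pull back into pandas for scikit-learn's classification_report:

pred_df = rf.predict(test).as_data_frame()
y_pred = pred_df['predict'].astype(int)
y_true = test['class'].as_data_frame()['class'].astype(int)
print(classification_report(y_true, y_pred))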
             precision    recall  f1-score   support

          0       0.95      1.00      0.98     41088
          1       1.00      0.53      0.69      4245

avg / total       0.96      0.96      0.95     45333
# Plot ROC curve and calculate AUC
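A sketch using the p1 column (the predicted probability of class 1) from the prediction frame above:

fpr, tpr, thresholds = roc_curve(y_true, pred_df['p1'])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label='ROC curve (AUC = %.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()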
Based on the ROC curve, if we care about minimizing false positives, we would choose a cut-off that gives a true positive rate of ~0.5 at a false positive rate of almost zero (essentially the random forest's default output). However, if we care about maximizing the true positive rate, we have to lower the cut-off. That way we classify more events as "1": some will be true ones (so the true positive rate goes up), but many, unfortunately, will be false ones (so the false positive rate goes up as well).
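To make this trade-off concrete, one can scan the thresholds returned by roc_curve; the 1% false-positive budget below is an arbitrary example for illustration, not a recommendation from the original analysis:

budget = 0.01                # tolerate at most 1% false positives
ok = fpr <= budget
best = np.argmax(tpr * ok)   # highest TPR among points within the FPR budget
print('cut-off %.3f -> TPR %.3f at FPR %.4f'
      % (thresholds[best], tpr[best], fpr[best]))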