Fraud Identifier - ML (R)

This is a personal project using the credit-card-fraud data set to determine which machine learning classification method best predicts fraudulent activity.

KNN

K-nearest neighbours (KNN) is a non-parametric method for pattern recognition: each observation is assigned the class most common among its k nearest neighbours.

The mean accuracy for KNN using 10-fold cross-validation is 0.9946354. The model also shows that a k-value of 11 produces the maximum accuracy.
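As a rough illustration of how k-nearest-neighbour voting and 10-fold validation fit together, here is a minimal sketch in Python (the project itself was done in R). The function names, the interleaved fold assignment, and the toy data are assumptions made for this example; it is not the code or data behind the figures above.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Return the majority class among the k training points nearest to x."""
    # Rank training points by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cross_val_accuracy(X, y, k, folds=10):
    """Mean accuracy of k-NN over `folds` interleaved folds (an assumed scheme)."""
    scores = []
    for f in range(folds):
        test_idx = [i for i in range(len(X)) if i % folds == f]
        train_idx = [i for i in range(len(X)) if i % folds != f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Toy, made-up data: two well-separated clusters (NOT the fraud data set).
X = [(0.1 * i, 0.0) for i in range(20)] + [(10.0 + 0.1 * i, 10.0) for i in range(20)]
y = [0] * 20 + [1] * 20

# Score each candidate k and keep the one with the highest mean accuracy.
candidate_ks = [1, 3, 5, 7]
scores_by_k = {k: cross_val_accuracy(X, y, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: scores_by_k[k])
```

Choosing k this way mirrors the idea of picking the k-value with the maximum cross-validated accuracy; odd values of k avoid tied votes when there are two classes.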

 
Kmeans.png

Decision Tree

Rplot07.png

The decision tree is made up of a root node and internal nodes, in which each internal node represents a test on a feature.
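To make the idea of "a test on a feature" concrete, the sketch below searches for the single feature and threshold that best separate two classes, using Gini impurity as the splitting criterion. It is written in Python rather than the project's R; the Gini criterion, the function names, and the toy rows are assumptions for illustration and may differ from what the R tree method does internally.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (feature_index, threshold) minimising the weighted Gini impurity."""
    best = (None, None, float("inf"))
    n = len(rows)
    for j in range(len(rows[0])):
        # Try every observed value of feature j as a candidate threshold.
        for t in sorted({r[j] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[j] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue  # a split must send rows to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]

# Toy, made-up rows with two features; only feature 1 separates the classes.
rows = [(5.0, -6.0), (1.0, -5.5), (4.0, 0.2), (2.0, 0.5), (3.0, 1.0)]
labels = [1, 1, 0, 0, 0]
feature, threshold = best_split(rows, labels)
```

A full tree repeats this search recursively on each branch; the feature chosen at the root is the one the tree treats as most informative.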

The data set is made up of time, amount, and the anonymised variables V1 to V28. First we produce the plot on the left using the Tree method in R. This method identifies variables V14 and V17 as the variables best capable of detecting fraudulent activity. To determine the optimal tree size for the decision tree (the resulting tree is known as the Pruned Tree), the elbow graph below is used.

According to the Pruned Tree, V14 is the most valuable variable in identifying fraud.


Kmeans.png

Elbow Graph

  • minimum relative error: 0.4

  • optimal tree size: 2

Rplot03.png

Pruned Tree

Following the tree classification method, variable V14 is the most heavily weighted determinant.

Random Forest

Screenshot 2020-04-23 at 3.42.43 AM.png

This classification method operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' predictions.

The plot visualises the importance of each variable in the dataset to the Random Forest model.

The random forest determines V14 to be the most important variable for detecting credit card fraud, which coincides with the variable the decision tree deemed most important.
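The voting step of a random forest can be sketched in a few lines. The example below is Python rather than the project's R, and the three "trees" are hypothetical one-rule stand-ins with made-up thresholds; only the variable names V14 and V17 come from the data set described above.

```python
from collections import Counter

def forest_predict(trees, x):
    """Classify x by majority vote (the mode) over the trees' predictions."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for fitted trees: each applies one made-up threshold
# to a feature dictionary and returns 1 (fraud) or 0 (not fraud).
trees = [
    lambda x: 1 if x["V14"] < -5 else 0,
    lambda x: 1 if x["V17"] < -4 else 0,
    lambda x: 1 if x["V14"] < -3 else 0,
]

prediction = forest_predict(trees, {"V14": -6.0, "V17": 0.0})
```

In a real forest each tree is fitted on a bootstrap sample with a random subset of features, which is what makes the individual votes differ.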

Support Vector Machine (SVM)

The support vector machine algorithm works by placing a hyperplane between the data points in order to classify them. The goal is to position the hyperplane so that its distance to the nearest data point (the margin) is maximised. Running the SVM algorithm on the data set produced the following results:

 

Confusion matrix plot


Sensitivity: 0.993

Specificity: 1

Precision: 1

Recall: 0.993

F1: 0.997

Accuracy: 0.993
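For reference, the metrics listed above are all derived from the four cells of a confusion matrix. The sketch below (in Python rather than the project's R) shows the standard formulas; the function name and the example counts are made up for illustration and are not the counts behind the reported values.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion-matrix counts for the positive class."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "f1": f1,
        "accuracy": accuracy,
    }

# Made-up counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)
```

Sensitivity and recall are the same quantity, which is why the two values reported above are identical.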

Rplot04.png
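The maximum-margin idea behind the SVM can be illustrated by comparing two candidate hyperplanes on a toy data set. This Python sketch (the project itself used R) only measures the margin of a given hyperplane; it does not perform the optimisation an SVM solver carries out, and the points and hyperplanes are made up.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points):
    """Distance from the hyperplane to its nearest data point."""
    return min(abs(signed_distance(w, b, p)) for p in points)

def classify(w, b, x):
    """Assign a class from the side of the hyperplane x falls on."""
    return 1 if signed_distance(w, b, x) >= 0 else 0

# Toy, made-up points: class 0 on the left, class 1 on the right.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]

# Two candidate separators: the vertical lines x = 2.5 and x = 2.0.
margin_centred = margin((1.0, 0.0), -2.5, points)
margin_offset = margin((1.0, 0.0), -2.0, points)
```

Both lines separate the classes, but the centred one leaves more room to the nearest point, so it is the one a maximum-margin classifier would prefer.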

 

Neural Network

This is the most intricate of the classification methods. It uses artificial neural networks, which are loosely inspired by the biological neural networks that constitute animal brains. Without being programmed with task-specific rules, the algorithm "learns" to identify fraud from the credit-card-fraud data provided.

The ‘tanh’ activation function was used in the Keras model, as it produced the lowest loss compared with the ‘relu’ and ‘sigmoid’ activation functions. The following plot shows the fitted lines of the training and validation losses and accuracies.

The training loss decreases and follows the trajectory of the validation loss. The training loss falls below the validation loss by epoch 8, suggesting a successful model. Furthermore, the accuracy plot shows the training accuracy meeting or outperforming the validation accuracy by epoch 7.
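For readers unfamiliar with the three activation functions compared above, their standard definitions can be written out directly. This is a plain-Python sketch for illustration (the project itself used Keras from R); it shows only the functions, not the network or its training.

```python
import math

def relu(z):
    """Rectified linear unit: passes positive inputs, zeroes out negatives."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes any real input into (-1, 1), centred on 0."""
    return math.tanh(z)

# Evaluate each activation on a few sample inputs.
samples = [-2.0, 0.0, 2.0]
table = {name: [fn(z) for z in samples]
         for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh)]}
```

Unlike sigmoid, tanh is zero-centred, which is one commonly cited reason it can train more smoothly on standardised inputs; whether that explains the result here is not something the source states.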

Result

The neural network classification method produces the highest accuracy, with a value of 1. One should, therefore, use this model to identify credit card fraud.

Considering the scale of data that banks deal with, processing time may be a concern. The Random Forest method produced the second-highest classification accuracy and is also less computationally expensive, as a GPU is not required to finish training. Hence, banks may also be inclined to use Random Forest.

accuracy+bar.jpg