---
title: 'CREDIT CARD FRAUD DETECTION'
disqus: hackmd
---
# CREDIT CARD FRAUD DETECTION
## Table of Contents
[TOC]
## Introduction
Credit card fraud affects millions of people each year, and 34% of CreditDonkey readers report being victimized at some point. Putting a price tag on credit card fraud is no easy task, but the Nilson Report reported that in 2016, losses topped $22.8 billion. That represents a 4.4% increase over the previous year.
Credit card fraud is happening at all times of the day and night. According to a report from Javelin Strategy, there's a new identity theft victim every two seconds, and many of the incidents involve credit cards.
Therefore, it is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
This project applies predictive models to credit card transactions and evaluates how accurately each model detects fraud, so that the system can be alerted and criminal activity prevented.
## Methodology
### Data
#### Overview
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.
- Features **V1, V2, … V28** are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. These features are already scaled.
- Feature **'Time'** contains the seconds elapsed between each transaction and the first transaction in the dataset.
- The feature **'Amount'** is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
- Feature **'Class'** is the response variable and it takes value 1 in case of fraud and 0 otherwise.
#### EDA:
- The transaction amounts are relatively small: the mean of all amounts is approximately USD 88.
- There are no "Null" values, so we don't have to work on ways to replace values.
- Most transactions are non-fraud (99.83% of the time), while fraud transactions occur only 0.17% of the time in the dataframe.
- This step also includes checking feature correlations and outliers.
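A minimal sketch of these EDA checks with pandas; the small synthetic dataframe below is only a stand-in for the real Kaggle `creditcard.csv` (columns Time, V1…V28, Amount, Class):

```python
import numpy as np
import pandas as pd

# Stand-in for the real dataset; the project would instead do
# df = pd.read_csv("creditcard.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Time": rng.integers(0, 172800, size=1000),        # seconds over two days
    "Amount": rng.exponential(88, size=1000),          # small amounts, mean ~88
    "Class": (rng.random(1000) < 0.00172).astype(int), # heavy class imbalance
})

print(df["Amount"].mean())                       # mean transaction amount
print(df.isnull().sum().sum())                   # 0 -> no missing values
print(df["Class"].value_counts(normalize=True))  # class imbalance ratio
print(df.corr())                                 # pairwise feature correlations
```

The same `isnull`, `value_counts`, and `corr` calls apply unchanged to the full dataframe.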
#### Sub-sampling:
##### Undersampling
A sub-sample of the dataframe with a 50/50 ratio of fraud and non-fraud transactions helps our algorithms better learn the patterns that determine whether a transaction is fraudulent.
There are 492 cases of fraud in our dataset, so we can randomly sample 492 non-fraud cases and then concatenate the two sets to create a new 50/50 sub-sample.
However, there is a risk that the classification model will not perform as accurately as we would like, since a great deal of information is lost (keeping only 492 of the 284,315 non-fraud transactions).
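The undersampling step above can be sketched as follows; the dataframe here is synthetic (49 fraud rows among 5,000), standing in for the real data's 492 frauds among 284,807 transactions:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same imbalance shape as the real data.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "V1": rng.normal(size=5000),
    "Class": [1] * 49 + [0] * 4951,
})

fraud = df[df["Class"] == 1]
# Randomly sample the same number of non-fraud rows as there are fraud rows.
non_fraud = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)

# Concatenate and shuffle to obtain the 50/50 sub-sample.
sub_sample = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42)
print(sub_sample["Class"].value_counts())  # equal counts of class 0 and 1
```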
##### Oversampling (SMOTE)
SMOTE creates synthetic points from the minority class in order to reach an equal balance between the minority and majority class.
More information is retained since, unlike in random undersampling, we do not delete any rows. However, training will take more time precisely because no rows are eliminated.
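To illustrate the idea, here is a NumPy-only sketch of SMOTE-style interpolation between minority points and their nearest minority neighbours; a real project would use `imblearn.over_sampling.SMOTE` rather than this hand-rolled version:

```python
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(loc=2.0, size=(20, 2))   # e.g. fraud rows
majority = rng.normal(loc=0.0, size=(200, 2))  # non-fraud rows

def smote_like(X, n_new, k=5, rng=rng):
    """Create synthetic points by interpolating between a random minority
    point and one of its k nearest minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)   # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]    # k nearest, excluding the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                     # random position along the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

new_points = smote_like(minority, n_new=len(majority) - len(minority))
balanced_minority = np.vstack([minority, new_points])
print(balanced_minority.shape)  # (200, 2) -> classes now balanced
```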
:::info
Read more about imbalance data solution:
- https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation
- https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
- https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/
:::
## Models
First of all, the data will be split into a train set and a test set. Then, the following models will be tried:
- LogisticRegression
- KNeighborsClassifier
- Support Vector Classifier
- Decision Tree Classifier
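A sketch of fitting the four candidate models on a train/test split; the two-class synthetic data below stands in for the balanced sub-sample of real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced two-class data standing in for the real sub-sample.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

classifiers = {
    "LogisticRegression": LogisticRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "SVC": SVC(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the held-out set
    print(name, round(scores[name], 3))
```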
## Optimizer
- GridSearchCV is used to determine the parameters that give the best predictive score for each classifier.
- The models will be scored on F1, recall, and precision, computed from the confusion matrix.
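A sketch of the tuning-and-scoring step for one of the classifiers; the parameter grid here is only an illustrative choice, and the synthetic data stands in for the real sub-sample:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Hypothetical grid for one classifier; each model would get its own grid.
grid = GridSearchCV(LogisticRegression(),
                    {"C": [0.01, 0.1, 1, 10]},
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)

pred = grid.best_estimator_.predict(X_test)
print("best params:", grid.best_params_)
print(confusion_matrix(y_test, pred))       # [[TN, FP], [FN, TP]]
print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
```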