---
title: "BT2101"
output: html_document
---
Objective:
Identify employees who are more likely to attrit, in order to increase the employee retention rate and to better prepare the company to replace such employees.
Hypotheses:
1) People who leave because of their age
2) People who leave because they are job hoppers: not age itself, but a function of total time spent working and the number of previous companies, which may indicate who is a job hopper (and hence likely to attrit). This is the end goal.
3) People who leave because they are not appreciated or promoted
4) People who leave because of poor work/life balance or job satisfaction
5) People who leave because they are not paid enough (is Monthly Income the most important variable?)
The data were initially skewed towards employees who stayed, so we oversampled the minority class. However, even after oversampling, the model is still not desirable.
In Homework 5, we hypothesised that age would have a large impact on the decision to attrit, because the majority of attrited individuals came from the age group of 36 and under (Insert graph). We rationalised this by explaining that younger employees are more likely to attrit because they want to switch jobs in search of promotions and pay raises. However, Homework 7 showed this not to be the case: we found no significant difference in model performance after separating the dataset by age (insert tables).
Therefore, our group has attempted to further refine our original hypothesis.
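
One way the job-hopper hypothesis (hypothesis 2) could be made testable is to derive a "years per employer" feature. The sketch below is only an illustration on invented toy data: `TotalWorkingYears` is assumed to be a column of the dataset (only `NumCompaniesWorked` appears in the models in this document), and `job_toy`, `YearsPerCompany` and the `+ 1` adjustment are hypothetical choices, not something the group has settled on.

```{r}
# Toy data, invented for illustration only (not drawn from the dataset)
job_toy <- data.frame(
  TotalWorkingYears  = c(10, 4, 20, 6),
  NumCompaniesWorked = c(1, 4, 2, 0)
)
# Average years per employer; the + 1 is an assumed adjustment that
# avoids division by zero when NumCompaniesWorked is 0
job_toy$YearsPerCompany <- job_toy$TotalWorkingYears / (job_toy$NumCompaniesWorked + 1)
job_toy
```

A low value would flag someone who has changed employers often relative to their time in the workforce, which is closer to the job-hopper idea than age alone.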
```{r}
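# Load packages: Metrics provides auc(), ROSE provides ovun.sample(),
# ipred provides bagging(), caret provides confusionMatrix()
library(Metrics)
library(randomForest)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(readr)
library(ipred)
library(caret)
library(ROSE)
library(rpart)
# Set the random seed for reproducibility
set.seed(101)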
```
```{r}
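# Load the dataset; file.choose() opens an interactive file picker,
# so this chunk may fail when the document is knitted non-interactively
data <- read_csv(file.choose())
# Check the dimensions of the data
dim(data)
#[1] 1470 35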
```
```{r}
# Recode the outcome as 0/1 and store it as a factor
data$Attrition[data$Attrition == "Yes"] <- 1
data$Attrition[data$Attrition == "No"] <- 0
data$Attrition <- as.factor(data$Attrition)
# Convert the categorical predictors to factors
data$BusinessTravel <- as.factor(data$BusinessTravel)
data$Department <- as.factor(data$Department)
data$EducationField <- as.factor(data$EducationField)
data$Gender <- as.factor(data$Gender)
data$JobRole <- as.factor(data$JobRole)
data$MaritalStatus <- as.factor(data$MaritalStatus)
data$OverTime <- as.factor(data$OverTime)
# Drop columns that are not used for modelling
data$Over18 <- NULL
data$EmployeeCount <- NULL
data$StandardHours <- NULL
data$EmployeeNumber <- NULL
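# Divide the dataset into train (70%) and test (30%) sets
smp_size <- floor(0.70 * nrow(data))
train_ind <- sample(seq_len(nrow(data)), size = smp_size)
train <- data[train_ind, ]
test <- data[-train_ind, ]
table(train$Attrition)
# 0 1
#864 165
# Oversample the minority class in the training set only, until both classes
# have as many rows as the majority class (2 * 864 = 1728 rows for the split above)
data_balanced_over <- ovun.sample(Attrition ~ ., data = train, method = "over", N = 2 * max(table(train$Attrition)))$data
table(data_balanced_over$Attrition)
#0 1
#864 864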
```
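To make the oversampling step concrete, here is a minimal base-R sketch of random oversampling on invented toy data. It is a simplified analogue of the idea, not a reproduction of how `ovun.sample` works internally, and `over_toy`, `over_toy_balanced` and the other names are hypothetical.

```{r}
set.seed(101)
# Toy training set, invented for illustration: 8 stayers (0) and 2 leavers (1)
over_toy <- data.frame(
  Attrition = factor(c(rep(0, 8), rep(1, 2))),
  Age       = c(45, 38, 50, 41, 36, 29, 55, 33, 24, 27)
)
minority_rows <- which(over_toy$Attrition == 1)
majority_rows <- which(over_toy$Attrition == 0)
# Draw extra minority rows with replacement until both classes have equal counts
n_extra <- length(majority_rows) - length(minority_rows)
extra_rows <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
over_toy_balanced <- over_toy[c(majority_rows, minority_rows, extra_rows), ]
table(over_toy_balanced$Attrition)
```

Because the added rows are exact duplicates of existing minority rows, performance measured on the oversampled training set tends to be optimistic; the test set is left untouched so that it still reflects the original class balance.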
```{r}
# Bagging: 25 bootstrap-aggregated classification trees fitted to the oversampled training set
baggedtree <- bagging(Attrition ~ JobSatisfaction + Age + Education + MonthlyIncome + YearsSinceLastPromotion + NumCompaniesWorked, nbagg = 25, data = data_balanced_over)
```
```{r}
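# Evaluate the bagged model on the oversampled training set
actual_bag <- data_balanced_over$Attrition
# predict() does not recognise a `data` argument, so pass the data frame as `newdata`
fitted_bag <- predict(baggedtree, newdata = data_balanced_over)
confusionMatrix(fitted_bag, actual_bag, dnn = c("Prediction", "Reference"))
# Metrics::auc() expects (actual, predicted); here it is computed on class labels
auc(actual_bag, fitted_bag)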
```
```{r}
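# Evaluate the bagged model on the held-out test set
actual_bag2 <- test$Attrition
fitted_bag2 <- predict(baggedtree, newdata = test)
confusionMatrix(fitted_bag2, actual_bag2, dnn = c("Prediction", "Reference"))
# Metrics::auc() expects (actual, predicted); here it is computed on class labels
auc(actual_bag2, fitted_bag2)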
```
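The `auc` calls in this document are given predicted class labels rather than predicted probabilities. The sketch below, on invented toy values, shows why that matters: a rank-based AUC computed from probabilities uses the full ordering of the cases, while the same statistic computed from 0/1 labels collapses to the mean of sensitivity and specificity. `rank_auc` and the `auc_toy_*` objects are hypothetical helpers written for this illustration. For the models here, probabilities could presumably be obtained with `type = "response"` for `glm` and `type = "prob"` for the bagged trees.

```{r}
# Rank-based (Mann-Whitney) AUC; `actual` is 0/1, `score` is any numeric ranking
rank_auc <- function(actual, score) {
  r <- rank(score)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
# Toy values, invented for illustration only
auc_toy_actual <- c(0, 0, 0, 1, 1, 0, 1, 0)
auc_toy_prob   <- c(0.10, 0.30, 0.45, 0.40, 0.80, 0.20, 0.60, 0.55)
auc_toy_label  <- as.numeric(auc_toy_prob >= 0.5)
rank_auc(auc_toy_actual, auc_toy_prob)   # uses the full ranking of probabilities
rank_auc(auc_toy_actual, auc_toy_label)  # equals (sensitivity + specificity) / 2
```

On these toy values the probability-based AUC is higher than the label-based one, because thresholding discards the information about how confidently each case was ranked.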
```{r}
# Logistic regression fitted to the oversampled training set
logit.reg <- glm(Attrition ~ JobSatisfaction + Age + Education + MonthlyIncome + YearsSinceLastPromotion + NumCompaniesWorked, data = data_balanced_over, family = "binomial")
actual_logit <- data_balanced_over$Attrition
# Predicted probabilities of attrition, classified at a 0.5 threshold
prob_logit <- predict(logit.reg, type = "response")
# Fix the levels so both classes exist even if only one is predicted
fitted_logit <- factor(ifelse(prob_logit >= 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(fitted_logit, actual_logit, dnn = c("Prediction", "Reference"))
# Metrics::auc() expects (actual, predicted); here it is computed on class labels
auc(actual_logit, fitted_logit)
```
Logistic regression on the oversampled training set: accuracy = 0.64, AUC = 0.64.
```{r}
# Evaluate the logistic regression on the held-out test set
actual_logit2 <- test$Attrition
prob_logit2 <- predict(logit.reg, newdata = test, type = "response")
# Classify at a 0.5 threshold, keeping both factor levels
fitted_logit2 <- factor(ifelse(prob_logit2 >= 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(fitted_logit2, actual_logit2, dnn = c("Prediction", "Reference"))
# Metrics::auc() expects (actual, predicted); here it is computed on class labels
auc(actual_logit2, fitted_logit2)
```
Logistic regression on the test set: accuracy = 0.6168, AUC = 0.595.
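
A test accuracy such as 0.6168 is hard to judge on its own when the test set is imbalanced, because a classifier that always predicts "no attrition" already scores the majority-class share. The sketch below, on invented toy labels, compares accuracy with that baseline and with balanced accuracy; all object names (`bal_actual`, `bal_pred`, `bal_cm` and so on) are hypothetical and the numbers are not results from our data.

```{r}
# Toy test labels and predictions, invented for illustration (16 stayers, 4 leavers)
bal_actual <- factor(c(rep(0, 16), rep(1, 4)), levels = c(0, 1))
bal_pred   <- factor(c(rep(0, 10), rep(1, 6), 1, 1, 0, 0), levels = c(0, 1))
bal_cm <- table(Prediction = bal_pred, Reference = bal_actual)
# Share of correct predictions
bal_acc <- sum(diag(bal_cm)) / sum(bal_cm)
# Accuracy of always predicting the majority class
bal_baseline <- max(table(bal_actual)) / length(bal_actual)
# Recall on leavers (class 1) and on stayers (class 0)
bal_sens <- bal_cm["1", "1"] / sum(bal_cm[, "1"])
bal_spec <- bal_cm["0", "0"] / sum(bal_cm[, "0"])
bal_balanced <- (bal_sens + bal_spec) / 2
c(accuracy = bal_acc, baseline = bal_baseline, balanced_accuracy = bal_balanced)
```

In this toy case the accuracy is below the majority-class baseline, so sensitivity on the leavers and balanced accuracy are the more informative numbers to report next to it.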
```{r}
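# Undersample the majority class in the training set until both classes
# have as many rows as the minority class
data_balanced_under <- ovun.sample(Attrition ~ ., data = train, method = "under", N = 2 * min(table(train$Attrition)))$data
table(data_balanced_under$Attrition)
# Recorded output; the counts depend on the train/test split
#0 1
#167 167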
```
```{r}
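# Bagging on the undersampled training set, using all predictors
# (this overwrites the earlier `baggedtree` object)
baggedtree <- bagging(Attrition ~ ., nbagg = 25, data = data_balanced_under)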
```