
---
title: "BT2101"
output: html_document
---

Objective: Identify employees who are more likely to attrit, in order to increase the employee retention rate and to better prepare the company to replace such employees.

Hypothesis:

1) Age (the Homework 5 hypothesis discussed below)
2) People who leave because they are job hoppers. Not age itself, but a function of total time spent working and the number of previous companies, which indicates who is a job hopper (and hence likely to attrit) <- endgame
3) People who leave due to not being appreciated/promoted
4) People who leave due to poor work/life balance or job satisfaction
5) People who leave due to not being paid enough. Is Monthly Income the most important factor?

The data were skewed initially, so we oversampled. However, even after oversampling, the model's performance was still not desirable.

In Homework 5, we hypothesised that, because the majority of attrited individuals came from the age group of 36 and under (Insert graph), age would have a large impact on the decision to attrit. This was rationalized by explaining that younger employees are more likely to attrit because they want to switch jobs in search of promotions and pay raises. However, Homework 7 showed this not to be the case: we found no significant difference in model performance after separating the dataset by age (insert tables).
Therefore, our group has attempted to further refine our original hypothesis.

```{r}
library(Metrics)
library(randomForest)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(readr)
library(ipred)
library(caret)
library(ROSE)
library(rpart)

# set random seed
set.seed(101)
```

```{r}
# loading dataset
data <- read_csv(file.choose())

# checking dimensions of data
dim(data)
# [1] 1470   35
```

```{r}
# specifying outcome variable as factor (1 = attrited, 0 = stayed)
data$Attrition[data$Attrition == "Yes"] <- 1
data$Attrition[data$Attrition == "No"] <- 0
data$Attrition <- as.factor(data$Attrition)

# converting categorical predictors to factors
data$BusinessTravel <- as.factor(data$BusinessTravel)
data$Department <- as.factor(data$Department)
data$EducationField <- as.factor(data$EducationField)
data$Gender <- as.factor(data$Gender)
data$JobRole <- as.factor(data$JobRole)
data$MaritalStatus <- as.factor(data$MaritalStatus)
data$OverTime <- as.factor(data$OverTime)

# dropping columns that are not used as predictors
data$Over18 <- NULL
data$EmployeeCount <- NULL
data$StandardHours <- NULL
data$EmployeeNumber <- NULL

# dividing the dataset into train (70%) and test (30%)
smp_size <- floor(0.70 * nrow(data))
train_ind <- sample(seq_len(nrow(data)), size = smp_size)
train <- data[train_ind, ]
test <- data[-train_ind, ]

table(train$Attrition)
#   0   1
# 864 165

# oversampling the minority class in the training set only
data_balanced_over <- ovun.sample(Attrition ~ ., data = train, method = "over", N = 1728)$data
table(data_balanced_over$Attrition)
#   0   1
# 864 864
```

```{r}
# Bagging
baggedtree <- bagging(Attrition ~ JobSatisfaction + Age + Education + MonthlyIncome +
                        YearsSinceLastPromotion + NumCompaniesWorked,
                      nbagg = 25, data = data_balanced_over)
```

```{r}
# performance on the oversampled training set
actual_bag <- data_balanced_over$Attrition
# predict() expects `newdata`; a `data` argument is silently ignored
fitted_bag <- predict(baggedtree, newdata = data_balanced_over)
confusionMatrix(fitted_bag, actual_bag, dnn = c("Prediction", "Reference"))
# Metrics::auc() expects (actual, predicted), both numeric
auc(as.numeric(as.character(actual_bag)), as.numeric(as.character(fitted_bag)))
```

```{r}
# performance on the held-out test set
actual_bag2 <- test$Attrition
fitted_bag2 <- predict(baggedtree, newdata = test)
confusionMatrix(fitted_bag2, actual_bag2, dnn = c("Prediction", "Reference"))
auc(as.numeric(as.character(actual_bag2)), as.numeric(as.character(fitted_bag2)))
```
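The `auc()` calls above receive hard 0/1 class labels, so each AUC reduces to the average of sensitivity and specificity at a single cut-off. Ranking predicted probabilities instead uses the whole ROC curve. The sketch below illustrates the difference in base R on made-up values. `auc_from_scores`, `actual`, and `prob` are hypothetical names, and the rank-based formula is the standard Mann-Whitney form of AUC, not something taken from the models in this document.

```r
# Rank-based (Mann-Whitney) AUC.
# `actual` is a 0/1 vector, `score` is any numeric score (higher = more likely 1).
auc_from_scores <- function(actual, score) {
  r <- rank(score)                 # average ranks are used for ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy values for illustration only (not from the attrition dataset)
actual <- c(0, 0, 1, 0, 1, 1)
prob   <- c(0.10, 0.40, 0.45, 0.60, 0.70, 0.90)

# AUC from the probabilities themselves
auc_prob <- auc_from_scores(actual, prob)

# AUC after thresholding at 0.5, as done for the models above
auc_label <- auc_from_scores(actual, as.numeric(prob >= 0.5))

print(c(probabilities = auc_prob, hard_labels = auc_label))
```

For the logistic model, the probabilities are the values that `predict(logit.reg, type = "response")` returns before they are thresholded at 0.5.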
```{r}
# Logistic Regression
logit.reg <- glm(Attrition ~ JobSatisfaction + Age + Education + MonthlyIncome +
                   YearsSinceLastPromotion + NumCompaniesWorked,
                 data = data_balanced_over, family = "binomial")

# performance on the oversampled training set
actual_logit <- data_balanced_over$Attrition
fitted_logit <- predict(logit.reg, type = "response")
# classify as attrited (1) when the predicted probability is at least 0.5
fitted_logit <- ifelse(fitted_logit >= 0.5, 1, 0)
# fixing the levels keeps confusionMatrix() working even if only one class is predicted
fitted_logit <- factor(fitted_logit, levels = c(0, 1))
confusionMatrix(fitted_logit, actual_logit, dnn = c("Prediction", "Reference"))
# Metrics::auc() expects (actual, predicted), both numeric
auc(as.numeric(as.character(actual_logit)), as.numeric(as.character(fitted_logit)))
```

Oversampled training set: accuracy = 0.64, auc = 0.64

```{r}
# performance on the held-out test set
actual_logit2 <- test$Attrition
fitted_logit2 <- predict(logit.reg, newdata = test, type = "response")
fitted_logit2 <- ifelse(fitted_logit2 >= 0.5, 1, 0)
fitted_logit2 <- factor(fitted_logit2, levels = c(0, 1))
confusionMatrix(fitted_logit2, actual_logit2, dnn = c("Prediction", "Reference"))
auc(as.numeric(as.character(actual_logit2)), as.numeric(as.character(fitted_logit2)))
```

Test set: accuracy = 0.6168, auc = 0.595

```{r}
# undersampling the majority class in the training set
data_balanced_under <- ovun.sample(Attrition ~ ., data = train, method = "under", N = 334)$data
table(data_balanced_under$Attrition)
#   0   1
# 167 167
```

```{r}
# Bagging on the undersampled data, using all predictors
baggedtree <- bagging(Attrition ~ ., nbagg = 25, data = data_balanced_under)
```
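Hypothesis 2 describes job hopping as a function of total time spent working and the number of previous companies. One possible way to turn that into a single predictor is the average number of years spent per employer. The sketch below works on made-up rows. `toy` and `YearsPerCompany` are hypothetical names, and `TotalWorkingYears` is an assumed column name that is not used elsewhere in this document, so it should be checked against the dataset before use.

```r
# Toy rows for illustration only; the values are made up.
# TotalWorkingYears is an assumed column name, NumCompaniesWorked appears in the models above.
toy <- data.frame(
  TotalWorkingYears  = c(10, 10, 2, 20),
  NumCompaniesWorked = c(1, 7, 0, 4)
)

# Average tenure per employer.
# Assumption: NumCompaniesWorked counts previous employers only, so +1 adds the
# current company. The +1 also avoids division by zero.
toy$YearsPerCompany <- toy$TotalWorkingYears / (toy$NumCompaniesWorked + 1)

# Lower values suggest more frequent job changes
print(toy[order(toy$YearsPerCompany), ])
```

A derived column like this could replace `NumCompaniesWorked` in the model formulas to test whether tenure per employer separates attrited employees better than the raw count does.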