---
GA: UA-159972578-2
---
###### tags: `R` `ML` `Supervised Learning` `ML Models`
# Building Models

## <a id="n1"></a> <b>1. Linear Regression</b>

### <b>1.1 Check Normal Distribution</b>
【Plots】
+ Density plot
+ Box plot

【Normality tests】
Refer to [統計分析方法應用於R語言](https://hackmd.io/srJT3Y6SSOmPe8mqVmUODA)

### <b>1.2 Check the Data (Descriptive Statistics)</b>
```{r}
summary(data)
cor(data$a, data$b)

# The Pearson correlations among mpg, hp, and wt
Hmisc::rcorr(as.matrix(mtcars[, c("mpg","hp","wt")]), type = "pearson")
#       mpg    hp    wt
# mpg  1.00 -0.78 -0.87
# hp  -0.78  1.00  0.66
# wt  -0.87  0.66  1.00
```
Use the descriptive statistics to find the variables that are related to the outcome; these are the candidates for the model.

### <b>1.3 Build the Model</b>
```{r}
modelLin = lm(Y ~ x1 + x2 + x3 + x4, data = TR)
lm(log(Y2) ~ x1, data = TR)   # take the log when Y is a monetary amount
summary(modelLin)
```

【Variable selection by the computer: step】
```{r}
model1 = lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data = Before2006)
Bestmodel = step(model1)
summary(Bestmodel)
# CH4 was eliminated from the full model by the step function.
```

【RMSE on the training set】
```{r}
predTR = predict(modelLin)          # fitted values on the training data
SSE  = sum((TR$Y - predTR)^2)
RMSE = sqrt(SSE / nrow(TR))
```

【RMSE on the test set】
```{r}
predTS = predict(modelLin, newdata = TS)
SSE  = sum((predTS - TS$Y)^2)
RMSE = sqrt(SSE / nrow(TS))
```
RMSE is the root mean squared error: the smaller the gap between predictions and actual values, the more accurate the model, so the training RMSE is usually smaller than the test RMSE.
※ RMSE is on the same scale as Y, so it is not bounded between 0 and 100; its magnitude depends on the data.

【R-squared on the test set】
```{r}
predTS = predict(modelLin, newdata = TS)
SSE = sum((TS$Y - predTS)^2)
SST = sum((TS$Y - mean(TR$Y))^2)   # baseline: always predict the training-set mean
R2  = 1 - SSE/SST
```

【Check for multicollinearity】
+ A common rule of thumb is that sqrt(vif) > 2 indicates a multicollinearity problem.
+ NOTE: Multicollinearity often makes variables insignificant as it increases the S.E. of estimates.
+ NOTE2: Some researchers use 5 as the criterion.
```{r}
# Check multicollinearity using the VIF
vif(model)
sqrt(vif(model)) > 2
```
![](https://i.imgur.com/rvyycrx.png)

【Model selection】
```{r}
# linear model
toy.model = lm(y ~ x)
# alternative model
toy.model2 = lm(y ~ x + I(x^2))
# model comparison
library(lmtest)
anova(toy.model, toy.model2)    # F-test based on the residual sum of squares
lrtest(toy.model, toy.model2)   # based on the log-likelihood
```
![](https://i.imgur.com/qnBwoBX.png)

### Robust linear regression
```{r}
library(MASS)
lmfit = rlm(Quartet$y3 ~ Quartet$x)
summary(lmfit)
```

### Employ robust standard errors
+ A technique to obtain unbiased standard errors of OLS coefficients under heteroscedasticity.
+ ols() and lm() produce the same results, but ols() lets us use the robust SE.
```{r}
library(rms)
olsfit = ols(wages ~ age + sex + education, data = SLID, x = TRUE, y = TRUE)
# set x=TRUE, y=TRUE so robcov() can generate the robust SE
robcov(olsfit)
```

## <a id="n2"></a> <b>2. Logistic Regression</b>
+ Used when Y is a binary variable.

### <b>2.1 Prepare the Dataset</b>
【Check for missing values】
```{r}
summary(X)
is.na(X$a)
```

【Analyze the missing values: remove or impute】
```{r}
missing_data = subset(X, is.na(X$a) | is.na(X$c) | is.na(X$e))
# pick out the rows of X with at least one NA in the columns a, c, or e
# note this keeps the whole row, not just the missing cell
summary(missing_data)     # still contains every independent variable (a,b,c,d,e),
                          # including the columns without NAs (b, d)
nrow(missing_data)
table(missing_data$Y)     # look at the dependent variable Y to decide whether to
                          # remove the rows or impute the NAs:
                          # how many NA rows have Y = 1, out of all NA rows
```

【Imputation tool】
```{r}
library(mice)
set.seed(144)   # note that the imputation results may not be identical even with the seed set
vars.for.imputation = setdiff(names(X), "Y")        # drop Y from the imputation variables
imputed = complete(mice(X[vars.for.imputation]))    # impute the NAs using only the independent variables
```
Imputation predicts missing variable values for a given observation using the variable values that are reported. We called the imputation on a data frame with the dependent variable Y removed, so we predicted the missing values using only the other independent variables.
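If the imputed variables need to go back into a single data frame together with the outcome, a minimal sketch (reusing `imputed` from the mice() call above; `X.imputed` is a hypothetical name) is simply to bind the untouched dependent variable back on:
```{r}
# Sketch only: re-attach the untouched dependent variable Y to the imputed
# independent variables so the logistic regression below can use one data frame.
X.imputed = cbind(imputed, Y = X$Y)
summary(X.imputed)   # the imputed columns should no longer contain NAs
```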
### <b>2.2 Random Split</b>
```{r}
library(caTools)
set.seed(1234)
spl = sample.split(X$Y, SplitRatio = 0.7)
TR = subset(X, spl == TRUE)
TS = subset(X, spl == FALSE)
```
Now that we have prepared the dataset, we need to split it into a training and testing set. To ensure everybody obtains the same split, set the same random seed right before splitting (here 1234), even if you already set a seed earlier.

【What the random seed does】
+ Different seed -> different splits.
+ Different SplitRatio -> different splits.
+ Re-run all steps -> same result: set a random seed, split, set the seed again to the same value, and then split again, and you will get the same split.
+ Re-run only the later steps -> different result: set the seed once and then split twice, and you will get different splits.

【Estimating the marginal effect from a regression coefficient】
+ ln(oddA) = ln(oddB) + 1.61
+ oddA = exp(1.61) * oddB
+ Rule: e^(a+b) = e^a * e^b.

Q: Application A has FICO credit score 700 while the borrower in Application B has FICO credit score 710. What is the value of Logit(A) - Logit(B)? What is the value of O(A)/O(B)?
```{r}
# the difference of logits
logitA = -0.00931679*700     # same coefficient, different x values
logitB = -0.00931679*710
logitA - logitB              # 0.09317

# the ratio of odds
oddA = exp(logitA)           # method 1
oddB = exp(logitB)
oddA/oddB                    # 1.098

exp(logitA - logitB)         # method 2: equivalently exp(0.09317)
```

### <b>2.3 Build the Model</b>
```{r}
modelLog = glm(Y ~ x1+x2+..., data = TR, family = binomial)
pred = predict(modelLog, type = "response")   # type="response" returns probabilities
```

【AUC on the training set】
```{r}
library(ROCR)
modelTR = glm(Y ~ x1+x2+..., data = TR, family = binomial)
PredTR = predict(modelTR, type = "response")
PredROCR = prediction(PredTR, TR$Y)
PerfROCR = performance(PredROCR, "tpr", "fpr")   # needed to draw the ROC curve
par(cex = 0.8)
plot(PerfROCR, colorize = TRUE, print.cutoffs.at = seq(0,1,0.1), text.adj = c(-0.2,1.7))   # ROC curve
# text.adj nudges the cutoff labels slightly off the curve so they do not overlap it
as.numeric(performance(PredROCR, "auc")@y.values)   # the actual AUC value
```

【AUC on the test set】
```{r}
library(ROCR)
modelTR = glm(Y ~ x1+x2+..., data = TR, family = binomial)
predTS = predict(modelTR, newdata = TS, type = "response")
predROCR = prediction(predTS, TS$Y)
as.numeric(performance(predROCR, "auc")@y.values)   # the actual AUC value
```

【Confusion Matrix】<br>
※ Built only after the trained model has been applied to the testing data.
```{r}
table(TS$Y, predTS >= 0.5)   # threshold = 0.5
```

### <b>2.4 Verify the Model</b>
【Accuracy: trained model vs. baseline】
※ ACC is computed from the confusion matrix.
```{r}
predTS = predict(modelTR, newdata = TS, type = "response")
table(TS$Y, predTS >= 0.45)          # confusion matrix (from the predictions and the TS data)
ACC = (TN+TP)/(TN+FP+FN+TP)

# baseline ACC: always predict the most frequent class
table(TS$Y)
ACC = max(table(TS$Y)) / nrow(TS)
```

【Sensitivity & Specificity】
```{r}
# Using the values from the confusion matrix: table(TS$Y, predTS >= 0.45)
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)
```
Different thresholds produce different confusion matrices, and each matrix carries a lot of information: accuracy, AUC, sensitivity, and specificity. Depending on the goal you are trying to reach, or how severe the consequences of an outcome are, you may arrive at different decisions. So even if a model cannot raise accuracy by much, it is still a good model if it lets you look at the problem from a different angle and supports better decisions.

### <b>2.5 Make the Decision</b>
```{r}
risk   = -100
action = -60
cost   = -10
ExpPayoff = TN*0 + TP*action + FN*risk + Total*cost   # Total = total number of cases
```
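A minimal sketch of how these payoffs might be combined with the confusion matrix from §2.4 (the threshold grid and the helper name `payoff` are illustrative assumptions; the formula mirrors the one above and assumes Y is coded 0/1):
```{r}
# Sketch only: score each threshold by expected payoff instead of raw accuracy,
# using predTS and TS$Y from section 2.4 and the risk/action/cost values above.
payoff = function(threshold) {
  cm = table(TS$Y, factor(predTS >= threshold, levels = c(FALSE, TRUE)))
  TP = cm["1", "TRUE"]; FN = cm["1", "FALSE"]; Total = sum(cm)
  TP*action + FN*risk + Total*cost    # TN contributes 0, as in the formula above
}
sapply(seq(0.1, 0.9, 0.1), payoff)    # compare payoffs across thresholds
```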
## <a id="n3"></a> <b>3. Decision Tree</b>

### <b>3.1 Build the Model</b>
【Continuous】
```{r}
library(dplyr)   # for the %>% pipe

# take the log of the price
ts$Average_next = log(ts$Average_next + 1)
tr$Average_next = log(tr$Average_next + 1)

library(rpart)
rpart1 = rpart(Average_next ~ Weekly_Gross + Theater_Count + Theater_Count_Change +
                 Average_Gross + Week_Count + Date,
               data = tr, method = "anova", minbucket = 25, na.action = "na.pass")

library(rpart.plot)
prp(rpart1)                    # plot the tree, simple version
rpart.plot(rpart1, cex = 0.6)  # plot the tree, fancier version

pred_rpart1tr = predict(rpart1, tr) %>% { exp(.) - 1 }
pred_rpart1ts = predict(rpart1, ts) %>% { exp(.) - 1 }

# transform the values back
ts$Average_next = exp(ts$Average_next) - 1
tr$Average_next = exp(tr$Average_next) - 1
```

【Categorical – two classes】
```{r}
library(rpart)
cart = rpart(Y ~ x1+x2+..., data = TR, method = "class", minbucket = 25)
# method="class": output classes; leave it out to get probabilities (and for regression)
# minbucket: smaller value -> bigger tree; larger value -> smaller tree
prp(cart)                    # plot the tree, simple version
library(rpart.plot)
rpart.plot(cart, cex = 0.6)  # plot the tree, fancier version
```

### <b>3.2 Model Engineering</b>
【Continuous】
+ MAE
```{r}
MAE = (1/nrow(ts)) * sum(abs(ts$Average_next - pred_rpart1ts))
```
+ RMSE
```{r}
SSE  = sum((ts$Average_next - pred_rpart1ts)^2)
RMSE = sqrt(SSE / nrow(ts))
```
+ R2
```{r}
SST = sum((ts$Average_next - mean(tr$Average_next))^2)
SSR = sum((pred_rpart1ts - mean(ts$Average_next))^2)
R2  = 1 - SSE/SST   # SSE from the RMSE block above
```

【Categorical – two classes】
+ ACC
```{r}
predCart = predict(cart, newdata = TS, type = "class")
table(predCart, TS$Y)
ACC = (TN+TP)/(TN+FP+FN+TP)
```
+ AUC
```{r}
library(ROCR)
predCart = predict(cart, newdata = TS)      # drop type="class" to get probabilities
head(predCart)                              # two columns: P(Y=0) and P(Y=1)
predROCR = prediction(predCart[,2], TS$Y)   # we only need the P(Y=1) column
perfROCR = performance(predROCR, "tpr", "fpr")
plot(perfROCR)
as.numeric(performance(predROCR, "auc")@y.values)
```
+ Compact formulas for AUC and ACC
```{r}
colAUC(pred, TS$a)    # colAUC() comes from caTools
table(TS$a, pred > 0.5) %>% { sum(diag(.)) / sum(.) }
```

### <b>3.3 Cross Validation & Parameter Tuning</b>
Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds.
```{r}
library(caret)
library(e1071)
numFolds = trainControl(method = "cv", number = 10)   # 10-fold CV
cpGrid = expand.grid(.cp = seq(0.01, 0.5, 0.01))      # parameter grid
cp = train(Y ~ ., data = TR, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
cp   # selecting cp by cross-validation
```

### <b>3.4 The Final Model</b>
【Categorical – two classes】
```{r}
modelCV = rpart(Y ~ ., data = TR, method = "class", cp = 0.18)   # cp from the cross-validation above
predCV = predict(modelCV, newdata = TS, type = "class")
table(TS$Y, predCV)
ACC = (TN+TP)/(TN+FP+FN+TP)
prp(modelCV)
```
Compared to the original accuracy using the default value of cp, this new CART model is an improvement, and so we should clearly favor this new model over the old one -- or should we? Plot the CART tree for this model. How many splits are there?
+ In some applications, such an improvement in accuracy would be worth the loss in interpretability.
+ In others, we may prefer a less accurate model that is simpler to understand and describe over a more accurate (but more complicated) model.
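Rather than copying the cp value by hand, a small sketch (using the `cp` object returned by caret's train() in §3.3) pulls the selected value out programmatically:
```{r}
# Sketch only: caret stores the cross-validated winner in cp$bestTune,
# so the final tree can reuse it instead of a hard-coded 0.18.
best.cp = cp$bestTune$cp
modelCV = rpart(Y ~ ., data = TR, method = "class", cp = best.cp)
prp(modelCV)   # check how many splits the selected tree has
```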
## <a id="n4"></a> <b>4. Random Forest</b>

### <b>4.1 Cross Validation</b>
【Categorical – multi-class】
```{r}
library(doParallel)
detectCores()
clust = makeCluster(detectCores())
registerDoParallel(clust)
getDoParWorkers()

t0 = Sys.time()
nfolds = 4
df = foreach(ntree = c(10,20), .combine = rbind) %:%
  foreach(mtry = c(4,5), .combine = rbind) %:%
  foreach(i = 1:nfolds, .combine = rbind, .packages = c('caTools', 'randomForest')) %dopar% {
    sx = sample.split(X$BUY_TYPE, (nfolds-1)/nfolds)   # stratified split on the target
    rf = randomForest(BUY_TYPE ~ ., X[sx, c(2:4,7:11,15,32:42)], ntree = ntree, mtry = mtry)
    pred = predict(rf, X[!sx, c(2:4,7:11,15,32:42)], type = "class")
    x = table(actual = X$BUY_TYPE[!sx], pred)
    acc = sum(diag(x)) / sum(x)
    data.frame(ntree = ntree, mtry = mtry, acc = acc)
  }
Sys.time() - t0
stopCluster(clust)
```

【Pick the best parameter combination】
```{r}
library(dplyr)
df %>% group_by(ntree, mtry) %>% summarise(acc = mean(acc)) %>% arrange(desc(acc))   # best on top
```

### <b>4.2 Build the Model</b>
【Categorical – two classes】
```{r}
library(randomForest)
Forest = randomForest(Y ~ x1+x2+..., data = TR, nodesize = 25, ntree = 200)
predForest = predict(Forest, newdata = TS, type = "class")
table(TS$Y, predForest)
ACC = (TN+TP)/(TN+FP+FN+TP)
```

## <a id="n5"></a> <b>5. Ensembling</b>

### <b>5.1 Compare ACC</b>
```{r}
# predicted probabilities from the models built earlier
px = cbind(glm = predglm, cart1 = pred1, cart2 = pred2, cart3 = pred3, cart4 = pred4,
           cart5 = pred5, cart6 = pred6, cart7 = pred7, rf = PredictForest)
apply(px, 2, function(x) {
  table(TS$buy, x > 0.5) %>% { sum(diag(.)) / sum(.) }
}) %>% sort
```

### <b>5.2 Compare AUC</b>
```{r}
par(cex = 0.7, mar = c(6,6,5,3))
colAUC(px, TS$buy, plotROC = TRUE)
```

### <b>5.3 Correlation & Ensembling</b>
+ Two highly correlated models are not necessarily the best combination.
+ Two very different models, combined, can instead help explain the problem from more angles.
```{r}
cor(px)
glm_cart = (px[,"glm"] + px[,"cart6"]) / 2
glm_rf   = (px[,"glm"] + px[,"rf"]) / 2
```

### <b>5.4 ACC & AUC Table</b>
```{r}
px2 = cbind(px, glm_cart, glm_rf)
rbind(
  apply(px2, 2, function(x) { table(TS$buy, x > 0.5) %>% { sum(diag(.)) / sum(.) } }),
  colAUC(px2, TS$buy)
) %>% t %>% data.frame %>% setNames(c("Accuracy", "AUC"))

# R-squared comparison for the regression models computed earlier
px3 = cbind(lm = RSquarelm, cart9 = RSquarerp)
rbind(px3) %>% data.frame %>% setNames(c("lm", "cart"))
```
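The simple averages in §5.3 weight both models equally; a minimal sketch (the weight `w` is an assumption to tune, not a recommendation) shows the same idea with unequal weights:
```{r}
# Sketch only: tilt the ensemble toward the stronger model with a weight w,
# then score it with the same accuracy and AUC formulas used above.
w = 0.7
glm_rf_w = w * px[, "glm"] + (1 - w) * px[, "rf"]
table(TS$buy, glm_rf_w > 0.5) %>% { sum(diag(.)) / sum(.) }   # accuracy
colAUC(glm_rf_w, TS$buy)                                      # AUC
```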
## <a id="n6"></a> <b>6. XGBoost</b>
+ Extremely accurate (widely used in today's competitions); a gradient-boosted tree ensemble, related to but distinct from random forests.
+ Can produce both continuous (regression) and categorical (classification) models.
+ Only handles numeric variables (categorical variables must be converted first).
+ Handles NAs automatically (you can feed it messy data).

### <b>6.1 Cleaning Data</b>
```{r}
library(xgboost)
names(TR)[-c(1:2)]                   # find the columns we do not need first
X = TR[, -c(1:2)]                    # drop them
fx = sapply(X, class) == "factor"    # find the categorical variables
X = cbind(X[, !fx], model.matrix.lm(~ . - 1, X[, fx], na.action = "na.pass"))
# convert the categorical variables into dummy variables (0 or 1)
# the -1 in the formula removes the intercept so every factor level gets its own dummy column
```

### <b>6.2 Cross Validation</b>
【Continuous】
```{r}
t0 = Sys.time()
A = Inf; df = data.frame()           # A holds the lowest RMSE seen so far
DM = xgb.DMatrix(data = data.matrix(X), label = TR$Y)
# X is the numeric matrix from 6.1 and TR$Y the outcome; for regression no "-1" is needed on the label
# tune the parameters on the training data first; once the accuracy is confirmed, fit the
# final model and only then feed in the testing data to generate predictions
for(g in c(0.2,0.3,0.4)) for(d in c(6,7))       # parameter tuning
  for(e in c(0.5,0.6,0.7)) {
    set.seed(1234)
    cv = xgb.cv(
      data = DM, nfold = 4, nrounds = 100, prediction = T,
      verbose = 0, early_stopping_rounds = 5,   # stop after 5 rounds without improvement
      params = list("objective" = "reg:linear",
                    "eval_metric" = "rmse",
                    "max_depth" = d, "eta" = e, "gamma" = g))
    r = cv$best_iteration
    rmse = cv$evaluation_log$test_rmse_mean[r]
    if(rmse < A) { A = rmse; save(cv, file = "C:/code/cv.rdata") }   # keep the run with the lowest RMSE
    s = sprintf("d=%d, e=%.3f, g=%.3f: rmse=%.5f, rnd=%d", d, e, g, rmse, r)
    # %d prints an integer; %.3f prints three decimal places
    print(s)
    df = rbind(df, data.frame(d, e, g, rmse, r))
    gc()
  }
Sys.time() - t0
```

【Categorical – multi-class】
```{r}
t0 = Sys.time()
A = 0.5; dfAy = data.frame()
DM = xgb.DMatrix(data = data.matrix(X[AX$train,]), label = AY$y - 1)
# these are the X and Y variables of the training data
for(g in c(0.2,0.5,0.8)) for(d in c(6,7)) for(e in c(0.2,0.3,0.4)) {
  set.seed(1234)
  cv = xgb.cv(
    data = DM, nfold = 8, nrounds = 320, prediction = T,
    verbose = 0, early_stopping_rounds = 5,
    params = list(
      "objective" = "multi:softprob",
      "eval_metric" = "mlogloss",
      "num_class" = 7,
      "max_depth" = d, "eta" = e, "gamma" = g))
  r = cv$best_iteration
  loss = cv$evaluation_log$test_mlogloss_mean[r]
  mx = table(as.integer(AY$y), max.col(cv$pred))
  acc = sum(diag(mx)) / sum(mx)                  # cross-validated accuracy
  if(acc > A) { A = acc; save(cv, file = "cvAy.rdata") }
  s = sprintf("d=%d, e=%.3f, g=%.3f: acc=%.5f, rnd=%d", d, e, g, acc, r)
  print(s)
  dfAy = rbind(dfAy, data.frame(d, e, g, acc, r))
  gc()
}
Sys.time() - t0
```

【Plot the results】
```{r}
library(ggplot2)
dfAy %>% mutate(g = as.factor(g)) %>%
  ggplot(aes(x = e, y = acc)) + theme_light() +
  geom_line(aes(colour = g), size = 1.2, alpha = 0.5) +
  geom_point(aes(colour = g), size = 1.5) +
  facet_wrap(~d, nrow = 1)
```

### <b>6.3 Build the Model</b>
【Continuous】
```{r}
t0 = Sys.time()
DM = xgb.DMatrix(data = data.matrix(X[AX$train,]), label = AY$y - 1)
# these are the X and Y variables of the training data
params = list("objective" = "reg:linear", "eval_metric" = "rmse",
              "max_depth" = 7, "eta" = 0.7, "gamma" = 0.5)
xgb = xgb.train(params, DM, nrounds = 100, verbose = 1)
# fitting only needs the parameters and the training DMatrix; the test data comes in at prediction time
Sys.time() - t0
save(xgb, file = "C:/code/xgbA.rdata")
```

【Categorical – multi-class】
```{r}
t0 = Sys.time()
DM = xgb.DMatrix(data = data.matrix(X[AX$train,]), label = AY$y - 1)
# these are the X and Y variables of the training data
params = list("objective" = "multi:softprob", "eval_metric" = "mlogloss",
              "num_class" = 7, "max_depth" = 6, "eta" = 0.4, "gamma" = 0.3)
xgb = xgb.train(params, DM, nrounds = 320, verbose = 1)
Sys.time() - t0
save(xgb, file = "data/xgb4b.rdata")
```

### <b>6.4 Predict Outcome</b>
【Continuous】
```{r}
predXGBts = predict(xgb, newdata = xgb.DMatrix(data.matrix(Xts)),   # predicting the test set
                    reshape = T) %>% { exp(.) - 1 }                 # undo the log(y+1) transform
```
+ MAE
```{r}
MAE = (1/nrow(ts)) * sum(abs(ts$Average_next - predXGBts))
```
+ RMSE
```{r}
SSE  = sum((ts$Average_next - predXGBts)^2)
RMSE = sqrt(SSE / nrow(ts))
```
+ R2
```{r}
SST = sum((ts$Average_next - mean(tr$Average_next))^2)
SSR = sum((predXGBts - mean(ts$Average_next))^2)
R2  = 1 - SSE/SST   # SSE from the RMSE block above
```
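The MAE / RMSE / R2 blocks above repeat the same arithmetic as §1.3 and §3.2; a small helper (the name `reg_metrics` is hypothetical) keeps the three numbers consistent:
```{r}
# Sketch only: compute MAE, RMSE, and out-of-sample R2 in one place.
reg_metrics = function(actual, pred, baseline = mean(actual)) {
  SSE = sum((actual - pred)^2)
  SST = sum((actual - baseline)^2)
  c(MAE  = mean(abs(actual - pred)),
    RMSE = sqrt(SSE / length(actual)),
    R2   = 1 - SSE / SST)
}
reg_metrics(ts$Average_next, predXGBts, baseline = mean(tr$Average_next))
```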
【Categorical – multi-class】
```{r}
predXGB = predict(xgb, xgb.DMatrix(data.matrix(X[!AX$train,])),   # predicting the test set
                  reshape = T)
types = levels(Y)[max.col(predXGB)]   # the predicted class label for each test row
```
+ ACC
```{r}
(mx = table(AY$y, max.col(cv$pred)))   # confusion matrix from the CV predictions
(acc = sum(diag(mx)) / sum(mx))
```

【Cross-Check the Proportion of BUY_TYPE】
```{r}
rbind(
  table(types) %>% prop.table,
  table(Y) %>% prop.table
) %>% round(4)
#        a      b      c      d      e      f      g
# [1,] 0.2746 0.0296 0.1115 0.2767 0.2749 0.0293 0.0034
# [2,] 0.2751 0.0295 0.1105 0.2749 0.2767 0.0296 0.0036

# total absolute difference between the two sets of proportions
abs(prop.table(table(types)) - prop.table(table(Y))) %>% sum
```
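The ACC block above scores the cross-validation predictions; a last sketch (assuming the hold-out labels are `Y[!AX$train]`, i.e. that Y is aligned with the rows indexed by AX$train) scores the final model on the test rows instead:
```{r}
# Sketch only: accuracy of the final xgb model on the hold-out rows.
pred.factor = factor(types, levels = levels(Y))           # keep the class order fixed
mx.ts = table(actual = Y[!AX$train], pred = pred.factor)
sum(diag(mx.ts)) / sum(mx.ts)
```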