Predict house price in R

# Predict house price in R

Technical documents
**[English](https://hackmd.io/s/r1R3MRkgQ)**
**[中文版](https://hackmd.io/s/SyFuVG7fm)**

[TOC]

---

## Installing R

----

:::success
**NOTE :**
```xml
$ sudo apt-get install r-base r-base-dev
```
installs an **older version of R**.
:::

----

**Solution:**

1. Get the signing key
```xml
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
```

----

2. Edit sources.list (you may want to back it up first)
```xml
$ sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu [Ubuntu codename]/'
```

----

:::warning
**NOTE :**
My Ubuntu version[^first] is 16.04, so the codename is **xenial**.

**Check your local version :**
```xml
$ lsb_release -a
```
:::

----

3. Install R
```xml
$ sudo apt-get update
$ sudo apt-get install r-base
```

----

4. Type R on the command line
```xml
$ R
```

----

5. Start using it :tada:

![](https://i.imgur.com/7kPe1gk.png)

----

:::warning
**NOTE:**
#### If the steps above do not work, try replacing step 2 with :
```xml
$ sudo sh -c 'echo "deb http://cran.csie.ntu.edu.tw/bin/linux/ubuntu [codename of your release]/" >> /etc/apt/sources.list'
```
:::

[Upgrading R from an old version to a new one](https://stackoverflow.com/questions/46214061/how-to-upgrade-r-in-linux)

---

## How to run an R file

----

### Hello world !
```xml
$ vim test.r
write 'print("hello world !")' and save it
$ Rscript test.r
```

![](https://i.imgur.com/K5mANUu.png)

:::warning
**Install vim**
```xml
$ sudo apt-get install vim
```
:::

### Installing an R package
In the R console, enter
```xml
> install.packages("package name")
```
If you are asked to choose a CRAN mirror, pick Taiwan!

![](https://i.imgur.com/Eb2FqoB.png)

The download then starts automatically :

![](https://i.imgur.com/mZy73HP.png)

---

Remember to load the package in your .r file with
require(package name) or library(package name)

:fire: **The package must be installed first** :fire:

----

**Mom, I know R now!**

---

## Data preprocessing
### What is data preprocessing?
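:::success
**ANS:**
**In business applications the discussion usually centers on the model**,
but the accuracy a model can deliver **has an upper limit**.
To raise that limit, **the main lever is preprocessing the data**.
When building a predictive model, ++**preprocessing usually takes about 80% of the time**++.
:::

### Commonly used preprocessing steps
+ Inspecting the data
+ Handling missing values
+ Handling outliers
+ Selecting features (dropping features)

> If the processed data carries more information, :fire: **the model has a better chance of reaching higher accuracy** :fire:

**Below we implement the stepwise method directly.**

### STEPWISE REGRESSION
The stepwise regression walkthrough uses kc_house_data.csv as the example.
```R
require(readr)
# Set the working directory to the current folder
setwd("/your/path/to/current/file")
# Import the dataset located in the same folder
out <- read_csv("kc_house_data.csv")
```
```R
# Check for missing values
any(is.na(out))
# There are none here. If there were, they would need handling too, which is not covered in this document.
```
![](https://i.imgur.com/PEtqOZl.png)
```R
# Take a quick look at the data
str(out)
```
![](https://i.imgur.com/O0YurrG.png)

:::info
Some irrelevant columns, and features with too many missing values, were removed beforehand: id, date, view, waterfront
- view, waterfront : could have been kept, but are excluded for now because they contain too many zeros
- id : transaction ID
- date : transaction date. It seems relevant, but it needs extra string handling, so it is left out for now.

:tada: Of course, it can be included once it has been cleaned up :tada:
:::

```R
# Make the random sampling reproducible
set.seed(18)
# Use 80% of the data for training
train.index <- sample( x = 1:nrow(out) ,size = ceiling(0.8*nrow(out)) )
train = out[train.index,]
test = out[-train.index,]
```
```R
# Define the lower and upper bounds of the model search
# price is the feature we want to predict
null = lm(price ~ 1,data = train)
full = lm(price ~ .,data = train)
```
```R
# Start training; this takes a few seconds
forward.lm = step(null, scope = list(lower=null,upper=full), direction = "forward")
# Both upper and lower must be set
```
You should see training output similar to the (partial) screenshot below.
![](https://i.imgur.com/WPWVjGa.png)
```R
#result
summary(forward.lm)
```
![](https://i.imgur.com/06MyYoj.png)

The red box on the left marks the selected features, and the right side shows how strong each relationship is. Picking out the features on the left completes feature selection with stepwise regression :100:

**Once the features are picked out, building the model is easy!**

:::info
**Will running stepwise again on the stepwise-selected data pick better features?**
**ANS : NOPE**
Stepwise regression has already removed the highly correlated features. Unless the random sampling is unlucky and draws an unusually structured subsample, the result will not change.
:::

---

## Predict Model
**This part focuses on building the prediction model** :100:

:::warning
The original plan was to predict with **KNN**, but the value we want to predict is numeric rather than categorical, so we use **Gradient Boosting** for numeric prediction instead.
:::

### Gradient Boosting
* Suppose you have a prediction model like the one below, with 80% accuracy. How do you raise its accuracy?
$$ Y = M(x) + error $$
* We find that the error term is not pure noise, so we study it further
$$ error = G(x) + error_2 $$
* Accuracy rises to 84%, which means the error term was still correlated with the y we are predicting
* Keep analyzing the error
$$ Y = M(x) + G(x) + H(x) + error_3 $$
* Finally, find the optimized weight for each term
$$ Y = \alpha \cdot M(x) + \beta \cdot G(x) + \gamma \cdot H(x) + error_4 $$

---

### GBM in practice
```R
# Remember to install the xgboost package first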
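# Optional guard, added as a sketch and not part of the original script:
# install xgboost only when it is missing. The mirror URL is an assumption;
# any CRAN mirror (for example the Taiwan one) can be used instead.
if (!requireNamespace("xgboost", quietly = TRUE)) {
  install.packages("xgboost", repos = "https://cloud.r-project.org")
}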
require(xgboost)
set.seed(3)
# `feature` is not defined in this document; it is assumed to be the data frame
# holding the selected features (columns 1 to 8) together with the price column
train.index <- sample(x=1:nrow(feature), size=ceiling(0.8*nrow(feature) ))
# Split the data into a training set and a test set
train = feature[train.index, ]
test = feature[-train.index, ]
```
```R
dtrain = xgb.DMatrix(data = as.matrix(train[,1:8]),label = train$price)
dtest = xgb.DMatrix(data = as.matrix(test[,1:8]),label = test$price)
xgb.params = list(
  colsample_bytree = 0.5,
  subsample = 0.5,
  booster = "gbtree",
  max_depth = 2,
  eta = 0.03,
  # 'mae' also works here
  eval_metric = "rmse",
  # newer xgboost versions call this objective "reg:squarederror"
  objective = "reg:linear",
  gamma = 0) #0->-1
```
```R
cv.model = xgb.cv(
  params = xgb.params,
  data = dtrain,
  nfold = 5,
  nrounds=200,
  early_stopping_rounds = 30,
  print_every_n = 20
)
tmp = cv.model$evaluation_log
```
```R
plot(x=1:nrow(tmp), y= tmp$train_rmse_mean, col='red', xlab="nround", ylab="rmse", main="Avg.Performance in CV")
points(x=1:nrow(tmp), y= tmp$test_rmse_mean, col='blue')
legend("topright", pch=1, col = c("red", "blue"), legend = c("Train", "Validation") )
best.nrounds = cv.model$best_iteration
#best.nrounds
# Note: the argument is `params`, not `paras`
xgb.model = xgb.train(params = xgb.params, data = dtrain, nrounds = best.nrounds)
xgb_y = predict(xgb.model, dtest)
```
* Next, visualize the predictions with lattice to check them
```R
# Inspect the first 100 rows
x = c(1:100)
y1 = test$price[1:100]
y2 = xgb_y[1:100]
df1 <- data.frame(x,y1,y2)
df1c = df1[order(df1$y1),]
# Write the plot to output.png
library(lattice)
png("output.png",width = 640,height = 360)
# print() makes sure the lattice plot is drawn even when the script is run via source()
print(xyplot(y1 + y2 ~ x, df1, type = "l"))
dev.off()
```
* Finally, open output.png

![](https://i.imgur.com/guEG93D.png)

**The predictions follow the actual price trend quite closely.**

:::info
**Note:**
**xgboost requires R 3.3.0 or later.** If you need to upgrade R, see the R installation section earlier in this document.
:::

---

## GITHUB
[R_predict](https://github.com/oowen/R_predict/tree/master/R_predict)

## Reference
[龍崗山上的倉鼠](http://kanchengzxdfgcv.blogspot.tw/2016/03/r-by-ubuntu-linux.html)

[R筆記 – (18) Subsets & Shrinkage Regression (Stepwise & Lasso)](http://rpubs.com/skydome20/R-Note18-Subsets_Shrinkage_Methods)

[R：學習Gradient Boosting算法，提高預測模型準確率](https://read01.com/zh-tw/amdPKx.html#.WzWCc-EzbaU)

[R筆記 – (16) Ensemble Learning(集成學習)](http://rpubs.com/skydome20/R-Note16-Ensemble_Learning)