# Predict house price in R
Technical documents
**[English](https://hackmd.io/s/r1R3MRkgQ)**
**[Chinese version](https://hackmd.io/s/SyFuVG7fm)**
[TOC]
---
## install R
----
:::success
**NOTE :**
```bash
$ sudo apt-get install r-base r-base-dev
```
will install an **older version of R**.
:::
----
**Solution:**
1. Get the key
```bash
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
```
----
2. Modify sources.list
```bash
$ sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu/[ubuntu version name]/'
```
----
:::warning
**NOTE :**
My Ubuntu version is 16.04, so the version name is **xenial**.
**Check your version:**
```bash
$ lsb_release -a
```
:::
----
3. Install R
```bash
$ sudo apt-get install r-base
```
----
4. Enter 'R' in the command line
```bash
$ R
```
----
5. Start using :tada:

----
:::warning
**NOTE:**
#### If the above steps fail, try replacing step 2 with:
```bash
$ sudo sh -c 'echo "deb http://cran.csie.ntu.edu.tw/bin/linux/ubuntu [Version name]/" >> /etc/apt/sources.list'
```
:::
[How to upgrade R from an old version to the latest version](https://stackoverflow.com/questions/46214061/how-to-upgrade-r-in-linux)
---
## How to execute an R file
----
### Hello world !
```bash
$ vim test.r
# write print("hello world!") in the file and save it
$ Rscript test.r
```

:::warning
**Install vim**
```bash
$ sudo apt-get install vim
```
:::
### Install R packages
In the R console, enter:
```R
> install.packages("Package name")
```
If the system asks you to select a CRAN mirror, choose Taiwan (or any mirror near you).

The download then starts.

---
Remember to load the package in your `.r` file with
`require(package name)` or `library(package name)`
:fire: **Be sure to install the packages first** :fire:
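For example, a minimal pattern for a script (shown here with the readr package used later in this guide):
```R
# Install the package if it is missing (only needs to happen once), then load it
if (!requireNamespace("readr", quietly = TRUE)) {
  install.packages("readr")
}
library(readr)
```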
----
**Mom, I have learned R!**
---
## PRE-PROCESSING
### What is data preprocessing?
:::success
**ANS:**
**In enterprise applications, the model itself is usually the focus of discussion,**
but its accuracy **has a certain upper limit**.
To raise that ceiling, **the most important thing is to start with data preprocessing**.
When building prediction models, **preprocessing usually takes about 80% of the time.**
:::
### Preprocessing method
Frequently used steps:
+ Check the data
+ Handle missing values
+ Handle outliers (a rough sketch follows this list)
+ Select features (drop irrelevant ones)
> If the processed data can carry more information,
> :fire: **there is a greater chance of improving model accuracy** :fire:
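As the rough sketch for the outlier step, one common approach is to cap values outside 1.5 × IQR (here `df` and `x` are hypothetical placeholders, not columns of this dataset):
```R
# Cap values of a numeric column x at 1.5 * IQR beyond the quartiles
q <- quantile(df$x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
df$x <- pmin(pmax(df$x, q[1] - 1.5 * iqr), q[2] + 1.5 * iqr)
```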
**The following implements the stepwise method directly.**
### STEPWISE REGRESSION
Take kc_house_data.csv as an example
```R
require(readr)
#Set workspace as current folder
setwd("/your/path/to/current/file")
#Import datasets in the same folder
out <- read_csv("kc_house_data.csv")
```
```R
#Check for missing values
any(is.na(out))
#No missing values here; if there were, we would handle them first (a sketch follows)
```
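If `any(is.na(out))` had returned `TRUE`, a minimal sketch of one common fix (median imputation for numeric columns) could look like this:
```R
# Replace missing numeric values with the column median (only needed if NAs exist)
for (col in names(out)) {
  if (is.numeric(out[[col]]) && any(is.na(out[[col]]))) {
    out[[col]][is.na(out[[col]])] <- median(out[[col]], na.rm = TRUE)
  }
}
```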

```R
#Inspect the structure of the data
str(out)
```

:::info
First remove some irrelevant columns,
such as: id, date, view, waterfront
- view, waterfront: not strictly irrelevant, but excluded because most of their values are 0
- id: the transaction number
- date: the transaction date; it feels relevant, but we skip it to avoid extra handling of character strings
:tada: Of course, if you have already processed it, it can also be considered. :tada:
:::
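Based on the note above, a minimal sketch of dropping those columns (assuming the standard kc_house_data.csv column names):
```R
# Drop the columns we decided not to use
out <- out[, !(names(out) %in% c("id", "date", "view", "waterfront"))]
str(out)
```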
```R
#Fix the random seed so the sampling is reproducible
set.seed(18)
#Take 80% of the data for training
train.index <- sample(x = 1:nrow(out), size = ceiling(0.8 * nrow(out)))
train = out[train.index,]
test = out[-train.index,]
```
```R
#Set the lower and upper bounds of the model search
#price is the variable we want to predict
null = lm(price ~ 1,data = train)
full = lm(price ~ .,data = train)
```
```R
# Start training; this takes a few seconds
forward.lm = step(null,
                  scope = list(lower = null, upper = full),
                  direction = "forward")
# upper and lower must both be set
```
* The step() call prints the forward-selection progress step by step (output omitted here)

```R
#result
summary(forward.lm)
```

* The summary output lists the selected features on the left and the strength of their relationship with price on the right
* Just pick out those features and the stepwise-regression feature selection is complete :100:
**Once you have picked out the features, modelling is easy!** One way to extract them programmatically is sketched below.
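A sketch of picking them out in code (the exact variables selected depend on your run; `feature` is the data frame used in the modelling section below):
```R
# Names of the variables chosen by the forward selection
selected <- attr(terms(forward.lm), "term.labels")
selected
# Keep only the selected features plus the target column
feature <- out[, c(selected, "price")]
```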
:::info
**Will the data be better after stepwise regression picks out the features?**
**ANS : NOPE**
Stepwise regression has already removed the highly correlated features; unless you are unlucky enough to draw a random subsample with a particularly strange structure, the results will barely change.
:::
---
## Predict Model
**This part mainly covers building the prediction model.** :100:
:::warning
In this case we wanted to use **KNN**, but the value we want to predict is numerical rather than categorical, so we use **Gradient Boosting** for numerical prediction instead.
:::
### Gradient Boosting
* Suppose you have a prediction model as follows, with an accuracy of 80%. How can you improve its accuracy?
$$
Y = M(x) + error_1
$$
* We find that the error term is not just random noise; part of it can be modelled by a second model:
$$
error_1 = G(x) + error_2
$$
* The accuracy now rises to 84%; in other words, the error term was still related to the target we are predicting.
* Continue to decompose the remaining error:
$$
Y = M(x) + G(x) + H(x) + error_3
$$
* Finally, find the optimal weights for combining the models:
$$
Y = \alpha \cdot M(x) + \beta \cdot G(x) + \gamma \cdot H(x) + error_4
$$
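The same idea in a tiny R sketch: fit a first model, then fit a second model to its residuals and add the two predictions. This is only a toy illustration of the boosting idea, not the actual xgboost algorithm, and it assumes the `sqft_living` and `grade` columns from kc_house_data.csv are present in `train`:
```R
# Step 1: a first model M(x)
m1 <- lm(price ~ sqft_living, data = train)
# Step 2: model the remaining error with a second model G(x)
res <- train$price - predict(m1, train)
m2 <- lm(res ~ grade, data = train)
# Combined prediction: M(x) + G(x)
combined <- predict(m1, train) + predict(m2, train)
```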
---
### Actual operation : GBM
```R
# Install the xgboost package first (install.packages("xgboost"))
require(xgboost)
set.seed(3)
# `feature` holds the stepwise-selected features plus price (built above)
train.index <- sample(x = 1:nrow(feature), size = ceiling(0.8 * nrow(feature)))
# Divide the data into training and test sets
train = feature[train.index, ]
test = feature[-train.index, ]
```
```R
# Columns 1-8 are the feature columns; price is the label
dtrain = xgb.DMatrix(data = as.matrix(train[, 1:8]), label = train$price)
dtest = xgb.DMatrix(data = as.matrix(test[, 1:8]), label = test$price)
xgb.params = list(
  colsample_bytree = 0.5,
  subsample = 0.5,
  booster = "gbtree",
  max_depth = 2,
  eta = 0.03,
  # 'mae' can also be used as the evaluation metric
  eval_metric = "rmse",
  objective = "reg:linear",
  gamma = 0)
```
```R
cv.model = xgb.cv(
  params = xgb.params,
  data = dtrain,
  nfold = 5,
  nrounds = 200,
  early_stopping_rounds = 30,
  print_every_n = 20
)
tmp = cv.model$evaluation_log
```
```R
plot(x=1:nrow(tmp), y= tmp$train_rmse_mean, col='red', xlab="nround", ylab="rmse", main="Avg.Performance in CV")
points(x=1:nrow(tmp), y= tmp$test_rmse_mean, col='blue')
legend("topright", pch=1, col = c("red", "blue"),
legend = c("Train", "Validation") )
best.nrounds = cv.model$best_iteration
#best.nrounds
xgb.model = xgb.train(paras = xgb.params,
data = dtrain,
nrounds = best.nrounds)
xgb_y = predict(xgb.model, dtest)
```
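Optionally, you can check which of the eight features drive the boosted model; a sketch using xgboost's importance helpers:
```R
# Rank the features by their contribution to the model
importance <- xgb.importance(feature_names = colnames(train[, 1:8]), model = xgb.model)
print(importance)
xgb.plot.importance(importance)
```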
* Then use the lattice package to visualize the predictions against the test data
```R
# Check the first 100 records
x = c(1:100)
y1 = test$price[1:100]
y2 = xgb_y[1:100]
df1 <- data.frame(x, y1, y2)
# df1 sorted by the actual price (not used in the plot below)
df1c = df1[order(df1$y1), ]
# Write the plot to output.png
library(lattice)
png("output.png", width = 640, height = 360)
xyplot(y1 + y2 ~ x, df1, type = "l")
dev.off()
```
* Finally open output.png

**You can see that the predicted trend tracks the actual prices fairly closely**
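To put a number on it, a quick sketch of the test-set RMSE (the same metric used during cross-validation):
```R
# Root mean squared error on the hold-out set
rmse <- sqrt(mean((test$price - xgb_y)^2))
rmse
```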
:::info
**Note:**
**xgboost requires R version 3.3.0 or later.** If you need to upgrade R, refer to the 'install R' section above.
:::
---
## GITHUB
[R_predict](https://github.com/oowen/R_predict/tree/master/R_predict)
## Reference
[龍崗山上的倉鼠](http://kanchengzxdfgcv.blogspot.tw/2016/03/r-by-ubuntu-linux.html)
[Rpubs – (18) Subsets & Shrinkage Regression (Stepwise & Lasso)](http://rpubs.com/skydome20/R-Note18-Subsets_Shrinkage_Methods)
[R:Gradient Boosting](https://read01.com/zh-tw/amdPKx.html#.WzWCc-EzbaU)
[Rpubs – (16) Ensemble Learning](http://rpubs.com/skydome20/R-Note16-Ensemble_Learning)