Ch10 Random Forest

--- tags: 機器學習技法 header-includes: - \usepackage[ruled,vlined,linesnumbered]{algorithm2e} --- Ch10 Random Forest === ## Content [TOC] ## [Slides & Videos](https://www.csie.ntu.edu.tw/~htlin/mooc/) ## [Random Forest Algorithm](https://www.coursera.org/learn/machine-learning-techniques/lecture/YnV6g/random-forest-algorithm) ### Recall: Bagging and Decision Tree - Bagging 可以降低 variance - Decision Tree 的 variance 大 (資料變一點點，樹就變很多，因為切的刀會不一樣) - aggregation of aggregation !!! ### Random Forest (RF) - random forest = bagging + fully-grown C\&RT decision tree - C\&RT 的好處都能夠被 random forest 繼承下來 - 能夠處理 categorical features - 能夠處理 missing features - C\&RT 有很大的 variance、容易 overfit，被 random forest 緩解 ### Diversifying by Feature Projection - 如何能有更不一樣的 function 呢 - 可以 sample $d'$ 個 features，可以視為某種 feature transform (projection 到低維空間) - 轉換後的空間 $\mathcal Z\in\mathbb R^{d'}$ 其實是原本的 feature 空間 $\mathcal X\in\mathbb R^d$ 的一個 random subspace - > ***Q: 該複習一下線代ㄌQQ*** - RF 原作者建議 tree 在長每個 branch 的時候都可以重新 sample 一次 features，這樣每棵樹會更不一樣 ### Diversifying by Feature Expansion - 那既然新的 space 是原特徵空間的 subspace，那就可以把 feature transform $\phi(x)$ 看成是把 feature $x$ 乘上一個投影矩陣 $P$，這個矩陣 $P$ 的每個 row 都是一個 natural basis (standard basis，某 element 為 1，其他為 0) - 這樣就可以看成是每一個 row $p_i$ 有一個固定的方向，但是其實也可以不需要是 natural basis (單一的方向)呀 - 因此，也可以是將 feature 做 combination 之後讓 branch 做切分。 - 也就是說 $p_i$ 不一定是 [0,0,1,0] 這種 vector，可能是 [1,3,4,2] - 但其實也不需要考量所有 feature，$p_i$ 可以只考慮少數 features，例如 [0,0,...,0,1,2,0] - **RF = bagging + random combination C\&RT** ## [Out-Of-Bag Estimate](https://www.coursera.org/learn/machine-learning-techniques/lecture/MoJQo/out-of-bag-estimate) ### Bagging Revisited - resample 來訓練 $g_t$ 的 data 總會有一些是沒被 sample 到的 - 這些就被稱作 out-of-bag (OOB) examples of $g_t$ ### Number of OOB Examples - 當 $N'=N$ - 某筆 data 沒被選到的機率是 $(1-\frac 1 N)^N$ - $(1-\frac 1 N)^N=\frac{1}{(\frac{N}{N-1})^N}=\frac{1}{(1+\frac{1}{N-1})^N}$ - 當 $N$ 很大，分母會趨近 $e$，因此該項趨近 $\frac 1 e$ - > 印象中最原本應該是 $e=\lim_\limits{N\rightarrow\infty}(1+\frac 1 N)^N$? - 因此 OOB 的資料約佔總 data 的 $\frac 1 e$，即 $\frac 1 e N$ ### OOB versus Validation - 要用 OOB 來 validate $g_t$ 嗎? 其實不太需要，因為我們不要求 $g_t$ 很強，只要結合後的 $G$ 很強就好 - 那用 OOB 來 validate $G$? - (看 slide 較清楚) 第 $n$ 筆 data 可以用來衡量「由它作為 OOB data 時建構出的各 $g_t$ 們」平均得到的 $G_n^-$ - 於是 **OOB error** $E_{oob}(G)=\frac 1 N \sum_{n=1}^N\textrm{err}(y_n,G_n^-(x_n))$ - 很類似 leave-one-out cross-validation ### Model Selection by OOB Error - 和一般 validation 相比 - 是 self-validation，不需要重新訓練來選 hyperparameter - 例如 $d''$ (feature transform 所需要的 feature 數量) ## [Feature Selection](https://www.coursera.org/learn/machine-learning-techniques/lecture/ogflJ/feature-selection) ### Feature Selection - 想要移除 - redundant features - irrelevant features (與要預測的目標無關) - 好處 - 後續可以得到更簡單的 hypothesis 以及更快速的 prediction time - feature noise 被移除了 - 有更好的可解釋性 - 壞處 - 在選 feature 的過程很耗費計算資源 (例如 10000維要選 300維，可能要找遍 $C^{10}_3$ 種組合 - 選 feature 可能會 overfit (組合太多) - 若是 overfit，則解釋出來的東西可能錯誤。又或者解釋出來的只有關聯性，而不是因果關係 - > Q: 可是因果關係本來就很難直接從 data 當中得到吧? - 因此 feature selection 階段的責任其實很重 - decision tree 是少數 built-in feature selection 的 model ### Feature Selection by Importance - 只考慮 importance，挑 top-$d''$ 個 features - importance by linear model - 直接用 $|w_i|$ 選 - non-linear model 通常就比較難衡量 ### Feature Importance by Permutation Test - 如果第 $i$ 個 feature 很重要，那麼把 $x_{n,i}$ (第 $n$ 筆資料的第 $i$ 個 feature) random 替換掉的話應該會損害 model performance - 要用怎樣的 random values 來替換本來的 feature 呢? - uniform / Gaussian ? 可是這樣會改變原來的 distribution $P(x_i)$ - 用 bootstrap 的方式 re-sample 該 feature - 或者用 permutation 的方式直接重新排列該 feature 在 data 中的順序 !!! - permutation test - $\textrm{importance}(i)=\textrm{performance}(\mathcal D) - \textrm{performance}(\mathcal D^{(p)})$ where $\mathcal D^{(p)}$ is $\mathcal D$ with $\{x_{n,i}\}$ replaced by permuted $\{x_{n,i}\}_{n=1}^N$ - 常用在各種 non-linear model 例如 RF ### Feature Importance in Original Random Forest - 就直接在 OOB data 上做 permutation - RF solution: $\textrm{importance}(i)=E_{oob}(G)-E_{oob}^{(p)}(G)$ - RF feature selection via **permutation + OOB** ## [Random Forest in Action](https://www.coursera.org/learn/machine-learning-techniques/lecture/5N9Q7/random-forest-in-action) ### A Simple Dataset - (看 slide 較清楚) - RF 隨著 tree 越多，越能做出類似 large-margin 的效果。 ### How Many Trees Needed? - the more, the better - RF 的缺點：他是一個有 randomness 的演算法 - 因此當這個隨機性的結果還不夠穩定 (例如少一棵樹就有差)，就應該增加 tree 的數量