Ch7 Blending and Bagging

--- tags: 機器學習技法 --- Ch7 Blending and Bagging === ## Content [TOC] ## [Slides & Videos](https://www.csie.ntu.edu.tw/~htlin/mooc/) ## Roadmap ![](https://i.imgur.com/M3GjBZD.png) ## [Motivation of Aggregation](https://www.youtube.com/watch?v=mjUKsp0MvMI&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=27) ### An Aggregation Story 假設有 $T$ 個朋友在預測股票是否會漲，那麼要怎麼利用這些朋友的預測呢？ ![](https://i.imgur.com/dDjGdZY.png) - **select**：挑一個最會的朋友，照他說的就對ㄌ - 其實這種就是 validation - **uniformly** mix：大家投票，每人票數相同 - **non-uniformly** mix：大家投票，每人票數不一定相同 - **conditionally** combine：如果第 $t$ 個人符合某些 condition，就讓他參與投票，每人票數不一定相同 ### Aggregation with Math Notations ![](https://i.imgur.com/vxZyYS6.png) - **select** - $G(x)=g_{t_*}(x)$ with $t_*=\arg\min_{t\in\{1,2,...,T\}}E_{val}(g_t^-)$ - $g^-$ 的意義可以參考 [機器學習基石 Ch15](https://hackmd.io/@johnnyasd12/Sk9V-hwAH#Validation-Set-D_val) - **uniformly** - $G(x)=\text{sign}(\sum_{t=1}^T1\cdot g_t(x))$ - **non-uniformly** - $G(x)=\text{sign}(\sum_{t=1}^T\alpha_t\cdot g_t(x))$ with $\alpha_t\geq 0$ - 這個就包含了 **select、uniformly** - select：$\alpha_t=1[\![\text{smallest }E_{val}(g_t^-)]\!]$ - uniformly：$\alpha_t=1$ - **conditionally** - $G(x)=\text{sign}(\sum_{t=1}^Tq_t(x)\cdot g_t(x))$ with $q_t(x)\geq 0$ - **寫作 $q_t(x)$ 因為這個 weight 會受到 $x$ 所影響** - 這個就包含了 **non-uniformly** (所以也包含 uniformly、select) - non-uniformly： $q_t(x)=\alpha_t$ ### Recall: Selection by Validation ![](https://i.imgur.com/20MCrZ8.png) - 利用 **selection** by validation 會得到好的 $E_{out}$ 的前提是：有個很強的 $g_t^-$ 使得 $E_{val}$ 足夠小 - 但是 aggregation 是說，雖然很多 hypotheses 都很 weak，那能不能結合起來做得更好？ ### Why Might Aggregation Work? ![](https://i.imgur.com/WvVetjb.png) - aggregation 的效果，以 uniformly mix 來說 - 讓 hypotheses 變得更 strong，所以有 **feature transform** 的效果 - 讓 hypothesis 的結果變得比較中庸，像是在選 fattest boundary 一樣，所以有 **regularization** 的效果 - 很特別的是，aggregation 居然可以同時有兩個不同面向的效果 - > Q: 真的可以同時變複雜 & regularization 嗎@@ 圖上畢竟是兩種不同的 model 做 aggregation 哪，應該是某些情況會變複雜，另一些情況可以 regularization?? - > Q: 這樣的話是哪些情況會增加 complexity，哪些會是 regularization?? > A?: 看圖感覺就是 overfitting 的時候可以達到 regularization；underfitting 的時候可以提升 complexity (吧 ### Fun Time: Aggregation ![](https://i.imgur.com/IrjX1wz.png) - 以 x=-1 和 x=1 為邊界分開討論「投票結果」即可 - 從這結論可以看出，**aggregation 的確可以讓 model 變得更複雜。** ## [Uniform Blending](https://www.youtube.com/watch?v=DAFkKJYTMW4&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=27) ### Uniform Blending (Voting) for Classification ![](https://i.imgur.com/fXZTpae.png) - uniform blending - 所謂 **blending**，就是說 $g_t$ 已經知道了 - 所謂 **uniform**，就是說每個 $g_t$ 恰有一票 - 當大家的 prediction 很不一樣時，多數可以「修正」少數的 prediction (這種方法的假設是「多數會是正確的」) - **multiclass classification** 也可以這樣做，不過作法就變成說「**計算哪個 class 得到的票數較多**」了 - $G(x)=\arg\max_\limits{1\leq k\leq K}\sum_\limits{t=1}^T1[\![g_t(x)=k]\!]$ - 那 regression 要怎麼 uniform blending？ ### Uniform Blending for Regression ![](https://i.imgur.com/G6gJuMu.png) - 要除以 $T$ 做平均 - 若大家 prediction 不一樣，那有的可能低估 $f(x)$，有的可能高估，那麼做平均可能就較接近真正的 $f(x)$。 - 前提也同樣是 **diverse** hypotheses - even simple uniform blending can be better than any single hypothesis - > Q: 這句不太認同欸，better than **any** simple hypothesis??? > 根據下一頁，應該要寫成 **better than average of all simple hypothesis** 吧@@ ### Theoretical Analysis of Uniform Blending (for Regression) ![](https://i.imgur.com/pMv0LBo.png) - 這兩個區塊的 $x$ 是 fixed 的，所以先省略 - 中間區塊的 $avg(\cdot)$ 用來代表 $\frac 1T\sum_{t=1}^T[\cdot]$ - 這裡的證明要注意 - row 2: 利用 $avg(g_t)=G$ 做代換 - row 5: 利用 $G=avg(g_t)$ 做代換 - 最後終於得證 $avg(g_t-f)^2=avg((g_t-G)^2)+(G-f)^2$ - 把固定的 $x$ 推廣到 $E_{out}$ 可以得到 $avg(E_{out}(g_t))=avg(\mathbb E(g_t-G)^2)+E_{out}(G)\geq E_{out}(G)$ - > Q: 這裡的 $avg$ 跟上面一樣ㄇ @@ > 好像一樣 ### Some Special $g_t$ (Bias-Variance Decomposition?) ![](https://i.imgur.com/90zOFev.png) - 假設現在有一種虛擬的程序，執行 $T$ 次 (說它虛擬因為的確實務上不太可能這樣) 1. 每次我們都從 data distribution $P$ 中 sample 出**不同的** $N$ 筆資料 $D_t$ 2. 利用演算法 $\mathcal A$ 來得到一個 $g_t$ - 這樣執行完 $T$ 次，我們就會得到 $T$ 個 $g_t$ - $\bar g=\lim_\limits{T\rightarrow\infty}G=\lim_\limits{T\rightarrow\infty}\frac 1T\sum_\limits{t=1}^Tg_t=\mathbb E_\mathcal D\mathcal A(\mathcal D)$ - 意思是說，當 $T$ 趨近無窮大，我們就可以知道這個演算法 $\mathcal A$ 在任何 $N$ 筆資料 $\mathcal D\sim P$ 上的 performance 期望值 - 那就來把上個 slide 的 $G$ 變成無窮多次的 blending，變成 $\bar g$ 然後再看一次式子ㄅ - $avg(E_{out}(g_t))=avg(\mathbb E(g_t-\bar g)^2)+E_{out}(\bar g)$ - $avg(E_{out}(g_t))$ 演算法的表現的期望值 - $\bar g$ 又稱 **consensus**(共識) - $E_{out}(\bar g)$ 又稱 **bias**，是 consensus 的 performance - $avg(\mathbb E(g_t-\bar g)^2)$ expected deviation to consensus，又稱 **variance** - > Q: 這裡的 $\mathbb E$ 是指 $\mathbb E_{x,y\in\mathcal D_t}$ 嗎？ - 所以**演算法的 expected performance(error) = bias + variance** - 又稱 **bias-variance decomposition** - **uniform blending 其實就是在減少 variance** - > Q: 因為這樣 $g_t$ 會更接近 $\bar g$ 嗎？ ### Fun Time: Uniform Blending ![](https://i.imgur.com/7xn2pL6.png) - linear model 的 blending 還是 linear model。 ## [Linear and Any Blending](https://www.youtube.com/watch?v=i03s1g7X_m4&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=28) ### Linear Blending ![](https://i.imgur.com/0U5nIPg.png) - 每個 $g_t$ 給不同的票數比例 $\alpha_t\geq 0$ - 優化 $\alpha\in\mathbb R^T$，即 $\min_\limits{\alpha_t\geq 0}E_{in}(\alpha)$ - linear blending for regression 可以視為 linear regression + transformation，只要把 hypothes**es** 視為 **feature transform** 即可 - 所以其實 linear blending 跟之前的 two-level learning, probabilistic SVM 很像 - **linear blending = LinModel + hypotheses as transform + constraints** - 可是多了 constraints 要怎麼解? ### Constraints on $\alpha_t$ ![](https://i.imgur.com/gnI2xTT.png) - $\alpha_t\geq 0$ 這樣的 constraints 其實可以去掉 - 因為如果 $\alpha_t<0$ 那其實就是把 classifier 反過來用而已，效果不一定會比較差。 - 所以實務上不會有這樣的 constraints ### Linear Blending versus Selection ![](https://i.imgur.com/G8dyn2z.png) - 實務上，$g_1,...,g_T$ 通常是從不同的 hypotheses set 得到的 optimal solution of $E_{in}$ - 之前說過，如果用 $E_{in}$ 來做 selection，那要付出的代價就是 $d_{VC}(\bigcup_\limits{t=1}^T\mathcal H_t)$ - 而 linear blending 就已經**包含了 selection 的做法**，而且是用 aggregation 的方式結合各種 hypotheses set，所以**付出的代價是會 $\geq d_{VC}(\bigcup_\limits{t=1}^T\mathcal H_t)$** - **所以我們通常用 $E_{val}$ 代替 $E_{in}$ 來做 blending，且為了讓 $E$ 跟 $g$ 互相獨立，會使用 $g^-$ 來評估，而不是 $g$** - > Q: 之前有說過為何要互相獨立ㄇ @@? ### Any Blending ![](https://i.imgur.com/EsnkxZK.png) 1. 用 $\mathcal D_{train}$ 得到一堆 $g_t^-$，當成 feature transform function $\phi^-$ 2. 把 $\phi^-$ 用在 $\mathcal D_{val}$ 上，訓練出一個新的 model $\tilde g$ 3. 要 predict 的時候使用的 hypothesis 是 $G_{ANYB}(x) = \tilde g(\phi(x))$ - 這裡注意是用 $\phi(x)$ 而不是 $\phi^-(x)$ - **any blending = stacking** - **powerful**(non-linear), achieves **conditional blending** - > Q: 思考為何 any blending 是 conditional blending - danger of **overfitting** ### Blending in Practice ![](https://i.imgur.com/d0AXCsG.png) - 台大 2011 KDDCUP 世界冠軍 - 在 validation set 做 any blending 得到 $G$ - 在特別的 test set 做 linear blending (of $g$ and $G$) - > Q:不太清楚這邊的 test set 是指什麼 ### Fun Time: Blending as Transform ![](https://i.imgur.com/FGjTPUK.png) ## [Bagging (Bootstrap Aggregation)](https://www.youtube.com/watch?v=3T1mdvzRAF0&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=29) ### What We Have Done ![](https://i.imgur.com/yG6KEGH.png) - **blending：已經得到 $g_t$ 之後要怎麼結合** - 那能不能「一邊學 $g_t$ 一邊學怎麼結合」？ - 之前說過 aggregation 在 $g_t$ 之間不一樣的時候可以做得好，那要怎麼讓 $g_t$ 不一樣? - 不同 **model** - 不同 **(hyper)parameter** - algorithmic 的隨機性 - data 的隨機性 - 像是結合 cross-validation 的不同 $g_v^-$ - 接下來要講的就是**利用data 的隨機性，「拿同一份資料」來做出不同的 $g_t$** ### Revisit of Bias-Variance ![](https://i.imgur.com/W3R1qqF.png) - [之前](https://hackmd.io/@johnnyasd12/SkxlsunxU#Some-Special-g_t-Bias-Variance-Decomposition)證明，不同的 data 訓練出來的 $g_t$ 才可以得出結論說，consensus 會比只用一個 $g_t$ 好 - 這裡我們需要做兩點妥協 (沒辦法得到真正的 $\bar g$) - **不是做無窮多次的平均** - **不是從真正的 distribution $P$ sample 出 $N$ 筆資料** - **那就用 bootstrapping 來重新 sample 資料，這樣就可以 simulate $D_t$** ### Bootstrap Aggregation ![](https://i.imgur.com/PbHLqyU.png) - 取出放回 sample 出 $N'$ 筆資料 $\tilde D_t$，稱為 bootstrap sample - 這邊一樣加上 $\tilde \,$ 代表跟原本的 $D_t$ 不太一樣 1. 使用 bootstrapping 得到 $T$ 次 data $\tilde D_t$ 2. 套用 algorithm $\mathcal A(\tilde D_t)$ 得到 $g_t$ > Q: 這樣能不能直接寫成 $g_t=\mathcal A(\tilde D_t)$？？ 3. $G=\text{Uniform}(\{g_t\})$ - 這種方法被稱為 **bootstrap aggregation (BAGging)** - 因為 BAGging 是 based on 別的 algorithm，我們稱之為 **meta algorithm** - 它底下的 algorithm 這時候就叫 **base algorithm $\mathcal A$** ### Bagging Pocket in Action ![](https://i.imgur.com/Ucaxnoa.png) - **當 base algorithm 很容易被 data randomness 影響，那麼 bagging 就可以做得好** - 因為這樣可以做出更不一樣的 $g_t$ ### Fun Time: Bootstrap Aggregation (BAGging) ![](https://i.imgur.com/ZljrngJ.png) ### Summary ![](https://i.imgur.com/19XhVAa.png)