Ch8 Adaptive Boosting

--- tags: 機器學習技法 --- Ch8 Adaptive Boosting === ## Content [TOC] ## [Slides & Videos](https://www.csie.ntu.edu.tw/~htlin/mooc/) ## Roadmap ![]() ## [Motivation of Boosting](https://www.youtube.com/watch?v=hL8DjIHAzZY&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=30) - 這直接看過去ㄌ ## [Diversity by Re-weighting](https://www.youtube.com/watch?v=pTNKUj_1Dw8&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=31) ### Bootstrapping as Re-weighting Process ![](https://i.imgur.com/E3mSjcc.png) - **可以把 bootstrap 看成是 re-weighting process** - $u_n^{(t)}$：第 $t$ 次 bootstrap 的第 $n$ 筆資料的 weight ### Weighted Base Algorithm ![](https://i.imgur.com/EfbZyht.png) - 那麼 re-weighting 之後要怎麼 train model? - SVM 就對應到調整 upper bound - 要回去複習一下推導過程 - logistic regression 就對應到調整 sample 的 probability - 較接近 [ML Foundations Ch8](https://hackmd.io/XXAdD3GHRX-8eGvGwLO7BA#Weighted-Classification) 說明類似的作法 class-weighted learning，而現在 example-weighted learning 是它的擴展 ### Re-weighting for More Diverse Hypothesis ![](https://i.imgur.com/bhbccFP.png) - 怎麼在 $t+1$ 時刻透過 re-weight 得到和 $g_t$ 不同的 hypothesis 呢？ - 找到一組 dataset 讓 $g_t$ 表現得很糟糕就可以了 - **所以就是找 $u^{(t+1)}$ 使得 $g_t$ 表現得像 random** - 即 error rate = 1/2 - > Q: 林老师，根据您讲的内容每一个u(t)的更新依据只是g(t-1)和u(t-1)，不用考虑t-1之前的g和u吗？会不会在某个轮次k出现u(t+k)和u(t)很接近从而使得g(t+k)和g(t)很相似呢？ >> 是的，g(t+k) 與 g(t) 有可能很相似甚至一樣 ### 'Optimal' Re-weighting ![](https://i.imgur.com/uBSlVHa.png) - $橘方_{t+1}$：$g_t$ 犯錯部分的 $u_n^{(t+1)}$ 加總 - $綠圓_{t+1}$：$g_t$ 正確部分的 $u_n^{(t+1)}$ 加總 - 希望在讓 $橘方_{t+1}$ & $綠圓_{t+1}$ 相等 - 所以就把錯誤部分的 weight 都乘上 $1-\epsilon_t$；正確部分都乘上 $\epsilon_t$ - $\epsilon_t$：第 $t$ 次錯誤率 ### Fun Time: Optimal Re-weighting ![](https://i.imgur.com/ghFjl6q.png) ## [Adaptive Boosting Algorithm](https://www.youtube.com/watch?v=vqTXLTYqbbw&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=32) ### Scaling Factor ![](https://i.imgur.com/vXPSMBB.png) - 定義 scaling factor $菱形_t=\sqrt{\dfrac{1-\epsilon_t}{\epsilon_t}}$ - 錯的就把 weight 乘上 $菱形_t$ - 對的就把 weight 除以 $菱形_t$ - 這就跟 optimal re-weighting 是一樣的 - 物理意義 - 放大錯誤的 - 縮小正確的 - 這樣就可以有 diverse hypotheses ### A Preliminary Algorithm ![](https://i.imgur.com/laPwSMh.png) - $u^{(1)}=\frac 1N$ - $G(x)$ 不適合用 uniform 來結合 model - 因為 $g_2$ 的 $E_{in}$ 一定很差 (why? :-)) - 可使用任意方式組合，而接下來我們要用 linearly on the fly - 即一邊 aggregate，一邊計算 linear weight? ### Linear Aggregation on the Fly ![](https://i.imgur.com/ZTu2Hj5.png) - 每個 $g_t$ 所對應的 weight $\alpha_t=\ln(菱形_t)=\ln(\sqrt{\frac{1-\epsilon_t}{\epsilon_t}})$ - 這樣的 $\alpha_t$ 有以下性質 - 對於 $菱形_t$ 而言是 monotonic 的，所以 $\epsilon_t$ 越小($g_t$越好)，$\alpha_t$ 就越大 - $\alpha_t=\ln(\sqrt{\frac{1-\epsilon_t}{\epsilon_t}})=\frac 12(\ln(1-\epsilon_t)-\ln(\epsilon_t))$ - **Adaptive Boosting = weak base learning algorithm $\mathcal A$ + optimal re-weighting factor $菱形_t$ + 'magic' linear aggregation $\alpha_t$** - 老師、學生、班級的比喻可以複習 [Motivation of Boosting](https://www.youtube.com/watch?v=hL8DjIHAzZY&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=30) ### Adaptive Boosting (AdaBoost) Algorithm ![](https://i.imgur.com/U26ZXWQ.png) - AdaBoost: 三個臭皮匠，勝過一個諸葛亮 - weak learner ### Theoretical Guarantee of AdaBoost ![](https://i.imgur.com/5ItZmso.png) - 本來小 $g$ 的 VC dimension 是 $d_{VC}$，組合之後所有可能大 $G$ 的 VC dimension 變成 $O(d_{VC}\cdot T\log T)$ - 只要做 $\log N$ 這麼多輪就可以把 $E_{in}$ 做到 0 - 前提是 $\epsilon_t$ - 那為何要加 big-O? $O(\log N)$ - 結論：AdaBoost 不但可以做到 $E_{in}$ 是 0，而且可以保證 $E_{out}$ 很小 - 前提 $\mathcal A$ 可以比亂猜好一點 ### Fun Time: AdaBoost Algorithm ![](https://i.imgur.com/kaPTqtK.png) ## [Adaptive Boosting in Action](https://www.youtube.com/watch?v=5wPN87bwoaE&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2&index=33) ### Decision Stump ![](https://i.imgur.com/GTcjfuz.png) - 之前講過的 decision stump 常常用來作為 AdaBoost 的 base algorithm - decision stump 有 3 個 parameters - feature $i$ - threshold $\theta$ - direction $s$ - 物理意義：以二維空間而言就是切垂直線 or 水平線 - 只需要花費 $O(d\cdot N\log N)$ 的時間就可以找到 decision stump 最佳解 - > Q: 有空想想怎麼算 ### A Simple Dataset 略 ### A Complicated Dataset 略 ### AdaBoost-Stump in Application ![](https://i.imgur.com/h6AHL7V.png) - > Q: 這邊沒很懂在幹嘛@@ - > 還可以當 feature selection @@ ### Fun Time: AdaBoost-Stump ![](https://i.imgur.com/BBbqEOF.png) ### Summary ![](https://i.imgur.com/cPEAmAT.png) - 之後要學的 **decision tree 其實就是一種 conditional aggregation**