# XGBoost outline
## Ensemble learning
### Why? Bias-variance tradeoff
- **Bias**:
- The difference between the average prediction of our model and the correct value we are trying to predict
- A model with high bias pays too little attention to the training data and oversimplifies the problem. → Too simple → **Underfitting**
- **Variance**:
- The variability of the model's predictions for a given data point, i.e. how spread out the predictions are
- A model with high variance pays too much attention to the training data and does not generalize to data it hasn't seen before. → Too complex → **Overfitting**

The tradeoff in model complexity is why there is a tradeoff between bias and variance: a learning algorithm can't be more complex and less complex at the same time.
Since a single model is rarely perfect (both low bias and low variance), we call such models "weak" learners. So why don't we combine these "weak" models to create a "strong" model?
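A minimal sketch (not from the outline itself) of the two failure modes, fitting polynomials of different degrees to a small noisy sine dataset; the sample size, noise level, and degrees are illustrative choices:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple / about right / too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

Typically the degree-1 fit shows high train and test error (high bias), while the degree-15 fit shows near-zero train error but much larger test error (high variance).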
### How?
Ensemble learning
- Bagging: multiple similar models, trained independently on subsamples of the same dataset. The results from each model are averaged to get the final prediction => reduces overall variance
- Boosting: multiple similar models, trained sequentially: each new model focuses more on the previous one's failures => mainly reduces bias, but since the result is similar to a "weighted average" of multiple models, variance comes down as well
- Stacking: multiple models, not necessarily similar in architecture, plus a meta-model (supervisor model) that takes their outputs as inputs and returns a single prediction (all three are sketched in code below)
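A rough scikit-learn sketch of the three flavours; the estimators and hyperparameters are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

ensembles = {
    # Bagging: independent trees on bootstrap subsamples, votes averaged
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    # Boosting: weak learners trained sequentially, each focusing on the
    # previous ones' mistakes
    "boosting": AdaBoostClassifier(n_estimators=50),
    # Stacking: heterogeneous base models + a meta-model on their outputs
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in ensembles.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```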
## Boosting
### Why?
The only difference between bagging and boosting is that bagging runs the models in parallel, while boosting runs them sequentially.

This makes sure the models are trained in the right direction, since the failures of one model receive more attention in the next iteration.
Note that there is a trade-off to this effect: we can no longer train the models in parallel as in bagging, because the output of the previous model is required to train the next one.
### How?
Optimization problem: let's say each model is simply $\hat{y}_i = w_i x$, where $i$ denotes the $i$-th model in the chain. We want to minimize the loss:
$$L\left(y, \sum_{i=1}^N c_i w_i x\right)$$
where $c_i$ is the confidence level of the $i$-th model.
Instead of finding the optimal solution by optimizing all $c_i$ and $w_i$ at once, boosting looks for a local solution for each $c_i$, $w_i$, based on all previous results. The global problem
$$\operatorname*{argmin}_{c_{1..N},\, w_{1..N}} L\left(y, \sum_{i=1}^N c_i w_i x\right)$$
is solved greedily, one stage at a time:
$$\operatorname*{argmin}_{c_i,\, w_i} L\left(y, W_{i-1} + c_i w_i x\right), \quad \text{where } W_{i-1} = \sum_{j=1}^{i-1} c_j w_j x$$
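A toy numeric sketch of this stage-wise procedure (my own setup): each candidate "weak model" is one feature's linear term, the loss is squared error, and every stage greedily fits only its own weight on the residual left by the previous stages:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(scale=0.5, size=200)

W = np.zeros_like(y)                      # W_{i-1}: current ensemble prediction
for i in range(4):
    residual = y - W                      # what previous stages failed to explain
    # For each candidate weak model (one feature), the least-squares weight
    # and the loss it would leave behind
    coefs = X.T @ residual / (X ** 2).sum(axis=0)
    losses = ((residual[:, None] - X * coefs) ** 2).mean(axis=0)
    j = int(np.argmin(losses))            # pick the weak model that helps most
    W = W + coefs[j] * X[:, j]            # add it with its fitted weight c_i * w_i
    print(f"stage {i}: feature {j}, weight {coefs[j]:+.3f}, "
          f"loss {np.mean((y - W) ** 2):.3f}")
```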
#### **Adaptive Boosting (AdaBoost)**
As explained, after each iteration we increase the weights of the wrong guesses so that the next iteration focuses more on them.
AdaBoost is implemented roughly as follows (a from-scratch sketch follows the list):
- Initialize all sample weights to $1/N$
- Run an iteration (train a weak learner on the weighted samples)
- Calculate the confidence score $c_i$ based on the total weighted loss
- Validate and assign new weights to each sample (increase for failures, decrease for successes)
- Run the next iteration; rinse, repeat.
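A compact from-scratch sketch of these steps for binary labels in {-1, +1}; the decision stumps from scikit-learn are an illustrative choice of weak learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = np.where(y01 == 1, 1, -1)

n = len(y)
weights = np.full(n, 1.0 / n)            # initialize all weights to 1/N
stumps, alphas = [], []

for _ in range(20):                      # run the iterations
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = weights[pred != y].sum()       # weighted error of this round
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # confidence score c_i
    # Re-weight: increase for failures, decrease for successes, then normalize
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the confidence-weighted vote of all stumps
ensemble = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (ensemble == y).mean())
```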
#### **Gradient Boosting**
> Chad AdaBoost
Instead of re-weighting samples like AdaBoost, gradient boosting fits each new learner to the negative gradient of the loss with respect to the current prediction; for squared loss, that is simply the residual left by the previous learners.
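A minimal sketch for squared loss, where the negative gradient is just the residual; the tree depth, learning rate, and number of rounds below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

prediction = np.full_like(y, y.mean())   # start from a constant model
learning_rate, trees = 0.1, []

for _ in range(100):
    residual = y - prediction            # negative gradient of 1/2 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                # each learner models what is still wrong
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```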
**Base learner**
A wise man once said:
> When in doubt, go all-in on CART (Classification And Regression Trees)
*I thought an ANN or a perceptron should perform better? Why not use them instead?*
Two reasons:
- The base learner should be weak
- There will be many base learners trained sequentially, so even a slight increase in the base learner's complexity compounds across the ensemble and can cost you a lot (see the toy comparison below)
// Add sth more too
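A quick toy comparison (my own, and the exact numbers depend on the dataset) of how the base learner's depth affects a boosted ensemble:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 8):   # stump vs. shallow CART vs. fairly deep CART
    model = GradientBoostingRegressor(max_depth=depth, n_estimators=200,
                                      random_state=0)
    model.fit(X_tr, y_tr)
    print(f"max_depth={depth}: "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.1f}, "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.1f}")
```

Training error keeps dropping as the base learner gets deeper, but the test error usually stops improving (or gets worse), which is why shallow CARTs are the default choice.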
#### **XGBoost**
> Gradient boosting on steroids
- Split-finding algorithms
- Exact greedy
- Approximate:
- Weighted quantile sketch
- Sparsity-aware
- Parallel learning:
- Column block:
- For exact greedy, all data is stored in a single sorted block
- For the approximate algorithm, data can be split into multiple blocks, sorted across multiple machines, or stored on disk for the out-of-core case
- Cache-aware
- Out-of-core computations
*Do we really have to do these? :sadge:*
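A hedged sketch of switching between XGBoost's split-finding strategies through the `tree_method` parameter; the set of accepted values can differ slightly between XGBoost versions:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=50, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "exact"  = exact greedy enumeration of candidate splits
# "approx" / "hist" = approximate candidates from (weighted) quantiles/histograms
# Missing values (np.nan) are handled natively by sparsity-aware split finding
for method in ("exact", "approx", "hist"):
    model = xgb.XGBRegressor(tree_method=method, n_estimators=200, max_depth=4)
    model.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)
    rmse = model.evals_result()["validation_0"]["rmse"][-1]
    print(f"{method:>6}: validation RMSE = {rmse:.2f}")
```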
##### **Demo**
- Run the Colab notebook
- Try visualizing the split-finding algorithms
## Bucket list
- LightGBM: an improvement over XGBoost
## References
- https://viblo.asia/p/ensemble-learning-va-cac-bien-the-p1-WAyK80AkKxX
- https://viblo.asia/p/gradient-boosting-tat-tan-tat-ve-thuat-toan-manh-me-nhat-trong-machine-learning-YWOZrN7vZQ0
- https://xgboost.readthedocs.io/en/stable/tutorials/model.html
- https://dl.acm.org/doi/pdf/10.1145/2939672.2939785
- http://colt2008.cs.helsinki.fi/papers/26-Shwartz.pdf