# XGBoost outline
## Ensemble learning
### Why? Bias-variance tradeoff
- **Bias**:
- The difference between the average prediction of our model and the correct value we are trying to predict
- A model with high bias pays too little attention to the training data and oversimplifies the problem. → Too simple → **Underfitting**
- **Variance**:
- The variability of the model's predictions for a given data point, i.e. how spread out the predictions are
- A model with high variance pays too much attention to the training data and does not generalize to data it hasn't seen before. → Too complex → **Overfitting**

The tradeoff in model complexity is why there is a tradeoff between bias and variance: a learning algorithm can't be more complex and less complex at the same time.
Since a single model is rarely perfect (both low bias and low variance), we call such models "weak" learners. So why don't we combine these "weak" models to create a "strong" model?
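A minimal sketch (not from the outline itself) of the two failure modes, fitting polynomials of different degrees to a small noisy sine dataset; the sample size, noise level, and degrees are illustrative choices:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple / about right / too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

Typically the degree-1 fit shows high train and test error (high bias), while the degree-15 fit shows near-zero train error but much larger test error (high variance).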
### How?
Ensemble learning
- Bagging: multiple similar models, trained independently on subsamples of the same dataset. The results from each model are averaged to get the final prediction => reduces overall variance
- Boosting: multiple similar models, trained sequentially: each new model focuses more on the previous one's failures => mainly reduces bias, but since the result is similar to a "weighted average" of multiple models, variance comes down as well
- Stacking: multiple models, not necessarily similar in architecture, plus a meta-model (supervisor model) that takes their outputs as inputs and returns a single prediction (all three are sketched in code below)
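A rough scikit-learn sketch of the three flavours; the estimators and hyperparameters are illustrative choices, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

ensembles = {
    # Bagging: independent trees on bootstrap subsamples, votes averaged
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    # Boosting: weak learners trained sequentially, each focusing on the
    # previous ones' mistakes
    "boosting": AdaBoostClassifier(n_estimators=50),
    # Stacking: heterogeneous base models + a meta-model on their outputs
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in ensembles.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```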
## Boosting
### Why?
The only difference between bagging and boosting is that bagging runs the models in parallel, while boosting runs them sequentially.

This makes sure the models are trained in the right direction, since the failures of one model receive more attention in the next iteration.
Note that there is a trade-off to this effect: we can no longer train the models in parallel as in bagging, because the output of the previous model is required to train the next one.
### How?
Optimization problem: let's say each model is simply $\hat{y}_i = w_i x$, where $i$ denotes the $i$-th model in the chain. We want to minimize the loss:
$$L\left(y, \sum_{i=1}^N c_i w_i x\right)$$
where $c_i$ is the confidence level of the $i$-th model.
Instead of finding the optimal solution by optimizing all $c_i$ and $w_i$ at once, boosting looks for a local solution for each $c_i$, $w_i$, based on all previous results. The global problem
$$\operatorname*{argmin}_{c_{1..N},\, w_{1..N}} L\left(y, \sum_{i=1}^N c_i w_i x\right)$$
is solved greedily, one stage at a time:
$$\operatorname*{argmin}_{c_i,\, w_i} L\left(y, W_{i-1} + c_i w_i x\right), \quad \text{where } W_{i-1} = \sum_{j=1}^{i-1} c_j w_j x$$
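A toy numeric sketch of this stage-wise procedure (my own setup): each candidate "weak model" is one feature's linear term, the loss is squared error, and every stage greedily fits only its own weight on the residual left by the previous stages:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(scale=0.5, size=200)

W = np.zeros_like(y)                      # W_{i-1}: current ensemble prediction
for i in range(4):
    residual = y - W                      # what previous stages failed to explain
    # For each candidate weak model (one feature), the least-squares weight
    # and the loss it would leave behind
    coefs = X.T @ residual / (X ** 2).sum(axis=0)
    losses = ((residual[:, None] - X * coefs) ** 2).mean(axis=0)
    j = int(np.argmin(losses))            # pick the weak model that helps most
    W = W + coefs[j] * X[:, j]            # add it with its fitted weight c_i * w_i
    print(f"stage {i}: feature {j}, weight {coefs[j]:+.3f}, "
          f"loss {np.mean((y - W) ** 2):.3f}")
```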
#### **Adaptive Boosting (AdaBoost)**
As explained, after each iteration we increase the weights of the wrong guesses so that the next iteration focuses more on them.
AdaBoost is implemented roughly as follows (a from-scratch sketch follows the list):
- Initialize all sample weights to $1/N$
- Run an iteration (train a weak learner on the weighted samples)
- Calculate the confidence score $c_i$ based on the total weighted loss
- Validate and assign new weights to each sample (increase for failures, decrease for successes)
- Run the next iteration; rinse, repeat.
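A compact from-scratch sketch of these steps for binary labels in {-1, +1}; the decision stumps from scikit-learn are an illustrative choice of weak learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = np.where(y01 == 1, 1, -1)

n = len(y)
weights = np.full(n, 1.0 / n)            # initialize all weights to 1/N
stumps, alphas = [], []

for _ in range(20):                      # run the iterations
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = weights[pred != y].sum()       # weighted error of this round
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # confidence score c_i
    # Re-weight: increase for failures, decrease for successes, then normalize
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the confidence-weighted vote of all stumps
ensemble = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (ensemble == y).mean())
```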
#### **Gradient Boosting**
> Chad AdaBoost
Instead of re-weighting samples like AdaBoost, gradient boosting fits each new learner to the negative gradient of the loss with respect to the current prediction; for squared loss, that is simply the residual left by the previous learners.
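A minimal sketch for squared loss, where the negative gradient is just the residual; the tree depth, learning rate, and number of rounds below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

prediction = np.full_like(y, y.mean())   # start from a constant model
learning_rate, trees = 0.1, []

for _ in range(100):
    residual = y - prediction            # negative gradient of 1/2 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                # each learner models what is still wrong
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```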
**Base learner**
A wise man once said:
> When in doubt, go all-in on CART (Classification And Regression Trees)
*I thought an ANN or a perceptron should perform better? Why not use them instead?*
Two reasons:
- The base learner should be weak
- There will be many base learners trained sequentially, so even a slight increase in the base learner's complexity compounds across the ensemble and can cost you a lot (see the toy comparison below)
// Add sth more too
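A quick toy comparison (my own, and the exact numbers depend on the dataset) of how the base learner's depth affects a boosted ensemble:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, 8):   # stump vs. shallow CART vs. fairly deep CART
    model = GradientBoostingRegressor(max_depth=depth, n_estimators=200,
                                      random_state=0)
    model.fit(X_tr, y_tr)
    print(f"max_depth={depth}: "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.1f}, "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.1f}")
```

Training error keeps dropping as the base learner gets deeper, but the test error usually stops improving (or gets worse), which is why shallow CARTs are the default choice.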
#### **XGBoost**
> Gradient boosting on steroids
- Split-finding algorithms
- Exact greedy
- Approximate:
- Weighted quantile sketch
- Sparsity-aware
- Parallel learning:
- Column block:
- For exact greedy, all data is stored in a single sorted block
- For the approximate algorithm, data can be split into multiple blocks, sorted across multiple machines, or stored on disk for the out-of-core case
- Cache-aware
- Out-of-core computations
*Do we really have to do these? :sadge:*
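A hedged sketch of switching between XGBoost's split-finding strategies through the `tree_method` parameter; the set of accepted values can differ slightly between XGBoost versions:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=50, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "exact"  = exact greedy enumeration of candidate splits
# "approx" / "hist" = approximate candidates from (weighted) quantiles/histograms
# Missing values (np.nan) are handled natively by sparsity-aware split finding
for method in ("exact", "approx", "hist"):
    model = xgb.XGBRegressor(tree_method=method, n_estimators=200, max_depth=4)
    model.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=False)
    rmse = model.evals_result()["validation_0"]["rmse"][-1]
    print(f"{method:>6}: validation RMSE = {rmse:.2f}")
```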
##### **Demo**
- Run the Colab notebook
- Try visualizing the split-finding algorithms
## Bucket list
- LightGBM: an improvement over XGBoost
## References
- https://viblo.asia/p/ensemble-learning-va-cac-bien-the-p1-WAyK80AkKxX
- https://viblo.asia/p/gradient-boosting-tat-tan-tat-ve-thuat-toan-manh-me-nhat-trong-machine-learning-YWOZrN7vZQ0
- https://xgboost.readthedocs.io/en/stable/tutorials/model.html
- https://dl.acm.org/doi/pdf/10.1145/2939672.2939785
- http://colt2008.cs.helsinki.fi/papers/26-Shwartz.pdf