# Boosting Algorithms: AdaBoost, XGBoost, LightGBM, and CatBoost

## Introduction

Boosting is a powerful ensemble learning technique that combines multiple weak learners (models slightly better than random guessing) to create a strong, highly accurate model. The core idea is to iteratively train models, with each new model focusing on correcting the mistakes of its predecessors.

### Key Principles of Boosting

1. **Sequential Training:** Boosting trains models sequentially, where each new model learns from the mistakes of the previous ones.
2. **Error Focus:** New models explicitly target the shortcomings of existing models by emphasizing misclassified data points.
3. **Weighted Combination:** The final prediction is a weighted average of the individual model predictions, with better-performing models given higher weights.

## AdaBoost (Adaptive Boosting)

AdaBoost is a classic boosting algorithm that often uses decision trees (especially decision stumps) as weak learners.

### How AdaBoost Works

1. **Initialization:** Assign equal weights to all data points.
2. **Iteration:**
    * Train a weak learner on the weighted data.
    * Calculate the error of the weak learner.
    * Increase the weight of misclassified points and decrease the weight of correctly classified points. This emphasizes the hard-to-classify examples for the next iteration.
3. **Combination:** Combine the weak learners using a weighted majority vote, where weights are based on accuracy.

### Advantages and Disadvantages

* **Advantages:** Simple, fast, versatile, handles mixed data types well.
* **Disadvantages:** Sensitive to outliers, can overfit with noisy data.

### Algorithm

:::success
<font color=blue>Algorithm: AdaBoost.M1</font>

**Input:**
- Sequence of $m$ examples $\left\{(x_1, y_1), \ldots, (x_m, y_m) \right\}$ with labels $y_i \in Y = \{1, \ldots, k\}$
- Distribution $D$ over the $m$ examples
- Weak learning algorithm **WeakLearn**
- Integer $T$ specifying number of iterations

**Initialize** $D_1(i)=1/m$ for all $i$.

**Do for** $t=1, 2, \ldots, T$

1. Call **WeakLearn**, providing it with the distribution $D_t$.
2. Get back a hypothesis $h_t : X \to Y$.
3. Calculate the error of $h_t$: $\epsilon_t = \sum\limits_{i:h_t(x_i) \neq y_i} D_t(i)$. If $\epsilon_t > \frac{1}{2}$, then set $T = t-1$ and abort loop.
4. Set $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.
5. Update distribution $D_t$:
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} \beta_t &\text{if } h_t(x_i) = y_i \\ 1 &\text{otherwise}\end{cases}$$
where $Z_t$ is a normalization constant (chosen so that $D_{t+1}$ will be a distribution).

**Output** the final hypothesis:
$$
h_f(x) = \mathop{\arg\max}_{y \in Y} \sum_{t:h_t(x) = y} \log \frac{1}{\beta_t}
$$
:::

### Code Implementation

```python=
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create a weak learner (a decision stump)
weak_learner = DecisionTreeClassifier(max_depth=1)

# Create the AdaBoost classifier
# (in scikit-learn < 1.2 the parameter is named base_estimator instead of estimator)
model = AdaBoostClassifier(estimator=weak_learner, n_estimators=50)

# Train the model on your data
model.fit(X_train, y_train)
```
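To connect the library call above to the AdaBoost.M1 pseudocode, here is a minimal NumPy sketch of a single boosting round, assuming toy arrays `y` (true labels) and `pred` (one weak learner's predictions); the names and values are illustrative only, not part of any library.

```python=
import numpy as np

# Toy labels and one weak learner's predictions (illustrative values only)
y    = np.array([1, 1, 0, 0, 1])
pred = np.array([1, 0, 0, 0, 1])      # the second example is misclassified

D = np.full(len(y), 1 / len(y))       # D_1(i) = 1/m

miss = pred != y
eps  = D[miss].sum()                  # epsilon_t: weighted training error
beta = eps / (1 - eps)                # beta_t (< 1 when eps < 1/2)

D[~miss] *= beta                      # shrink weights of correctly classified points
D /= D.sum()                          # normalize so D_{t+1} is a distribution (Z_t)

print(eps, beta, D)                   # the misclassified example now carries more weight
```

After normalization the misclassified example's weight grows from 0.2 to 0.5, which is exactly the "error focus" behaviour described earlier.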
## Gradient Tree Boosting

Gradient Boosting is a generalization of boosting where models are built to minimize an arbitrary differentiable loss function. It often uses decision trees as base learners.

### How Gradient Tree Boosting Works

1. **Initialization:** Start with a simple model (e.g., the mean prediction).
2. **Iteration:**
    * Calculate pseudo-residuals (negative gradients), indicating the current model's errors.
    * Fit a decision tree to these pseudo-residuals.
    * Find the optimal step size to update predictions based on the new tree.
    * Add the scaled tree's contribution to the model.
3. **Repeat:** Iterate the above steps for a set number of trees.

#### Mathematical Formulation

The algorithm seeks to approximate a function $F^*(x)$ that minimizes the expected value of a loss function $L(y, F(x))$. It does this with an additive model:
$$
F^*(x) = \sum_{m=1}^M \gamma_m h_m(x) + \text{constant},
$$
where:
* $h_m(x)$ are base functions (usually decision trees)
* $\gamma_m$ are coefficients (weights) for each tree

#### Algorithm

:::success
<font color=blue>Algorithm: Gradient Tree Boosting</font>

**Input:**
* Training set $\{(x_i, y_i)\}_{i=1}^n$
* Number of iterations $M$
* Choice of the loss function $L(y, F(x))$

**Initialize** $F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma)$

**For** $m = 1$ to $M$:

1. Compute the negative gradient (pseudo-residuals) for $i=1,\ldots,n$:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$$
2. Fit a regression tree $h_m(x)$ to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j=1,\ldots,J_m$.
3. For $j=1,\ldots,J_m$ compute:
$$\gamma_{jm} = \underset{\gamma}{\arg\min} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$$
4. Update the model:
$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$$

**Output** $F_M(x)$
:::
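The loop above is short enough to implement directly. Here is a minimal from-scratch sketch for squared-error loss (where the pseudo-residuals are simply $y - F_{m-1}(x)$), using scikit-learn regression trees as base learners; the learning rate, tree depth, and toy data are illustrative assumptions, not part of the algorithm statement above.

```python=
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # toy regression data

M, lr = 50, 0.1                        # number of trees, shrinkage (learning rate)
F = np.full_like(y, y.mean())          # F_0(x): constant that minimizes squared loss
trees = []

for m in range(M):
    residuals = y - F                  # negative gradient of 1/2 (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += lr * tree.predict(X)          # F_m = F_{m-1} + lr * h_m
    trees.append(tree)

def predict(X_new):
    return y.mean() + lr * sum(t.predict(X_new) for t in trees)

print(np.mean((predict(X) - y) ** 2))  # training MSE shrinks as M grows
```

For squared loss, the leaf means fitted by the regression tree already coincide with the optimal $\gamma_{jm}$ in step 3, so no separate line search is needed; the shrinkage factor `lr` is the usual extra regularization knob.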
### Advantages of Gradient Tree Boosting

* **High Accuracy:** Often outperforms other algorithms in predictive power.
* **Handles Diverse Data:** Works well with numerical, categorical, and mixed data types.
* **Automatic Feature Selection:** The tree-based structure naturally selects the most important features.
* **Robust to Missing Values:** No need for explicit imputation in most cases.
* **Handles Feature Interactions:** Captures complex relationships between features.

### Limitations of Gradient Tree Boosting

* **Computationally Intensive:** Training can be slow, especially for large datasets.
* **Risk of Overfitting:** Requires careful tuning to avoid overfitting to the training data.
* **Hyperparameter Sensitivity:** Performance can vary significantly depending on hyperparameter settings.
* **Less Interpretable:** The ensemble nature can make it harder to understand the model's reasoning compared to a single decision tree.

## XGBoost (Extreme Gradient Boosting)

XGBoost is a high-performance implementation of gradient boosting that has gained widespread popularity for its speed, scalability, and exceptional predictive power. It's a go-to algorithm for both machine learning competitions and real-world applications due to its ability to consistently achieve state-of-the-art results across a wide variety of tasks.

### Key Enhancements

* **Regularized Objective:** XGBoost adds an explicit regularization term to the training objective, penalizing the number of leaves and the magnitude (L2 norm) of the leaf weights. This discourages overly complex trees and helps prevent overfitting.
* **Second-Order Optimization:** Unlike standard gradient boosting, which typically uses only first-order gradients, XGBoost employs second-order gradients (Hessian) during the optimization process. This allows for faster convergence and can lead to better solutions.
* **Tree Pruning:** XGBoost intelligently prunes trees during the building process. It removes branches that have a negative impact on the model's performance, reducing complexity and enhancing both speed and generalization.
* **Sparse Data Handling:** XGBoost is designed to handle sparse datasets efficiently. It includes specialized algorithms that automatically determine the optimal default direction for missing values, leading to faster training and improved accuracy.
* **Parallel and Distributed Computing:** XGBoost is optimized for parallel and distributed computing environments. It leverages multi-threading and distributed processing to significantly accelerate the training process, especially on large datasets.

### How XGBoost Works

XGBoost follows the core principles of gradient boosting, but with significant enhancements:

1. **Initialization:** Starts with an initial prediction (e.g., the average of the target variable).
2. **Iteration:**
    * **Gradient Calculation:** Computes the first-order gradients (the direction and magnitude of errors) and second-order gradients (Hessian, the curvature of errors) for each training example.
    * **Tree Building:** Constructs a decision tree that aims to minimize the objective function, which includes both a loss term (measuring prediction errors) and a regularization term (controlling model complexity).
    * **Optimal Leaf Values:** Determines the optimal values for each leaf of the tree to minimize the objective function.
    * **Tree Pruning:** Removes branches that have a negative impact or minimal contribution to the objective function.
    * **Model Update:** Adds the new tree to the ensemble, weighted by a learning rate.
3. **Repeat:** Repeats step 2 for a specified number of iterations (trees).
4. **Prediction:** The final prediction is a weighted sum of the predictions from all the trees in the ensemble.

#### Mathematical Formulation

The core of XGBoost's objective function in the $t$-th iteration is:
$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^n\left[g_i f_t\left(\mathbf{x}_i\right)+\frac{1}{2} h_i f_t^2\left(\mathbf{x}_i\right)\right]+\Omega\left(f_t\right)
$$
where:
* $g_i$ and $h_i$ are the first- and second-order gradients, respectively.
* $f_t(\mathbf{x}_i)$ is the prediction of the $t$-th tree.
* $\Omega(f_t)$ is the regularization term penalizing model complexity.
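To make the second-order objective concrete, here is a small NumPy sketch for binary logistic loss: it computes $g_i$ and $h_i$ from the current raw scores and then the closed-form optimal leaf weight $w^* = -G/(H+\lambda)$ and the objective reduction $\tfrac{1}{2} G^2/(H+\lambda)$ for one leaf. The toy arrays and the value of `lmbda` are illustrative assumptions.

```python=
import numpy as np

# Current raw (log-odds) predictions and labels for the examples in one leaf
raw_scores = np.array([0.2, -0.4, 0.1, 0.0])
labels     = np.array([1, 0, 1, 0])
lmbda      = 1.0                        # L2 regularization on leaf weights

p = 1.0 / (1.0 + np.exp(-raw_scores))   # predicted probabilities
g = p - labels                          # first-order gradients of the log loss
h = p * (1.0 - p)                       # second-order gradients (Hessians)

G, H = g.sum(), h.sum()
leaf_weight = -G / (H + lmbda)          # optimal leaf value under the quadratic objective
gain        = 0.5 * G**2 / (H + lmbda)  # objective reduction contributed by this leaf

print(leaf_weight, gain)
```

The split score used in the algorithms below compares this quantity between the two children and the parent, $\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{G^2}{H+\lambda}$ (up to constant factors and the per-leaf complexity penalty).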
### Split Finding Algorithms

A crucial step in building decision tree models is determining the optimal split points for each feature. The goal is to find splits that maximize the separation of different classes or minimize impurity. XGBoost offers several sophisticated algorithms to address this challenge:

#### Exact Greedy Algorithm

This algorithm exhaustively explores all possible split points for each feature. It operates as follows:

1. **Sorting:** For each feature, the algorithm sorts the training instances based on their feature values.
2. **Evaluation:** It iterates through the sorted values, evaluating the information gain (or another splitting criterion) at each potential split point. The gain is calculated using XGBoost's objective function, which balances both the improvement in purity (how well the split separates the classes) and the model's complexity.
3. **Selection:** The algorithm selects the split point that yields the highest gain.

:::success
<font color=blue>Algorithm: Exact Greedy Algorithm for Split Finding</font>

* **Input:** $I$, instance set of current node
* **Input:** $m$, feature dimension
* *score* $\leftarrow 0$
* $G \leftarrow \sum_{i \in I} g_i, H \leftarrow \sum_{i \in I} h_i$
* **for** $k=1$ to $m$ **do**
    * $G_L \leftarrow 0, H_L \leftarrow 0$
    * **for** $j$ in $\operatorname{sorted}(I,$ by $\mathbf{x}_{j k})$ **do**
        1. $G_L \leftarrow G_L+g_j, H_L \leftarrow H_L+h_j$
        2. $G_R \leftarrow G-G_L, H_R \leftarrow H-H_L$
        3. *score* $\leftarrow \max \left(\right.$ *score*, $\left.\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{G^2}{H+\lambda}\right)$
    * **end**
* **end**
* **Output:** Split with max score
:::

While the exact greedy algorithm guarantees the best possible split, it can be computationally expensive, especially for large datasets with many features or continuous features with a wide range of values.
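Here is a minimal NumPy sketch of the inner loop of the exact greedy search over a single feature, assuming precomputed gradient arrays `g` and `h` (as in the earlier logistic-loss example) and an illustrative `lmbda`; in XGBoost itself this scan runs over presorted columns for every feature.

```python=
import numpy as np

def best_split_one_feature(x, g, h, lmbda=1.0):
    """Scan sorted values of one feature and return (best_score, threshold)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    GL = HL = 0.0
    best_score, best_thr = 0.0, None
    for j in range(len(x) - 1):              # candidate split between positions j and j+1
        GL += g[j]; HL += h[j]
        GR, HR = G - GL, H - HL
        score = GL**2 / (HL + lmbda) + GR**2 / (HR + lmbda) - G**2 / (H + lmbda)
        if x[j + 1] != x[j] and score > best_score:
            best_score, best_thr = score, (x[j] + x[j + 1]) / 2
    return best_score, best_thr

# Toy usage with random gradients (illustrative only)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
g = rng.normal(size=100)
h = np.full(100, 0.25)
print(best_split_one_feature(x, g, h))
```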
#### Approximate Algorithm

To improve scalability, XGBoost also offers an approximate algorithm. This algorithm works as follows:

1. **Candidate Split Points:** Instead of evaluating all possible splits, it proposes a limited number of candidate split points. These candidates are typically based on quantiles (percentiles) of the feature distribution.
2. **Histogram Approximation:** The algorithm constructs histograms for each feature, summarizing the distribution of data points and their gradients within each quantile.
3. **Evaluation:** It evaluates the gain for each candidate split point using the histogram approximation.
4. **Selection:** The split point that offers the highest gain among the candidates is selected.

:::success
<font color=blue>Algorithm: Approximate Algorithm for Split Finding</font>

* **for** $k=1$ to $m$ **do**
    * Propose $S_k=\left\{s_{k 1}, s_{k 2}, \cdots, s_{k l}\right\}$ by percentiles on feature $k$.
    * Proposal can be done per tree (global), or per split (local).
* **end**
* **for** $k=1$ to $m$ **do**
    * $G_{k v} \leftarrow \sum_{j \in\left\{j \mid s_{k, v} \geq \mathbf{x}_{j k}>s_{k, v-1}\right\}} g_j$
    * $H_{k v} \leftarrow \sum_{j \in\left\{j \mid s_{k, v} \geq \mathbf{x}_{j k}>s_{k, v-1}\right\}} h_j$
* **end**
* Follow the same steps as in the previous section to find the max score, only among the proposed splits.
:::

XGBoost provides two variants of the approximate algorithm:

* **Global Variant:** Candidate split points are determined once before tree construction and used across all levels of the tree. This is faster but may lead to slightly less optimal splits.
* **Local Variant:** Candidate split points are re-proposed at each level of the tree, potentially resulting in more accurate splits but with increased computational cost.

#### Sparsity-Aware Split Finding

XGBoost is designed to handle sparse datasets (where many feature values are missing) efficiently. Its sparsity-aware split finding algorithm gracefully handles missing values by:

1. **Default Direction:** Each node in the tree maintains a default direction (either the left or right child).
2. **Missing Value Assignment:** When encountering a missing value, the algorithm assigns the corresponding instance to the default direction, without calculating the gain for that specific instance.

The default direction is chosen based on the direction that would result in the best gain overall, considering only the instances with non-missing values for the given feature.

:::success
<font color=blue>Algorithm: Sparsity-Aware Split Finding</font>

* **Input:** $I$, instance set of current node
* **Input:** $I_k=\left\{i \in I \mid x_{i k} \neq \text{missing} \right\}$
* **Input:** $m$, feature dimension
* // Also applies to the approximate setting: only collect statistics of non-missing entries into buckets
* *score* $\leftarrow 0$
* $G \leftarrow \sum_{i \in I} g_i , H \leftarrow \sum_{i \in I} h_i$
* **for** $k=1$ to $m$ **do**
    * // enumerate missing value goto right
    * $G_L \leftarrow 0, H_L \leftarrow 0$
    * **for** $j$ in $\operatorname{sorted}\left(I_k\right.$, ascending order by $\left.\mathbf{x}_{j k}\right)$ **do**
        1. $G_L \leftarrow G_L+g_j, H_L \leftarrow H_L+h_j$
        2. $G_R \leftarrow G-G_L, H_R \leftarrow H-H_L$
        3. *score* $\leftarrow \max \left(\right.$ *score*, $\left.\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{G^2}{H+\lambda}\right)$
    * **end**
    * // enumerate missing value goto left
    * $G_R \leftarrow 0, H_R \leftarrow 0$
    * **for** $j$ in $\operatorname{sorted}\left(I_k\right.$, descending order by $\left.\mathbf{x}_{j k}\right)$ **do**
        1. $G_R \leftarrow G_R+g_j, H_R \leftarrow H_R+h_j$
        2. $G_L \leftarrow G-G_R, H_L \leftarrow H-H_R$
        3. *score* $\leftarrow \max \left(\right.$ *score*, $\left.\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{G^2}{H+\lambda}\right)$
    * **end**
* **end**
* **Output:** Split and default directions with max score
:::

#### Weighted Quantile Sketch

When dealing with datasets where instances have varying weights (e.g., due to imbalanced classes), XGBoost employs a weighted quantile sketch algorithm. This algorithm intelligently selects candidate split points based on the distribution of weighted data, ensuring that the chosen splits effectively consider the importance of different instances.

### Code Implementation

```python=
import xgboost as xgb

# Assuming you have X_train, y_train data
model = xgb.XGBClassifier(
    n_estimators=100,   # Number of trees
    learning_rate=0.1,  # Step size shrinkage
    max_depth=3,        # Maximum depth of each tree
    reg_lambda=1        # L2 regularization
)
model.fit(X_train, y_train)
```
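As a usage follow-up to the snippet above (assuming a held-out `X_test`/`y_test` split also exists), predictions, class probabilities, and per-feature importances can be inspected through the scikit-learn style interface:

```python=
from sklearn.metrics import accuracy_score

# Hard class predictions and class probabilities
y_pred  = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("accuracy:", accuracy_score(y_test, y_pred))

# Per-feature importances of the trained booster, one value per input feature
print("feature importances:", model.feature_importances_)
```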
### Key Considerations

* **Hyperparameter Tuning:** XGBoost's performance is sensitive to hyperparameter settings. It's crucial to tune parameters like learning rate, tree depth, and regularization strength to achieve the best results.
* **Interpretability:** While individual trees in XGBoost can be inspected, the overall model can be complex and less interpretable than simpler models.

## LightGBM: A High-Performance Gradient Boosting Framework

LightGBM stands for "Light Gradient Boosting Machine." It's a gradient boosting framework designed to be fast, scalable, and accurate, especially when dealing with large datasets. It achieves this through several innovative techniques that set it apart from traditional gradient boosting methods.

### How LightGBM Works

LightGBM's speed and efficiency primarily stem from two key optimizations:

1. **Gradient-based One-Side Sampling (GOSS):**
    * **Challenge:** In gradient boosting, not all training instances are equally important. Instances with small gradients (errors) are already well-predicted by the current model and contribute less to improving accuracy.
    * **LightGBM's Solution:** GOSS intelligently samples the training data. It retains instances with large gradients, as these are more informative for learning, and randomly samples a small fraction of instances with small gradients. To compensate for under-representing small-gradient instances, GOSS assigns them higher weights, ensuring the model's overall learning direction isn't biased; a NumPy sketch follows the algorithm box below.

:::success
<font color=blue>Algorithm: Gradient-based One-Side Sampling</font>

* **Input:** $I$: training data, $d$: iterations
* **Input:** $a$: sampling ratio of large gradient data
* **Input:** $b$: sampling ratio of small gradient data
* **Input:** *loss*: loss function, $L$: weak learner
* *models* $\leftarrow\{\}$, *fact* $\leftarrow \frac{1-a}{b}$
* *topN* $\leftarrow a \times \operatorname{len}(I)$, *randN* $\leftarrow b \times \operatorname{len}(I)$
* **for** $i=1$ to $d$ **do**
    * *preds* $\leftarrow$ *models*.predict$(I)$
    * $\mathrm{g} \leftarrow$ *loss*$(I, preds)$, $\mathrm{w} \leftarrow\{1,1, \ldots\}$
    * *sorted* $\leftarrow$ GetSortedIndices(abs(g))
    * *topSet* $\leftarrow$ *sorted*[1:*topN*]
    * *randSet* $\leftarrow$ RandomPick(*sorted*[*topN*:len($I$)], *randN*)
    * *usedSet* $\leftarrow$ *topSet* + *randSet*
    * $\mathrm{w}[randSet] \times= fact \ \triangleright$ Assign weight *fact* to the small gradient data.
    * *newModel* $\leftarrow L(I[usedSet], -\mathrm{g}[usedSet], \mathrm{w}[usedSet])$
    * *models*.append(*newModel*)
:::
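A NumPy sketch of one GOSS sampling step, assuming an illustrative gradient vector `g` and ratios `a`, `b`; it returns the selected indices and the compensating weights used when fitting the next tree.

```python=
import numpy as np

def goss_sample(g, a=0.2, b=0.1, rng=np.random.default_rng(0)):
    """Keep the top a*n instances by |gradient|, randomly keep b*n of the rest,
    and up-weight the random part by (1 - a) / b."""
    n = len(g)
    top_n, rand_n = int(a * n), int(b * n)

    order = np.argsort(-np.abs(g))             # indices sorted by |g|, descending
    top_set = order[:top_n]
    rand_set = rng.choice(order[top_n:], size=rand_n, replace=False)

    used = np.concatenate([top_set, rand_set])
    weights = np.ones(len(used))
    weights[top_n:] = (1 - a) / b              # compensate the sampled small-gradient part
    return used, weights

g = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(g)
print(len(idx), w[:5], w[-5:])                 # 300 kept instances; weights 1.0 and 8.0
```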
2. **Exclusive Feature Bundling (EFB):**
    * **Challenge:** High-dimensional data (with many features) can slow down training and consume more memory.
    * **LightGBM's Solution:** EFB bundles mutually exclusive features together. Mutually exclusive features are those that rarely take non-zero values simultaneously. By bundling them, LightGBM reduces the effective number of features, resulting in faster training and lower memory usage.

:::success
<font color=blue>Algorithm: Greedy Bundling</font>

* **Input:** $F$: features, $K$: max conflict count
* Construct graph $G$
* *searchOrder* $\leftarrow G$.sortByDegree()
* *bundles* $\leftarrow\{\}$, *bundlesConflict* $\leftarrow\{\}$
* **for** $i$ **in** *searchOrder* **do**
    * *needNew* $\leftarrow$ True
    * **for** $j=1$ **to** len(*bundles*) **do**
        1. *cnt* $\leftarrow$ ConflictCnt(*bundles*$[j]$, $F[i]$)
        2. **if** *cnt* + *bundlesConflict*$[i] \leq K$ **then**
            * *bundles*$[j]$.add($F[i]$), *needNew* $\leftarrow$ False
            * **break**
    * **if** *needNew* **then**
        1. Add $F[i]$ as a new bundle to *bundles*
* **Output:** *bundles*
:::

:::success
<font color=blue>Algorithm: Merge Exclusive Features</font>

* **Input:** *numData*: number of data
* **Input:** $F$: one bundle of exclusive features
* *binRanges* $\leftarrow\{0\}$, *totalBin* $\leftarrow 0$
* **for** $f$ **in** $F$ **do**
    * *totalBin* $+=$ $f$.numBin
    * *binRanges*.append(*totalBin*)
* *newBin* $\leftarrow$ new Bin(*numData*)
* **for** $i=1$ **to** *numData* **do**
    * *newBin*$[i] \leftarrow 0$
    * **for** $j=1$ **to** len($F$) **do**
        1. **if** $F[j]$.bin$[i] \neq 0$ **then**
        2. *newBin*$[i] \leftarrow F[j]$.bin$[i]$ + *binRanges*$[j]$
* **Output:** *newBin*, *binRanges*
:::

In addition to GOSS and EFB, LightGBM uses two other techniques that further enhance its performance:

3. **Histogram-based Algorithm:**
    * Instead of finding the best split by examining all possible feature values, LightGBM discretizes continuous features into bins and constructs histograms. These histograms summarize the distribution of data and gradients within each bin. The algorithm then efficiently finds the optimal split point by evaluating these histograms, drastically reducing computational cost.

:::success
<font color=blue>Algorithm: Histogram-based Algorithm</font>

**Input:**
* $I$: Training dataset
* $d$: Maximum tree depth
* $m$: Number of features

1. **Initialize:**
    * *nodeSet* $\leftarrow\{0\} \triangleright$ tree nodes in current level
    * *rowSet* $\leftarrow\{\{0,1,2, \ldots\}\} \triangleright$ data indices in tree nodes
2. **for** $i=1$ to $d$ **do**
    * **for** *node* in *nodeSet* **do**
        * *usedRows* $\leftarrow$ *rowSet*[*node*]
        * **for** $k=1$ to $m$ **do**
            * $H \leftarrow$ new Histogram( ) $\triangleright$ Build histogram
            * **for** $j$ in *usedRows* **do**
                1. *bin* $\leftarrow I.\mathrm{f}[k][j].\mathrm{bin}$
                2. $H[bin].\mathrm{y} \leftarrow H[bin].\mathrm{y} + I.\mathrm{y}[j]$
                3. $H[bin].\mathrm{n} \leftarrow H[bin].\mathrm{n}+1$
            * Find the best split on histogram $H$
3. **Output:** the constructed tree
:::

4. **Leaf-wise Tree Growth:**
    * Unlike most decision trees, which grow level-wise, LightGBM grows trees in a leaf-wise manner. It chooses to split the leaf that will result in the greatest reduction of the loss function. This approach can lead to faster convergence and potentially better accuracy, but it may also increase the risk of overfitting, so careful hyperparameter tuning is necessary.

![Screenshot 2024-06-06 at 7.22.01 PM](https://hackmd.io/_uploads/SJ2JnM1rC.png)

### LightGBM in a Nutshell

1. **Initialization:** Start with a simple model (e.g., predicting the mean).
2. **Iteration:**
    * **Construct Histograms:** Create histograms for each feature based on binned values.
    * **GOSS Sampling:** Sample training data using the GOSS technique.
    * **Leaf-wise Growth:** Grow the tree leaf-wise, choosing splits that maximize the reduction in the loss function.
    * **EFB (Optional):** If applicable, bundle mutually exclusive features.
    * **Add to Ensemble:** Add the new tree to the ensemble of trees.
3. **Repeat:** Repeat step 2 for a specified number of iterations.
4. **Prediction:** The final prediction is a weighted sum of the predictions from all trees in the ensemble.

### Advantages of LightGBM

* **Faster Training:** Significantly faster than many other gradient boosting implementations due to GOSS and EFB.
* **Lower Memory Usage:** More efficient memory usage due to the same optimizations.
* **Higher Accuracy:** Often achieves comparable or better accuracy than other boosting algorithms.
* **Parallel and GPU Learning:** Supports parallel and GPU learning for further acceleration.
* **Categorical Feature Handling:** Can directly handle categorical features without one-hot encoding.

### Code Implementation

```python=
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ... (Prepare your data: X_train, X_test, y_train, y_test)

# Create LightGBM Dataset
train_data = lgb.Dataset(X_train, label=y_train)

# Set Hyperparameters (Customize as needed)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# Train the Model
num_round = 100
model = lgb.train(params, train_data, num_round)

# Make Predictions and Evaluate
# With the 'binary' objective, predict() returns probabilities, so threshold at 0.5
y_pred_proba = model.predict(X_test)
y_pred = (y_pred_proba > 0.5).astype(int)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
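LightGBM also exposes a scikit-learn compatible wrapper, which is often more convenient inside pipelines. A minimal sketch roughly equivalent to the native-API example above, using the same illustrative hyperparameter values:

```python=
import lightgbm as lgb
from sklearn.metrics import accuracy_score

# scikit-learn style estimator: fit/predict instead of Dataset + train
clf = lgb.LGBMClassifier(
    boosting_type='gbdt',
    num_leaves=31,
    learning_rate=0.05,
    n_estimators=100,
    colsample_bytree=0.9   # alias of feature_fraction in the native API
)
clf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```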
## CatBoost: A Gradient Boosting Library for Categorical Features

CatBoost distinguishes itself through its innovative techniques for handling categorical features and preventing overfitting in gradient boosting models.

### How CatBoost Works

CatBoost's unique approach to handling categorical features revolves around two main concepts:

#### Ordered Target Statistics (Ordered TS)

* **Challenge:** Traditional gradient boosting algorithms often require categorical features to be one-hot encoded. This can lead to several issues:
    * **High Dimensionality:** One-hot encoding can create many new features, increasing the model's complexity and potentially leading to overfitting.
    * **Target Leakage:** When calculating target statistics (the average target value for each category), using information from all training instances can introduce target leakage, where future information is used to predict the past.
* **CatBoost's Solution:**
    * **Target Statistics (TS):** CatBoost converts categorical features into numerical target statistics instead of one-hot encoding. The target statistic for a category represents the average target value for the training instances belonging to that category. This reduces dimensionality and provides a meaningful representation of the feature.
    * **Ordered TS:** To prevent target leakage, CatBoost introduces ordered target statistics. The calculation is modified to use only the information from instances that precede the current instance in a randomly shuffled order (see the sketch below).
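A small NumPy sketch of an ordered target statistic for one categorical column, assuming a toy `categories`/`target` pair and a smoothing prior weight `a` (the names are illustrative); each instance's statistic is computed only from the instances that precede it in a random permutation.

```python=
import numpy as np

def ordered_target_statistic(categories, target, a=1.0, prior=None, seed=0):
    """Encode one categorical column with ordered target statistics:
    each instance sees only the examples that come before it in a random
    permutation, blended with a smoothing prior."""
    n = len(target)
    prior = target.mean() if prior is None else prior
    perm = np.random.default_rng(seed).permutation(n)

    sums, counts = {}, {}
    ts = np.empty(n)
    for idx in perm:                       # walk the data in the shuffled order
        c = categories[idx]
        ts[idx] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + target[idx]    # only now reveal this target
        counts[c] = counts.get(c, 0) + 1
    return ts

cats = np.array(["red", "blue", "red", "red", "blue", "green"])
y    = np.array([1, 0, 1, 0, 1, 1])
print(ordered_target_statistic(cats, y))
```

Because each row's encoding never sees its own label, or any label that follows it in the permutation, the statistic stays unbiased with respect to that row, which is exactly the leakage protection described above.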
#### Ordered Boosting

* **Challenge:** Traditional gradient boosting can also suffer from prediction shift, where the model learns from its errors on future data points, potentially leading to overfitting.
* **CatBoost's Solution:** Ordered boosting addresses this by aligning the calculation of gradients (errors) with the ordered target statistics. When building each new tree, the gradients are calculated using a model that has only been trained on the instances preceding the current instance in the permutation. This helps create a more robust and accurate model.

:::success
<font color=blue>Algorithm: Ordered boosting</font>

**Input:**
* Training data: $\left\{\left(\mathrm{x}_k, y_k\right)\right\}_{k=1}^n$
* Number of iterations: $I$

1. Create a random permutation $\sigma$ of the training instances.
2. Initialize the model $M_0$ to predict the average target value.
3. For $t = 1$ to $I$:
    * Calculate the residuals (errors) $r_i = y_i - M_{t-1}(x_i)$ for each instance $x_i$ using the previous model $M_{t-1}$.
    * Build a new model $\Delta M_t$ using the residuals $r_i$ as the target values and, for each instance $x_i$, only the instances preceding $x_i$ in the permutation $\sigma$.
    * Update the model: $M_t = M_{t-1} + \Delta M_t$
4. Return the final model $M_I$.
:::

#### Oblivious Trees

In addition to Ordered TS and Ordered Boosting, CatBoost leverages oblivious decision trees as its base learners.

* **What are Oblivious Trees?** Unlike traditional decision trees, where each node can have a different splitting criterion, oblivious trees are symmetrical. This means they use the same splitting criterion for all nodes at a given level of the tree.
* **Why Oblivious Trees?**
    * **Faster Predictions:** Oblivious trees have a simpler structure, leading to faster prediction times compared to non-symmetric trees.
    * **Reduced Overfitting:** The balanced nature of oblivious trees often helps reduce overfitting by preventing the model from becoming overly complex.
    * **Alignment with Ordered Boosting:** The symmetrical structure of oblivious trees aligns well with the principles of ordered boosting, as it maintains a consistent decision path for all instances at the same level, regardless of their position in the permutation.

:::success
<font color=blue>Algorithm: Building an Oblivious Decision Tree in CatBoost</font>

**Input:**
* Training examples $\{(x_i, y_i)\}_{i=1}^n$
* Number of tree levels $d$
* Number of permutations $s$

1. Generate $s+1$ random permutations of the training examples: $\sigma_0, \sigma_1, \ldots, \sigma_s$.
2. Initialize the tree structure $T$ with a single leaf.
3. For each level $l = 1, \ldots, d$:
    * For each permutation $\sigma_r \in \{\sigma_1, \ldots, \sigma_s\}$:
        * Compute the ordered target statistics for each categorical feature using $\sigma_r$.
        * Calculate the gradients $g_i$ and hessians $h_i$ for each example using the current model predictions.
    * Find the best split criterion $c_l$ based on the gradients and hessians.
    * Split each leaf in $T$ according to $c_l$, creating new leaves.
4. For each permutation $\sigma_r \in \{\sigma_1, \ldots, \sigma_s\}$:
    * Compute the ordered target statistics for each categorical feature using $\sigma_r$.
    * Calculate the leaf values for each leaf in $T$ by averaging the gradients of the examples in that leaf.
5. Compute the final leaf values by averaging the leaf values from all permutations.

**Output:** The oblivious decision tree $T$ with the final leaf values.
:::
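To illustrate why the symmetric structure makes prediction cheap, here is a small NumPy sketch of evaluating a depth-$d$ oblivious tree: because every level applies the same test, a sample's leaf is just the integer formed by its $d$ comparison bits. The feature indices, thresholds, and leaf values below are made-up toy numbers, not CatBoost internals.

```python=
import numpy as np

def oblivious_tree_predict(X, split_features, split_thresholds, leaf_values):
    """Evaluate an oblivious tree: level l applies the same (feature, threshold)
    test to every node, so the leaf index is a d-bit binary number."""
    leaf_index = np.zeros(X.shape[0], dtype=int)
    for feat, thr in zip(split_features, split_thresholds):
        bit = (X[:, feat] > thr).astype(int)
        leaf_index = (leaf_index << 1) | bit
    return leaf_values[leaf_index]

# Toy depth-2 tree over 2 features -> 2**2 = 4 leaves
X = np.array([[0.2, 5.0],
              [0.9, 1.0]])
preds = oblivious_tree_predict(
    X,
    split_features=[0, 1],
    split_thresholds=[0.5, 3.0],
    leaf_values=np.array([-1.0, -0.5, 0.5, 1.0]),
)
print(preds)   # leaf indices 0b01 and 0b10 -> values -0.5 and 0.5
```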
### CatBoost in a Nutshell

Here's a simplified summary of the CatBoost algorithm:

1. **Initialization:** Start with a simple model (e.g., predicting the mean).
2. **Iteration:**
    * **Calculate Ordered TS:** Compute ordered target statistics for all categorical features.
    * **Calculate Gradients and Hessians:** Calculate the first and second derivatives of the loss function with respect to the current model's predictions.
    * **Build an Oblivious Tree:** Grow a tree based on the gradients, the ordered TS, and the chosen boosting scheme (plain or ordered boosting).
    * **Add to Ensemble:** Add the new tree to the ensemble, scaled by a learning rate.
3. **Repeat:** Repeat the above steps for a specified number of iterations (trees).
4. **Output:** The final prediction is a weighted sum of the predictions from all trees in the ensemble.

### Advantages of CatBoost

* **Handles Categorical Features Effectively:** Eliminates the need for extensive categorical feature engineering or one-hot encoding.
* **Reduced Overfitting:** Ordered TS and ordered boosting help prevent overfitting and target leakage.
* **Fast Training (especially on GPUs):** Can be significantly faster than other boosting libraries, especially on larger datasets.
* **High Performance:** Often achieves state-of-the-art results, particularly on datasets with categorical features.
* **Interpretability Tools:** Provides tools for model analysis and understanding feature importance.

### Limitations of CatBoost

* **Memory Usage:** Can be memory-intensive for large datasets or complex models.
* **Hyperparameter Tuning:** Still requires some hyperparameter tuning, though less than traditional gradient boosting methods.

### Code Implementation

```python=
from catboost import CatBoostClassifier

# Create a CatBoostClassifier instance
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6)

# Fit the model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
```
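When the training data contains raw categorical columns, they can be passed to CatBoost directly by index (or column name for DataFrames) instead of being one-hot encoded beforehand. A minimal sketch, assuming hypothetical categorical columns at positions 0 and 3 of `X_train`:

```python=
from catboost import CatBoostClassifier, Pool

# Indices of the categorical columns in X_train (hypothetical for this sketch)
cat_feature_indices = [0, 3]

# Declare the categorical columns when building the training data
train_pool = Pool(X_train, label=y_train, cat_features=cat_feature_indices)

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, verbose=False)
model.fit(train_pool)

# The same categorical columns are declared for the test data at prediction time
test_pool = Pool(X_test, cat_features=cat_feature_indices)
predictions = model.predict(test_pool)
```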
## Comparison

| Feature | AdaBoost | XGBoost | LightGBM | CatBoost |
|---|---|---|---|---|
| **Strengths** | Simple, easy to interpret, less prone to overfitting | High accuracy, scalable, handles sparse data well | Fast, efficient, low memory footprint, handles large datasets | Excels with categorical features, high accuracy, robust to overfitting |
| **Key Features** | Adaptive boosting, decision tree base learners | Regularized gradient boosting, parallel processing, tree pruning | Gradient-based one-side sampling (GOSS), histogram-based algorithm, leaf-wise tree growth, exclusive feature bundling (EFB) | Ordered target statistics, ordered boosting, oblivious trees, fast GPU training |
| **Year Published** | 1995 | 2014 | 2016 | 2017 |
| **Speed** | Moderate | Fast | Very Fast | Fast |
| **Memory Usage** | Low | Moderate | Low | Moderate to high |
| **Hyperparameter Tuning** | Relatively easy | Can be complex, many options | Moderate | Relatively easy |
| **Best For** | Smaller datasets, interpretability | Large datasets, high performance | Large datasets, limited resources | Datasets with categorical features |

## Example: Titanic - Machine Learning from Disaster

https://www.kaggle.com/code/yuhsianh/boosting

## Reference

* Jerome H. Friedman. "Greedy function approximation: A gradient boosting machine." The Annals of Statistics 29(5), 1189-1232, October 2001. [link](https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boostingmachine/10.1214/aos/1013203451.full?tab=ArticleLink)
* Yoav Freund and Robert E. Schapire. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." Journal of Computer and System Sciences, Volume 55, Issue 1, 1997, Pages 119-139. [link](https://www.sciencedirect.com/science/article/pii/S002200009791504X)
* Yoav Freund and Robert E. Schapire. 1996. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML'96). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 148–156. [link](https://dl.acm.org/doi/10.5555/3091696.3091715)
* [XGBoost](https://xgboost.readthedocs.io/)
* Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). Association for Computing Machinery, New York, NY, USA, 785–794. [link](https://doi.org/10.1145/2939672.2939785)
* [LightGBM](https://lightgbm.readthedocs.io/)
* Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 3149–3157. [link](https://papers.nips.cc/paper_files/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)
* [CatBoost](https://catboost.ai/)
* Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18). Curran Associates Inc., Red Hook, NY, USA, 6639–6649. [link](https://papers.nips.cc/paper_files/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html)