# Tree-Based Methods

Tree-based methods offer an interpretable and powerful approach to modeling complex relationships in data. This note explores decision trees for regression and classification problems, strategies to enhance accuracy, and comparisons with linear models and ensemble methods.

## Decision Trees Basics

Decision trees recursively partition the predictor space into simple regions. New observations are assigned predictions based on which region they fall into.

### Regression Trees

#### Terminology

* **Recursive Binary Splitting:** Repeatedly dividing the predictor space into smaller regions based on optimal splitting rules that minimize a metric such as RSS.
* **Terminal Nodes/Leaves:** Final partitions where predictions are made.
* **Internal Nodes:** Split points defined by a feature and a threshold.
* **Branches:** Connections between nodes representing the decision logic.

![0_12CT72krFYLtGFGx](https://hackmd.io/_uploads/rkWnOXK-0.png)

#### Building a Tree

1. **Recursive Binary Splitting:** Iteratively split the predictor space until a stopping criterion is met:
    * Define the half-planes (the **splitting rule**)
    $$R_1(j, s) = \{X \mid X_j < s\}, \quad R_2(j, s) = \{X \mid X_j \geq s\}$$
    * Find the feature $j$ and cutpoint $s$ minimizing the RSS of the resulting split,
    $$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2.$$
    Note that a feature used in one split can be chosen again in subsequent splits.
    * Repeat the process, but instead of splitting the entire predictor space, split one of the previously identified regions.
2. **Predictions:** For a new observation in region $R_j$, predict the mean response of the training data in $R_j$, denoted by $\hat{y}_{R_j}$.
3. **Stopping Criterion:** Stop splitting once a criterion is met, such as a maximum tree depth or a minimum number of observations per region. Even so, fully grown trees tend to overfit the training data.

#### Pruning

To prevent overfitting, we can prune a large initial tree back to an optimal subtree:

:::success
<font color=blue>Algorithm: Pruned Decision Tree (Cost-Complexity Pruning via Cross-Validation)</font>

**Input:**
* Training data: Predictors $X=(x_1, \dots, x_p)$, response $Y$
* Number of cross-validation folds K

**Output:** Pruned decision tree model

**Procedure:**
1. **Grow the Initial Tree:**
    * Use a standard decision tree algorithm (e.g., CART) to grow a large, potentially overfit tree on the entire training dataset. Call this tree $T_0$.
2. **Cost-Complexity Pruning:**
    * For a sequence of increasing complexity parameter values $\alpha$:
        * Prune the tree $T_0$ using cost-complexity pruning.
        * This involves iteratively collapsing the internal nodes that give the smallest increase in the cost-complexity criterion:
        $$\text{Cost Complexity} = \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$
        * Store the resulting pruned tree for each $\alpha$ value.
3. **Cross-Validation for Optimal $\alpha$:**
    * Divide the training data into K folds.
    * For each fold `k` (1 to K):
        * Use the remaining K-1 folds as training data.
        * Repeat steps 1 and 2 to grow and prune trees for each $\alpha$ value.
        * Evaluate the performance (e.g., mean squared error) of each pruned tree on the held-out fold `k`.
    * Average the performance across all folds for each $\alpha$ value.
    * Select the $\alpha$ value that results in the lowest average error.
4. **Final Pruned Tree:**
    * Retrieve the pruned tree from step 2 that corresponds to the optimal $\alpha$ value.
:::

#### Key Considerations

* **Cross-Validation for $\alpha$:** Choosing the pruning parameter $\alpha$ by cross-validation is crucial for finding a tree that generalizes well to unseen data.
* **Cost-Complexity Pruning:** A common and effective pruning method that balances model complexity (number of terminal nodes) against predictive accuracy.
* **Error Metric:** The pseudocode uses mean squared error for regression. For classification, use a suitable metric such as the misclassification rate or cross-entropy.
* **Alternative Pruning:** Other pruning techniques exist, such as Reduced Error Pruning (REP) and Minimum Error Pruning (MEP).

#### Example: Predicting Housing Prices with Regression Trees

##### Libraries and Data

```python=
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

# Assuming you have the 'ISLP' module for data loading
from ISLP import load_data
from ISLP.models import ModelSpec
```

##### Prepare the Dataset

```python=+
Boston = load_data("Boston")

# Create design matrix (X) and target vector (y)
model = ModelSpec(Boston.columns.drop('medv'), intercept=False)
D = model.fit_transform(Boston)
feature_names = list(D.columns)
X = np.asarray(D)
y = Boston['medv']
```

##### Build and Visualize the Initial Tree

```python=+
# Split data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train a regression tree with limited depth
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X_train, y_train)

# Visualize the tree
fig, ax = plt.subplots(figsize=(12, 12))
plot_tree(reg, feature_names=feature_names, ax=ax);
plt.show()
```

![下載 (31)](https://hackmd.io/_uploads/Hk61aOo-0.png)

##### Prune the Tree with Cost Complexity Pruning

```python=+
# Cost complexity pruning
path = reg.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Find the optimal alpha using cross-validation
grid = GridSearchCV(reg, {'ccp_alpha': ccp_alphas},
                    cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
best_tree = grid.best_estimator_
```

##### Evaluate the Pruned Tree

```python=+
# Performance on the test set
test_mse = mean_squared_error(y_test, best_tree.predict(X_test))
rmse = np.sqrt(test_mse)
print("Root Mean Squared Error (RMSE) of the pruned tree:", rmse)
```

```python
Root Mean Squared Error (RMSE) of the pruned tree: 5.2980994280736216
```

##### Visualize the Pruned Tree

```python=+
fig, ax = plt.subplots(figsize=(12, 12))
plot_tree(best_tree, feature_names=feature_names, ax=ax)
plt.show()
```

![下載 (30)](https://hackmd.io/_uploads/rJkXpusbR.png)

### Classification Trees

Classification trees work similarly, but they predict a qualitative response. The key differences are:

* **Splitting Criteria:** Instead of RSS, classification trees rely on metrics such as the *Gini index, entropy, or classification error rate* to determine the optimal splits. These metrics measure the impurity (or heterogeneity) of a node (a short numerical sketch follows below).
* **Prediction:** For each new observation, the tree assigns the most common class label among the training observations in the terminal node (leaf) where the observation ends up.
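To make node impurity concrete, here is a minimal sketch (a hypothetical helper, not part of the ISLP labs) that computes the three impurity measures defined in the next subsection for a toy set of class proportions. A nearly pure node scores low on all three measures, while an evenly mixed node scores highest.

```python=
import numpy as np

def impurity_measures(p):
    """Impurity of a single node, given its class proportions p_mk.
    (Illustrative helper; the formulas are given in the next subsection.)"""
    p = np.asarray(p, dtype=float)
    error = 1 - p.max()                              # classification error rate
    gini = np.sum(p * (1 - p))                       # Gini index
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # entropy
    return error, gini, entropy

# A nearly pure node vs. an evenly mixed node (two classes)
print(impurity_measures([0.9, 0.1]))   # all three measures are small
print(impurity_measures([0.5, 0.5]))   # all three measures are at their maximum
```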
#### Common Splitting Criteria

* **Classification Error Rate:** $E = 1 - \max_k(\hat{p}_{mk})$
    * $\hat{p}_{mk}$ is the proportion of training observations in the $m$-th region that are from the $k$-th class.
    * This metric is not sensitive enough to node purity for growing trees; it is mainly used when pruning.
* **Gini Index:** $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$
    * The Gini index is low when most observations within a node belong to the same class (high purity).
* **Entropy:** $D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
    * Entropy behaves similarly to the Gini index, taking small values for pure nodes.

#### Example: Predicting Car Seat Sales with Classification Trees

**Goal:** Develop a classification tree to predict whether car seat sales will be high (`Sales > 8`) or low based on product features and marketing factors.

##### Prepare the Data

```python=
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from ISLP import load_data, confusion_table
from ISLP.models import ModelSpec

# Load the Carseats dataset
Carseats = load_data('Carseats')

# Create labels for high vs. low sales
High = np.where(Carseats.Sales > 8, "Yes", "No")

# ModelSpec one-hot encodes the categorical features
model = ModelSpec(Carseats.columns.drop('Sales'), intercept=False)
X = model.fit_transform(Carseats)  # Transformed feature matrix
feature_names = list(X.columns)
```

##### Build an Initial Tree

```python=+
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
clf.fit(X, High)

# Evaluate on the training set
train_accuracy = accuracy_score(High, clf.predict(X))
print("Training Accuracy:", train_accuracy)

# Visualize the tree
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 12))
plot_tree(clf, feature_names=feature_names, ax=ax, class_names=["No", "Yes"]);
plt.show()
```

```python
Training Accuracy: 0.79
```

![下載 (32)](https://hackmd.io/_uploads/BkOCbtiZC.png)

##### Prune with Cost Complexity Pruning

```python=+
X_train, X_test, High_train, High_test = model_selection.train_test_split(
    X, High, test_size=0.5, random_state=0
)

path = clf.cost_complexity_pruning_path(X_train, High_train)
ccp_alphas = path.ccp_alphas

# Use cross-validation to tune alpha
grid = model_selection.GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=0),
    {'ccp_alpha': ccp_alphas},
    cv=5,  # 5-fold cross-validation
    scoring='accuracy'
)
grid.fit(X_train, High_train)
best_tree = grid.best_estimator_
```

##### Evaluate the Pruned Tree

```python=+
# Visualize the pruned tree
fig, ax = plt.subplots(figsize=(12, 12))
plot_tree(best_tree, feature_names=feature_names, ax=ax, class_names=["No", "Yes"]);
plt.show()

# Performance on test data
test_accuracy = accuracy_score(High_test, best_tree.predict(X_test))
print("Test Accuracy:", test_accuracy)

# Confusion Matrix
print(confusion_table(best_tree.predict(X_test), High_test))
```

![下載 (33)](https://hackmd.io/_uploads/rkOlftsbC.png)

```python
Test Accuracy: 0.675
Truth      No  Yes
Predicted
No         93   40
Yes        25   42
```

### Trees Versus Linear Models

* Linear Regression
    * **Model:** Assumes a linear relationship between the target variable and the features:
    $$f(X) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$$
    * **Pros:** Simple, interpretable, and often works well when the relationship is close to linear.
    * **Cons:** Limited for complex non-linear patterns; may need manual transformations (e.g., dummy variables, interactions) to handle categorical features and non-linearities.
* Decision Trees
    * **Model:** Partitions the feature space into regions, predicting a constant within each:
    $$f(X) = \sum_{m=1}^{M} c_m \cdot \mathbb{1}_{(X \in R_m)}$$
    * **Pros:** Interpretable, handle non-linearities and mixed data types well, good for feature selection.
    * **Cons:** Prone to overfitting (addressed by pruning or ensemble methods), can be less stable than linear models.

### Advantages and Disadvantages

* **Advantages:**
    * Easy to interpret and visualize
    * Can handle mixed data types naturally
    * Automatic feature selection
* **Disadvantages:**
    * Prone to overfitting if not pruned
    * Less accurate than some methods for simple linear patterns
    * Instability: minor data changes can significantly alter the tree structure

## Ensemble Methods

Ensemble methods are a powerful class of machine learning techniques that strategically combine multiple simple models (often referred to as "weak learners") into a more robust and accurate final model. The underlying principle is that a group of diverse models, each with its own strengths and weaknesses, can collectively outperform any individual model.

### Bagging (Bootstrap Aggregation)

* **Goal:** Reduce the variance and overfitting of individual trees
* **Key Idea:**
    1. **Bootstrap Sampling:** Create multiple bootstrap samples by randomly drawing observations from the original dataset with replacement. Each sample has the same size as the original data.
    2. **Model Training:** Fit a separate model (often a decision tree) to each bootstrap sample.
    3. **Aggregation:** For prediction, combine the results from all the individual models:
        * **Regression:** Average the predictions.
        * **Classification:** Take a majority vote.

![An-example-of-bootstrap-sampling-Since-objects-are-subsampled-with-replacement-some](https://hackmd.io/_uploads/Hy7BnNtZC.png)

#### Why Bagging Works

* **Reducing Variance:** By training models on diverse samples of the data, bagging reduces the model's sensitivity to specific variations in the training set. Individual trees may overfit, but their errors tend to average out when combined.
* **Number of Trees:** Increasing the number of trees does not cause bagging to overfit; beyond some point the extra trees simply yield diminishing returns, so $B$ is chosen large enough for the error to settle down.

#### Out-of-Bag (OOB) Error Estimation

* **Purpose:** OOB error provides a convenient way to estimate the prediction error during the bagging process itself, removing the need for a separate validation set.
* **Mechanism:**
    * Each observation is left out ("out-of-bag") of roughly 1/3 of the bootstrap samples.
    * To predict an observation's value, use only those trees for which it was OOB.
    * Average these predictions (regression) or take a majority vote (classification).
* **Benefits:** OOB error offers a computationally efficient and often reliable assessment of model performance.

### Random Forests

* **Key Idea:** Introduce randomness when growing trees to further decorrelate them
    * At each split, only consider a random subset of features
* **Mechanism:** When building each tree, random forests introduce randomness at the split level:
    1. **Bootstrap Sampling:** As in bagging, each tree is trained on a random sample of the data drawn with replacement.
    2. **Random Feature Selection:** At each split, only a random subset of $m$ out of the $p$ predictors is considered. A common choice is $m \approx \sqrt{p}$ (for regression, $m \approx p/3$ is also typical).
* **Impact:** This forced randomness injects diversity into the trees, making them less correlated and improving the error reduction obtained from averaging or voting.
* **Relationship to Bagging:** If $m = p$ (all features considered at each split), a random forest is equivalent to bagging.

#### Example: Predicting Housing Prices

##### Setup

```python=
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load and prepare data
Boston = load_data("Boston")
model = ModelSpec(Boston.columns.drop('medv'), intercept=False)
D = model.fit_transform(Boston)
feature_names = list(D.columns)
X = np.asarray(D)
y = Boston['medv']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```

##### Bagging

```python=+
bag_boston = RandomForestRegressor(max_features=X.shape[1], random_state=0)
bag_boston.fit(X_train, y_train)

y_hat_bag = bag_boston.predict(X_test)
test_mse_bagging = np.mean((y_test - y_hat_bag)**2)
print(f"Test MSE for Bagging: {test_mse_bagging:.2f}")
```

```python
Test MSE for Bagging: 14.63
```

##### Random Forest

```python=+
RF_boston = RandomForestRegressor(max_features=6, random_state=0)
RF_boston.fit(X_train, y_train)

y_hat_RF = RF_boston.predict(X_test)
test_mse_rf = np.mean((y_test - y_hat_RF)**2)
print(f"Test MSE for Random Forest: {test_mse_rf:.2f}")
```

```python
Test MSE for Random Forest: 20.04
```

##### Variable Importance

```python=+
feature_imp = pd.DataFrame(
    {'importance': RF_boston.feature_importances_},
    index=feature_names
)
feature_imp.sort_values(by='importance', ascending=False)
```

```python
         importance
lstat      0.356203
rm         0.332163
ptratio    0.067270
crim       0.055404
indus      0.053851
dis        0.041582
nox        0.035225
tax        0.025355
age        0.021506
rad        0.004784
chas       0.004203
zn         0.002454
```

### Boosting

* **Intuition:** Boosting is an iterative ensemble method that turns weak learners (often shallow decision trees) into a strong, highly accurate model by repeatedly focusing on their shortcomings.
* **Key Ideas:**
    1. **Sequential Training:** Unlike bagging or random forests, boosting trains trees sequentially.
    2. **Focus on Errors:** Each new tree is trained specifically to reduce the errors made by the previous trees, so observations that were difficult to predict correctly receive more attention.
    3. **Weighted Combination:** The final prediction is a combination of all the trees, with each tree's contribution scaled by the learning rate (and, in some variants such as AdaBoost, by a performance-based weight).

#### Algorithm

:::success
<font color=blue>Algorithm: Boosting (Gradient Boosting for Regression)</font>

**Input:**
* Training data: Predictors $X=(x_1, \dots, x_p)$, response $Y$
* Number of boosting iterations B
* Learning rate $\lambda$, also called shrinkage
* Loss function (typically mean squared error for regression)
* Base learner (usually a decision tree with a limited depth)

**Output:** Boosted model $\hat{f}(x)$

**Procedure:**
1. **Initialize:**
    * Set the initial prediction for all training instances to zero: $\hat{f}(x_i)=0$
    * Set the initial residuals equal to the observed values: $r_i=y_i$
2. **Iterate (Boosting Rounds):**
    * For b = 1 to B:
        * **Compute Pseudo-Residuals (Negative Gradient):**
            * Calculate the negative gradient of the loss function with respect to the current predictions:
            $$r_i = - \frac{\partial\, \text{Loss}(y_i, \hat{f}(x_i))}{\partial \hat{f}(x_i)}$$
            * For squared-error loss this is (up to a constant factor) just the current residual $y_i - \hat{f}(x_i)$.
        * **Fit Base Learner:**
            * Fit the base learner (decision tree) to the training data, using the pseudo-residuals $r_i$ as the response variable.
            * Obtain the predictions from this tree: $\hat{f}_b(x_i)$
        * **Line Search (Optimal Step Size):**
            * Find the step size that minimizes the loss when a scaled version of the tree's predictions is added to the current model:
            $$\gamma_b = \arg\min_{\gamma} \left\{ \sum_i \text{Loss}\big(y_i, \hat{f}(x_i) + \gamma \hat{f}_b(x_i)\big) \right\}$$
        * **Update Model:**
            * Update the model by adding the scaled tree's predictions:
            $$\hat{f}(x_i) \leftarrow \hat{f}(x_i) + \lambda \gamma_b \hat{f}_b(x_i)$$
        * **Update Residuals:**
            $$r_i = y_i - \hat{f}(x_i)$$
            (For squared-error loss these are exactly the pseudo-residuals used in the next round.)
3. **Output Boosted Model:**
    * Return the final boosted model:
    $$\hat{f}(x) = \sum_{b=1}^{B} \lambda \gamma_b \hat{f}_b(x)$$
:::

##### Key Points

* **Sequential Fitting:** Boosting is an iterative process in which each new model (base learner) is trained to correct the errors of the previous ones.
* **Gradient Descent:** Boosting is essentially gradient descent in function space; the pseudo-residuals give the direction of improvement.
* **Regularization:** The learning rate $\lambda$ is a crucial regularization parameter that controls the contribution of each tree and helps prevent overfitting.
* **Loss Function:** The choice of loss function depends on the problem (e.g., mean squared error for regression, logistic loss for classification).
* **Base Learners:** Decision trees are the most popular base learners, but other models (e.g., linear models) can be used.
* **Variations:** Many variants of boosting exist (e.g., AdaBoost, XGBoost, LightGBM), differing in how they build trees, which loss functions they support, and how they optimize.

#### Tuning Parameters

* **Number of Trees $B$:** More trees yield a more complex model, and unlike bagging, boosting can overfit if $B$ is too large. Use cross-validation to choose $B$ (a sketch follows this list).
* **Shrinkage $\lambda$:** Limits the influence of each tree, controlling the learning rate. Smaller values (e.g., 0.01) typically require more trees for good performance.
* **Depth of Trees $d$:** Shallow trees (often stumps with $d=1$) are common in boosting; the depth controls the degree of interaction allowed among features.
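The sketch below shows one way to tune these parameters with cross-validation. It assumes the `X_train`/`y_train` Boston split used throughout this note, and the grid values are purely illustrative (the full grid can take a while to run), not recommended settings.

```python=
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative grid over the three tuning parameters discussed above:
# number of trees B, shrinkage lambda, and tree depth d.
param_grid = {
    'n_estimators': [1000, 3000, 5000],   # B
    'learning_rate': [0.001, 0.01, 0.1],  # lambda
    'max_depth': [1, 2, 3],               # d
}

boost_cv = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
)
# Assumes X_train, y_train from the Boston split used elsewhere in this note
boost_cv.fit(X_train, y_train)
print(boost_cv.best_params_)
```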
#### Example: Predicting Housing Prices

```python=
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load and prepare data
Boston = load_data("Boston")
model = ModelSpec(Boston.columns.drop('medv'), intercept=False)
D = model.fit_transform(Boston)
feature_names = list(D.columns)
X = np.asarray(D)
y = Boston['medv']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train a Gradient Boosting Regressor
boost_boston = GradientBoostingRegressor(n_estimators=5000,
                                         learning_rate=0.001,
                                         max_depth=3,
                                         random_state=0)
boost_boston.fit(X_train, y_train)

# Track test error over boosting iterations
test_error = np.zeros_like(boost_boston.train_score_)
for i, y_pred in enumerate(boost_boston.staged_predict(X_test)):
    test_error[i] = np.mean((y_test - y_pred)**2)

# Visualize training and test error curves
plt.figure(figsize=(8, 8))
plt.plot(boost_boston.train_score_, label='Training')
plt.plot(test_error, label='Test')
plt.legend()
plt.show()

# Final evaluation
y_hat_boost = boost_boston.predict(X_test)
test_mse = np.mean((y_test - y_hat_boost)**2)
print(f"Test MSE: {test_mse:.2f}")
```

```python
Test MSE: 14.48
```

![下載 (34)](https://hackmd.io/_uploads/B1qnbAsZA.png)

### Bayesian Additive Regression Trees (BART)

BART is a flexible, non-parametric method that combines decision trees with a Bayesian framework. It offers good performance with minimal tuning, making it a strong choice for many regression tasks.

#### Algorithm

:::success
<font color=blue>Algorithm: BART</font>

**Input:**
* Training data: Predictors $X=(x_1,\dots,x_p)$, response $Y$
* Number of trees $K$
* Number of iterations $B$
* Burn-in period $L$
* Prior distributions for tree parameters (e.g., tree depth, node splitting probabilities)

**Output:** Posterior distribution of the ensemble model $\hat{f}(x)$

**Procedure:**
1. **Initialization:**
    * Initialize each of the K trees to a constant, chosen so that the initial ensemble equals the mean response:
    $$\hat{f}_{k}^1(x) = \frac{1}{nK} \sum_{i=1}^{n} y_i, \quad k=1,2,\dots,K$$
    * Set the initial ensemble model to the sum of the individual trees:
    $$\hat{f}^1(x) = \sum_{k=1}^{K} \hat{f}_{k}^1(x) = \frac{1}{n} \sum_{i=1}^{n} y_i$$
2. **MCMC Iterations (Posterior Sampling):**
    * For b = 2 to B:
        * **Update Each Tree:**
            * For k = 1 to K:
                * **Calculate Partial Residuals:**
                    * For each training instance `i`:
                    $$r_i = y_i - \sum_{k'<k} \hat{f}_{k'}^b(x_i) - \sum_{k'>k} \hat{f}_{k'}^{b-1}(x_i)$$
                    (This is the residual after removing the predictions of every tree in the ensemble except the current one.)
                * **Propose a New Tree:**
                    * Generate a proposal for a new tree $\hat{f}_k^b(x)$ by randomly perturbing the tree $\hat{f}_k^{b-1}(x)$ from the previous iteration.
                    * Perturbations can include changes to splitting variables, split points, or leaf node values.
                * **Metropolis-Hastings Step:**
                    * Accept or reject the proposed tree based on the Metropolis-Hastings acceptance criterion, which depends on the likelihood of the data given the proposed tree and the prior distribution of the tree parameters.
        * **Update Ensemble Model:**
            * Compute the new ensemble model:
            $$\hat{f}^b(x) = \sum_{k=1}^{K} \hat{f}_{k}^b(x)$$
3. **Output:**
    * Discard the first L samples (burn-in).
    * Average the remaining B - L ensemble models to obtain the final prediction (an estimate of the posterior mean):
    $$\hat{f}(x) = \frac{1}{B - L} \sum_{b=L+1}^{B} \hat{f}^b(x)$$
:::

#### Key Considerations

* **Bayesian Framework:** BART uses Bayesian inference to estimate the posterior distribution of the model, providing uncertainty estimates for predictions.
* **MCMC Sampling:** Markov chain Monte Carlo (MCMC) sampling is used to explore the posterior distribution of the model parameters (tree structures and leaf values).
* **Tree Perturbations:** The algorithm randomly modifies the trees to explore different model structures and avoid getting stuck in local optima.
* **Prior Distributions:** Priors on the tree parameters regularize the model and help prevent overfitting.

#### Advantages

* **Strong Performance:** BART often yields accurate predictions out of the box.
* **Robustness:** Tree perturbations and limitations on tree complexity help prevent overfitting.
* **Interpretability:** While not as directly interpretable as a single tree, variable importance (inclusion frequency) can still be extracted.

#### Example: Boston Housing Prices

```python=
from ISLP.bart import BART

# Load and prepare data
Boston = load_data("Boston")
model = ModelSpec(Boston.columns.drop('medv'), intercept=False)
D = model.fit_transform(Boston)
feature_names = list(D.columns)
X = np.asarray(D)
y = Boston['medv']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train the BART model
bart_boston = BART(random_state=0, burnin=5, ndraw=15)
bart_boston.fit(X_train, y_train)

# Evaluation
yhat_test = bart_boston.predict(X_test.astype(np.float32))
mse = np.mean((y_test - yhat_test)**2)
print(f"Test MSE: {mse:.2f}")

# Variable importance: how often each feature appears in the sampled trees,
# averaged over the posterior draws
var_inclusion = pd.Series(bart_boston.variable_inclusion_.mean(0),
                          index=feature_names)
print(var_inclusion)
```

```python
Test MSE: 22.15
crim       26.933333
zn         27.866667
indus      26.466667
chas       22.466667
nox        26.600000
rm         29.800000
age        22.733333
dis        26.466667
rad        23.666667
tax        24.133333
ptratio    24.266667
lstat      31.000000
dtype: float64
```

### Summary of Tree Ensemble Methods

| Method | Key Idea | Strength | Weakness |
| --- | --- | --- | --- |
| Bagging | Reduce variance by averaging predictions from independent trees built on bootstrapped samples | Improves accuracy and stability | Reduced interpretability |
| Random Forests | Improve bagging by randomly selecting features at each split, decorrelating the trees | Further variance reduction | Same interpretability loss as bagging |
| Boosting | Sequentially build trees, each focusing on correcting the errors of the previous trees | Reduces both bias and variance | Prone to overfitting if not tuned |
| BART | Bayesian approach that iteratively refines trees by proposing small changes (perturbations) | Robust, often good out of the box | Less directly interpretable |

## Reference

* Chapter 8 of [An Introduction to Statistical Learning with Applications in Python](https://www.statlearning.com)