The model's predicted score for the $i$-th data instance is the sum of the outputs of $K$ regression trees. Each tree $f_k$ assigns a leaf score to the data instance, and the sum of these scores is the model's prediction $\hat{y}_i$.

$$
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)
$$

---

### I. Objective Function

**Let's start with the Objective Function. This is the total value we want to minimize.**

**It consists of two parts. The first part is the Training Loss, also known as the loss function, which measures how much the prediction differs from the true answer.**

**The second part is the Regularization term, which controls model complexity and prevents overfitting.**

**In this formula, $y_i$ is the true value and $\hat{y}_i$ is the predicted value.**

**The term $l$ represents the method we use to calculate the loss, such as Squared Error or Log Loss.**

**In this term, $f_t$ represents the new tree we are currently training in this step.**

This is the Objective Function we aim to minimize. It consists of the Training Loss and the Regularization term:

$$
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)
$$

* $y_i$: The **true value** of the $i$-th data instance.
* $\hat{y}_i$: The **model's predicted value** for the $i$-th data instance.
* $l(\cdot)$: The method for calculating the loss, for example:
  * Squared Error $(y_i - \hat{y}_i)^2$
  * Log Loss

---

### II. Regularization Term

**The Regularization Term, also called the Penalty Term, is given by Omega of $f_k$ equals Gamma T, plus one half Lambda times the sum of $w$ squared.**

**It penalizes the number of leaves and the magnitude of the leaf weights.**

**The first term $\gamma T$ penalizes the number of terminal nodes (model complexity).**

**The second term penalizes large leaf weights so as to prevent overfitting.**

**$T$ is the number of leaves, $w$ is the leaf weight, $\gamma$ controls the cost of adding a leaf, and $\lambda$ controls the magnitude of the leaf scores.**

This is the Regularization (Penalty) term, used to constrain the model's complexity and prevent **overfitting**. $\Omega(f_k)$ measures the complexity of the $k$-th tree.

$$
\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

| Symbol | Description |
| :--- | :--- |
| $T$ | Number of leaves in the tree. |
| $w_j$ | Score of the $j$-th leaf (leaf weight). |
| $\gamma$ | Controls the **cost of adding one more leaf**. |
| $\lambda$ | Controls the magnitude of the leaf scores ($L_2$ regularization). |

---

### III. Additive Training

**Next, since XGBoost trains tree by tree, we call this Additive Training.**

**Please look at this formula. $\hat{y}_i^{(t)}$ (y-hat t) represents the prediction at step $t$.**

**It equals the prediction from the previous step, plus the new function $f_t$.**

**So our goal is to find the best $f_t$.**

Model optimization is most commonly carried out through gradients, but we cannot train all of the trees at once. Instead, we use **additive training**: at each step a new tree is added to correct the errors of the model built in the previous steps:

$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)
$$

* $\hat{y}_i^{(t)}$: The prediction for the $i$-th instance after the $t$-th training round.
* $\hat{y}_i^{(t-1)}$: The model's prediction for the instance from the previous round (round $t-1$).
* $f_t(x_i)$: The score (leaf score) produced by the **new $t$-th tree** for the $i$-th data instance.
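To make the notation concrete, here is a minimal NumPy sketch of the two ingredients above: the ensemble prediction built up as a running sum of tree outputs (additive training), and the penalty $\Omega(f)$ computed from a tree's leaf weights. This is an illustrative assumption, not XGBoost's actual implementation; the helper names `omega` and `predict`, and the idea of representing each tree as a callable, are my own.

```python
import numpy as np

def omega(leaf_weights, gamma, lam):
    """Regularization term: gamma * T + 0.5 * lambda * sum(w_j^2).

    `leaf_weights` holds the leaf scores w_j of one tree;
    its length is the number of leaves T.
    """
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * np.sum(np.square(leaf_weights))

def predict(trees, X):
    """Ensemble prediction: y_hat_i = sum_k f_k(x_i).

    Each element of `trees` is assumed to be a callable f_k that maps
    a batch of instances X to an array of leaf scores.
    """
    y_hat = np.zeros(len(X))
    for f_k in trees:
        y_hat += f_k(X)  # additive training: add one tree's scores at a time
    return y_hat

# Toy usage: two "trees" represented as simple decision stumps on X[:, 0].
X = np.array([[0.2], [0.8], [0.5]])
stump_1 = lambda X: np.where(X[:, 0] > 0.5, 1.0, -1.0)
stump_2 = lambda X: np.where(X[:, 0] > 0.3, 0.5, -0.5)
y_hat = predict([stump_1, stump_2], X)                       # f_1(x_i) + f_2(x_i)
penalty = omega(np.array([1.0, -1.0]), gamma=0.1, lam=1.0)   # Omega for a 2-leaf tree
```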
---

### IV. Loss Function Expansion and Optimization

**To optimize the loss function, we use the second-order Taylor Expansion.**

Expanding the Objective Function to focus on the $t$-th step:

$$
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)
$$

![image](https://hackmd.io/_uploads/H1KT8QsbWx.png)

Then, using the **second-order Taylor Expansion**:

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
$$

**Here, we define small $g$ and small $h$ as the first and second derivatives of the loss function with respect to the previous prediction.**

Where $g_i$ and $h_i$ are defined as:

$$
g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}} \quad \text{and} \quad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}
$$

**Since the first term (the previous loss) is a constant, we can remove it to get a simplified objective function.**

Removing the constant term $l(y_i, \hat{y}_i^{(t-1)})$ yields the simplified objective function:

$$
\mathcal{\tilde{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
$$

---

**Many of you might be wondering how this leads to the following formula.**

**OK, this is how it is done: we substitute the Regularization term Omega into the equation and reorganize it by grouping the data points into leaves.**

**Before we reorganize the equation, there is a key concept to understand.**

**For any decision tree, when we input a sample $x_i$, it will eventually fall into a specific leaf node $j$.**

**Each leaf node has a specific score, which we call $w_j$.**

**So, the prediction for this sample, $f_t(x_i)$, is simply equal to the weight of that leaf, $w_j$.**

**That is why we can replace $f_t(x_i)$ with $w_j$ in the formula.**

![image](https://hackmd.io/_uploads/SyBdX-5-Wl.png)

By reorganizing the previous equation and substituting $\Omega(f_t)$, we get:

$$
\mathcal{\tilde{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T
$$

**We define big $G_j$ and big $H_j$ as the sums of the first and second derivatives over all instances in leaf $j$.**

Where $G_j$ and $H_j$ are defined as:

$$
G_j = \sum_{i \in I_j} g_i \quad \text{and} \quad H_j = \sum_{i \in I_j} h_i
$$

so the objective becomes

$$
\mathcal{\tilde{L}}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left(H_j + \lambda\right) w_j^2 \right] + \gamma T
$$

---

![image](https://hackmd.io/_uploads/ByqR1EoZZg.png)

**We take the partial derivative of the objective with respect to $w_j$.**

![image](https://hackmd.io/_uploads/Sk6TyEiZWx.png)

**Since we want to find the minimum loss, we set the derivative to zero.**

![image](https://hackmd.io/_uploads/r1XDeVs-Ze.png)

Solving for $w_j$ gives the optimal leaf weight:

$$
w_j^* = -\frac{G_j}{H_j + \lambda}
$$

**Substituting this back, we get the Structure Score formula at the bottom.**

The minimum **Loss Function (Structure Score)** is:

$$
\mathcal{\tilde{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
$$

---

The above equation is used to **evaluate the quality of a tree structure (Structure Score / Gain)**.

![image](https://hackmd.io/_uploads/S1U-zvq1bl.png)
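As a concrete check of the derivation, here is a small NumPy sketch (an illustrative assumption, not XGBoost's source code) that computes $g_i$ and $h_i$ for squared-error loss, aggregates them into $G_j$ and $H_j$ per leaf, and evaluates the optimal leaf weights $w_j^*$ and the structure score for a fixed tree structure $q$. The helper names `grad_hess_squared_error` and `leaf_weights_and_score`, and the `leaf_index` encoding of the structure, are my own.

```python
import numpy as np

def grad_hess_squared_error(y, y_hat_prev):
    """g_i and h_i for squared-error loss l = (y_i - y_hat_i)^2,
    differentiated with respect to the previous-round prediction."""
    g = 2.0 * (y_hat_prev - y)   # first derivative
    h = 2.0 * np.ones_like(y)    # second derivative (constant for squared error)
    return g, h

def leaf_weights_and_score(g, h, leaf_index, n_leaves, lam, gamma):
    """Optimal leaf weights w_j* and the structure score for a fixed tree
    structure, where leaf_index[i] = j means instance i falls into leaf j."""
    w_star = np.empty(n_leaves)
    score = 0.0
    for j in range(n_leaves):
        in_leaf = (leaf_index == j)
        G_j = g[in_leaf].sum()            # G_j = sum of g_i over leaf j
        H_j = h[in_leaf].sum()            # H_j = sum of h_i over leaf j
        w_star[j] = -G_j / (H_j + lam)    # w_j* = -G_j / (H_j + lambda)
        score += -0.5 * G_j**2 / (H_j + lam)
    return w_star, score + gamma * n_leaves

# Toy check: 4 instances assigned to 2 leaves by some fixed tree structure.
y = np.array([1.0, 1.2, -0.5, -0.7])
y_hat_prev = np.zeros(4)                  # predictions from the previous round
leaf_index = np.array([0, 0, 1, 1])       # leaf that each instance falls into

g, h = grad_hess_squared_error(y, y_hat_prev)
w_star, structure_score = leaf_weights_and_score(
    g, h, leaf_index, n_leaves=2, lam=1.0, gamma=0.1)
```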