The model's predicted score for the $i$-th data instance is the sum of the outputs of $K$ regression trees. Each tree $f_k$ assigns a leaf score to the data instance, and the sum of all these tree scores constitutes the model's prediction, $\hat{y}_i$.
$$
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)
$$
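As a concrete illustration, here is a minimal sketch of this summation. The three lambda "trees" below are hypothetical stand-ins for trained regression trees, each mapping a feature vector to a single leaf score.

```python
import numpy as np

# Hypothetical stand-ins for K = 3 trained regression trees:
# each f_k maps a feature vector x_i to one leaf score.
trees = [
    lambda x: 0.3 * x[0],                    # f_1
    lambda x: -0.1 * x[1] + 0.5,             # f_2
    lambda x: 0.2 if x[0] > 1.0 else -0.2,   # f_3
]

def predict(x):
    """y_hat_i = sum_{k=1}^{K} f_k(x_i)."""
    return sum(f(x) for f in trees)

x_i = np.array([1.5, 2.0])
print(predict(x_i))  # 0.45 + 0.3 + 0.2 = 0.95
```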
---
### I. Objective Function
**Let's start with the Objective Function. This is the total value we want to minimize.**
**It consists of two parts:**
**The first part, the Training Loss (also known as the loss function), measures how much the prediction differs from the true answer.**
**The second part, the Regularization term, controls model complexity and prevents overfitting.**
**In this formula, $y_i$ is the true value and $\hat{y}_i$ is the predicted value.**
**The term $l$ represents the method we use to calculate the loss, such as Squared Error or Log Loss.**
**The term $f_t$ represents the new tree we are currently training at this step.**
This is the Objective Function we aim to minimize. It consists of the Training Loss and the Regularization term.
$$
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)
$$
* $y_i$: The **true value** of the $i$-th data instance.
* $\hat{y}_i$: The **model's predicted value** for the $i$-th data instance.
* $l(\cdot)$: The method for calculating the loss (see the sketch after this list), for example:
* Squared Error $(y_i - \hat{y}_i)^2$
* Log Loss
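As a quick sketch of these two loss choices (all numbers below are made up), the training loss is the per-instance loss summed over the data, and the full objective adds the regularization term described in the next section:

```python
import numpy as np

def squared_error(y, y_hat):
    # l(y_i, y_hat_i) = (y_i - y_hat_i)^2
    return (y - y_hat) ** 2

def log_loss(y, p):
    # l(y_i, p_i) = -[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y     = np.array([1.0, 0.0, 1.0])   # true values
y_hat = np.array([0.8, 0.3, 0.6])   # predicted values

training_loss = squared_error(y, y_hat).sum()   # or log_loss(y, y_hat).sum()
omega = 0.0   # regularization of the new tree; see Section II
objective = training_loss + omega
print(training_loss, objective)
```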
---
### II. Regularization Term
**"Regularization Term Also called Penalty Term is given by Omega of f k, equals, Gamma T, plus, one half Lambda, times the sum of w squared."**
**which penalizes the number of leaves and the magnitude of leaf weights.**
**the first term $\gamma$T penalizes the number of terminal nodes (節點)(model complexity)**
**the second term penalizes large leaf weights so as(overfitting)**
**T 是葉子數量,w 是葉子權重,$\gamma$ 控制增加葉子的成本,$\lambda$ 控制葉子分數的大小**
This is the Regularization (Penalty Term), used to constrain the model's complexity and prevent **overfitting**.
$\Omega(f_k)$ measures the complexity of the $k$-th tree (a small computational sketch follows the table below).
$$
\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2
$$
| Symbol | Description |
| :--- | :--- |
| $T$ | Number of leaves in the tree. |
| $w_j$ | Score of the leaf (leaf weight). |
| $\gamma$ | Controls the **cost of adding one more leaf**. |
| $\lambda$ | Controls the magnitude of the leaf scores ($L_2$ regularization). |
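As referenced above, here is a small sketch that computes $\Omega(f_k)$ directly from the symbols in the table (the leaf weights and hyperparameter values are made up):

```python
def omega(leaf_weights, gamma, lam):
    """Omega(f_k) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    T = len(leaf_weights)  # number of leaves
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

# A tree with 3 leaves and weights w_j:
print(omega([0.5, -0.2, 0.1], gamma=1.0, lam=1.0))
# = 1.0 * 3 + 0.5 * 1.0 * (0.25 + 0.04 + 0.01) = 3.15
```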
---
### III. Additive Training
**Next, since XGBoost trains tree by tree, we call this Additive Training.**
**Please look at this formula. y-hat t represents the prediction at step t.**
**It equals the prediction from the previous step, plus the new function f-t.**
**So our goal is to find the best f-t.**
Model optimization is most commonly carried out through gradients. Since we cannot train all the trees at once, we use **additive training**: at each step a new tree is added to correct the errors of the model built in the previous steps (a small sketch follows the symbol definitions below):
$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)
$$
* $\hat{y}_i^{(t)}$: The prediction for the $i$-th instance after the $t$-th training round.
* $\hat{y}_i^{(t-1)}$: The model's prediction for the same instance from the previous round (round $t-1$).
* $f_t(x_i)$: The score (leaf score) produced by the **new $t$-th tree** for the $i$-th data instance.
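Here is the sketch mentioned above of the additive update itself. The tree-fitting step is a hypothetical single-leaf stand-in (its leaf weight is the mean residual, which is what squared-error boosting would choose with $\lambda = 0$); the point is only the update $\hat{y}^{(t)} = \hat{y}^{(t-1)} + f_t(x)$.

```python
import numpy as np

def fit_stump(X, y, y_hat):
    """Hypothetical stand-in for tree fitting: a single-leaf 'tree' whose
    leaf weight is the mean residual on the current base prediction."""
    w = np.mean(y - y_hat)
    return lambda X: np.full(len(X), w)

def boost(X, y, n_rounds, fit_new_tree):
    y_hat = np.zeros(len(y))                  # y_hat^(0)
    for t in range(1, n_rounds + 1):
        f_t = fit_new_tree(X, y, y_hat)       # train the t-th tree on the current base
        y_hat = y_hat + f_t(X)                # y_hat^(t) = y_hat^(t-1) + f_t(x_i)
    return y_hat

X = np.arange(4).reshape(-1, 1)
y = np.array([1.0, 2.0, 3.0, 4.0])
print(boost(X, y, n_rounds=3, fit_new_tree=fit_stump))
```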
---
### IV. Loss Function Expansion and Optimization
**To optimize the loss function, we use the second-order Taylor Expansion.**
Expanding the Objective Function to focus on the $t$-th step:
$$
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)
$$

Then, using the **second-order Taylor Expansion**:
$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
$$
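This is the ordinary second-order Taylor expansion of $l$ around the current prediction, with $f_t(x_i)$ playing the role of the small increment:
$$
l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) \approx l(y_i, \hat{y}_i^{(t-1)}) + \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}} f_t(x_i) + \frac{1}{2} \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2} f_t^2(x_i)
$$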
---
**Here, we define small g and small h as the first and second derivatives of the loss function.**
Where $g_i$ and $h_i$ are defined as:
$$
g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}} \quad \text{and} \quad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}
$$
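For the two losses mentioned earlier, these derivatives have simple closed forms. The sketch below assumes the common conventions $l = \frac{1}{2}(y_i - \hat{y}_i)^2$ for squared error (with $(y_i - \hat{y}_i)^2$ the values are simply doubled) and, for log loss, that $\hat{y}_i$ is a raw score passed through a sigmoid:

```python
import numpy as np

def grad_hess_squared_error(y, y_hat):
    # l = 0.5 * (y - y_hat)^2  ->  g = y_hat - y,  h = 1
    return y_hat - y, np.ones_like(y_hat)

def grad_hess_log_loss(y, y_hat):
    # l = -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(y_hat)
    # ->  g = p - y,  h = p * (1 - p)
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

y     = np.array([1.0, 0.0])
y_hat = np.array([0.4, -1.2])
print(grad_hess_squared_error(y, y_hat))
print(grad_hess_log_loss(y, y_hat))
```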
**Since the first term (the previous loss) is a constant, we can remove it to get a simplified objective function.**
Removing the constant term $l(y_i, \hat{y}_i^{(t-1)})$ yields the simplified objective function:
$$
\mathcal{\tilde{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
$$
---
**Many of you might be wondering how this leads to the following formula.**
**Here is how it works: we substitute the Regularization term Omega into the equation and reorganize it by grouping the data points into leaves.**
**Before we reorganize the equation, there is a key concept to understand.**
**For any decision tree, when we input a sample $x_i$, it eventually falls into a specific leaf node $j$.**
**Each leaf node has a specific score, which we call $w_j$.**
**So the prediction for this sample, $f_t(x_i)$, is simply equal to the weight of that leaf, $w_j$.**
**That is why we can replace $f_t(x_i)$ with $w_j$ in the formula.**

By reorganizing the previous equation and substituting $\Omega(f_t)$, we get:
$$
\mathcal{\tilde{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T
$$
**We define Big G_j as the sum of the gradients $g_i$, and Big H_j as the sum of the second derivatives $h_i$, over all instances in leaf j.**
Where $G_j$ and $H_j$ are defined as:
$$
G_j = \sum_{i \in I_j} g_i \quad \text{and} \quad H_j = \sum_{i \in I_j} h_i
$$
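A small sketch of this per-leaf bookkeeping (the leaf assignments, gradients, and hessians below are made up): for each leaf $j$, $G_j$ and $H_j$ are just the sums of $g_i$ and $h_i$ over the instances $I_j$ that fall into that leaf.

```python
import numpy as np

def leaf_sums(leaf_index, g, h, n_leaves):
    """G_j = sum of g_i over I_j,  H_j = sum of h_i over I_j."""
    G = np.zeros(n_leaves)
    H = np.zeros(n_leaves)
    for j, gi, hi in zip(leaf_index, g, h):
        G[j] += gi
        H[j] += hi
    return G, H

# 5 instances mapped to 2 leaves by the tree structure q(x_i):
leaf_index = np.array([0, 1, 0, 1, 1])
g = np.array([0.2, -0.5, 0.1, 0.4, -0.3])
h = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
G, H = leaf_sums(leaf_index, g, h, n_leaves=2)
print(G)  # approx. [0.3, -0.4]
print(H)  # [2., 3.]
```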
---

**We take the partial derivative of the simplified objective with respect to $w_j$.**

**Since we want to find the minimum loss, we set the derivative to zero: $\frac{\partial \mathcal{\tilde{L}}^{(t)}}{\partial w_j} = G_j + (H_j + \lambda) w_j = 0$.**

**Solving gives the optimal leaf weight $w_j^*$; substituting it back into the objective yields the Structure Score formula at the bottom.**
$$
w_j^* = -\frac{G_j}{H_j + \lambda}
$$
The minimum **Loss Function (Structure Score)** is:
$$
\mathcal{\tilde{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
$$
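Continuing the sketch, both the optimal leaf weights and the structure score follow directly from the per-leaf sums (the $G$, $H$, $\lambda$, $\gamma$ values below are made up):

```python
import numpy as np

def optimal_weights(G, H, lam):
    # w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def structure_score(G, H, lam, gamma):
    # L~(q) = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    T = len(G)
    return -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T

G = np.array([0.3, -0.4])
H = np.array([2.0, 3.0])
print(optimal_weights(G, H, lam=1.0))             # [-0.1  0.1]
print(structure_score(G, H, lam=1.0, gamma=0.5))  # -0.5*(0.03 + 0.04) + 0.5*2 = 0.965
```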
---
The above equation is used to **evaluate the quality of a tree structure (the Structure Score)**, and it is the basis for the split **Gain** used when growing the tree.
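In practice the structure score is applied through the split Gain: a candidate split of one leaf into a left and a right child is scored by how much it lowers the structure score, with $\gamma$ charged for the one extra leaf the split creates. A sketch of that standard gain computation:

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain = 0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
                     - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma"""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Splits with positive gain improve the objective; gamma acts as the
# minimum improvement required to justify the extra leaf.
print(split_gain(G_L=0.3, H_L=2.0, G_R=-0.4, H_R=3.0, lam=1.0, gamma=0.0))
```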
