# Bradley-Terry Model as a Loss Function
The [original Bradley-Terry formula](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model#Inference) gives the log-likelihood to maximize:
$$\sum_{ij} [w_{ij} \ln(p_i) - w_{ij} \ln(p_i + p_j)]$$
Here $w_{ij}$ is the number of times player $i$ beat player $j$, and $p_i > 0$ is player $i$'s strength parameter. Our goal is to choose the $p_i$ so that the likelihood of the observed $w_{ij}$ is maximized, following standard [maximum likelihood estimation (MLE)](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation).
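To make the quantities concrete, here is a minimal sketch of evaluating this log-likelihood in Python with NumPy (the names `bt_log_likelihood`, `p`, and `W` are illustrative, not from any library):

```python
import numpy as np

def bt_log_likelihood(p, W):
    """Bradley-Terry log-likelihood.

    p: shape (n,), positive strength parameters p_i.
    W: shape (n, n), W[i, j] = number of times i beat j (diagonal zero).
    """
    total = 0.0
    n = len(p)
    for i in range(n):
        for j in range(n):
            if i != j:
                # w_ij * [ln(p_i) - ln(p_i + p_j)]
                total += W[i, j] * (np.log(p[i]) - np.log(p[i] + p[j]))
    return total
```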
## Loss Function
Converting this formula to a loss function (negating it, so that maximizing the log-likelihood becomes minimizing the loss):
$$\mathcal{L} = -\sum_{ij} [w_{ij} \ln(p_i) - w_{ij} \ln(p_i + p_j)]$$
Now we'll convert from $p_i$ to $\text{elo}_i$, given the relationship:
$$\text{elo}_i = \ln(p_i),\;\;\; p_i = e^{\text{elo}_i}$$
Substituting $p_i = e^{\text{elo}_i}$ and $p_j = e^{\text{elo}_j}$:
$$\mathcal{L} = -\sum_{ij} [w_{ij} \ln(e^{\text{elo}_i}) - w_{ij} \ln(e^{\text{elo}_i} + e^{\text{elo}_j})]$$
$$\mathcal{L} = -\sum_{ij} [w_{ij} \cdot \text{elo}_i - w_{ij} \ln(e^{\text{elo}_i} + e^{\text{elo}_j})]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} \ln(e^{\text{elo}_i} + e^{\text{elo}_j}) - w_{ij} \cdot \text{elo}_i]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} \ln(e^{\text{elo}_i}(1 + e^{\text{elo}_j - \text{elo}_i})) - w_{ij} \cdot \text{elo}_i]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} (\text{elo}_i + \ln(1 + e^{\text{elo}_j - \text{elo}_i})) - w_{ij} \cdot \text{elo}_i]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} \ln(1 + e^{\text{elo}_j - \text{elo}_i})]$$
This is our final loss function.
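As a sanity check, the final formula translates directly into code; a minimal sketch in NumPy (`elo` and `W` are hypothetical names, and `np.logaddexp(0, x)` computes $\ln(1 + e^x)$ in a numerically stable way):

```python
import numpy as np

def bt_loss(elo, W):
    """L = sum_ij w_ij * ln(1 + exp(elo_j - elo_i)).

    elo: shape (n,), log-strengths elo_i = ln(p_i).
    W: shape (n, n), W[i, j] = w_ij, with a zero diagonal.
    """
    diff = elo[None, :] - elo[:, None]   # diff[i, j] = elo_j - elo_i
    return np.sum(W * np.logaddexp(0.0, diff))
```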
## Gradient of the Loss Function
Now let's calculate the gradient with respect to $\text{elo}_i$.
To find $\frac{\partial \mathcal{L}}{\partial \text{elo}_i}$, we need to consider:
1. When $\text{elo}_i$ appears in the first position ($i$)
2. When $\text{elo}_i$ appears in the second position ($j$)
For Case 1 ($\text{elo}_i$ in position $i$):
$$\frac{\partial}{\partial \text{elo}_i}[\ln(1 + e^{\text{elo}_j - \text{elo}_i})] = \frac{1}{1 + e^{\text{elo}_j - \text{elo}_i}} \cdot (-e^{\text{elo}_j - \text{elo}_i}) = -\frac{e^{\text{elo}_j - \text{elo}_i}}{1 + e^{\text{elo}_j - \text{elo}_i}}$$
For Case 2 ($\text{elo}_i$ in position $j$, i.e., the term contributed by the pair $(j, i)$):
$$\frac{\partial}{\partial \text{elo}_i}[\ln(1 + e^{\text{elo}_i - \text{elo}_j})] = \frac{1}{1 + e^{\text{elo}_i - \text{elo}_j}} \cdot e^{\text{elo}_i - \text{elo}_j} = \frac{e^{\text{elo}_i - \text{elo}_j}}{1 + e^{\text{elo}_i - \text{elo}_j}}$$
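Both derivatives are easy to confirm numerically; a small finite-difference sketch (the values and the helper name `term` are illustrative):

```python
import numpy as np

def term(x, y):
    """Per-pair loss term ln(1 + e^(y - x)), with x in the first slot."""
    return np.log1p(np.exp(y - x))

elo_i, elo_j, h = 0.3, -0.7, 1e-6

# Case 1: elo_i in the first position.
analytic1 = -np.exp(elo_j - elo_i) / (1.0 + np.exp(elo_j - elo_i))
numeric1 = (term(elo_i + h, elo_j) - term(elo_i - h, elo_j)) / (2 * h)
assert np.isclose(analytic1, numeric1)

# Case 2: elo_i in the second position.
analytic2 = np.exp(elo_i - elo_j) / (1.0 + np.exp(elo_i - elo_j))
numeric2 = (term(elo_j, elo_i + h) - term(elo_j, elo_i - h)) / (2 * h)
assert np.isclose(analytic2, numeric2)
```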
Combining both cases:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \frac{e^{\text{elo}_j - \text{elo}_i}}{1 + e^{\text{elo}_j - \text{elo}_i}}\right] + \sum_{j} \left[w_{ji} \cdot \frac{e^{\text{elo}_i - \text{elo}_j}}{1 + e^{\text{elo}_i - \text{elo}_j}}\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \frac{e^{\text{elo}_j - \text{elo}_i}}{1 + e^{\text{elo}_j - \text{elo}_i}} \cdot \frac{e^{\text{elo}_i - \text{elo}_j}}{e^{\text{elo}_i - \text{elo}_j}}\right] + \sum_{j} \left[w_{ji} \cdot \frac{e^{\text{elo}_i - \text{elo}_j}}{1 + e^{\text{elo}_i - \text{elo}_j}} \cdot \frac{e^{\text{elo}_j - \text{elo}_i}}{e^{\text{elo}_j - \text{elo}_i}}\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \frac{1}{e^{\text{elo}_i - \text{elo}_j} + 1}\right] + \sum_{j} \left[w_{ji} \cdot \frac{1}{e^{\text{elo}_j - \text{elo}_i} + 1}\right]$$
Using the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, we can rewrite:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right] + \sum_{j} \left[w_{ji} \cdot \sigma(\text{elo}_i-\text{elo}_j)\right]$$
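In code, this sigmoid form vectorizes cleanly; a sketch in NumPy (hypothetical names, with `W` holding the $w_{ij}$ and a zero diagonal as before):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_grad(elo, W):
    """dL/d elo_i = sum_j [-w_ij * sigma(elo_j - elo_i) + w_ji * sigma(elo_i - elo_j)]."""
    diff = elo[None, :] - elo[:, None]   # diff[i, j] = elo_j - elo_i
    s = sigmoid(diff)                    # s[i, j] = sigma(elo_j - elo_i)
    # w_ji = W.T[i, j], and sigma(elo_i - elo_j) = 1 - s[i, j]
    return np.sum(-W * s + W.T * (1.0 - s), axis=1)
```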
Noting the identity $\sigma(x) + \sigma(-x) = 1$:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right] + \sum_{j} \left[w_{ji} \cdot (1 - \sigma(\text{elo}_j-\text{elo}_i))\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right] + \sum_{j} \left[w_{ji} - w_{ji} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[w_{ji} - (w_{ij} + w_{ji}) \cdot \sigma(\text{elo}_j-\text{elo}_i)\right]$$
If we have the constraint that $w_{ji} + w_{ij} = 1$ (i.e., each $w_{ij}$ is the fraction of games between $i$ and $j$ that $i$ won, rather than a raw count), then:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} [w_{ji} - \sigma(\text{elo}_j - \text{elo}_i)]$$
This resolves to a fairly elegant formula: the gradient of $\mathcal{L}$ with respect to $\text{elo}_i$ is the difference between the observed win rate $w_{ji}$ and the model's current predicted probability of $j$ beating $i$, summed over all $j$.
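Putting the pieces together, here is a minimal gradient-descent sketch under the $w_{ij} + w_{ji} = 1$ constraint (so `W[i, j]` holds $i$'s win fraction against $j$; the function name, learning rate, and step count are illustrative choices, not canonical ones):

```python
import numpy as np

def fit_elos(W, lr=0.1, steps=2000):
    """Fit Bradley-Terry log-strengths by gradient descent.

    W: shape (n, n) with W[i, j] + W[j, i] = 1 for i != j (win fractions).
    """
    n = W.shape[0]
    elo = np.zeros(n)
    for _ in range(steps):
        diff = elo[None, :] - elo[:, None]    # diff[i, j] = elo_j - elo_i
        sig = 1.0 / (1.0 + np.exp(-diff))     # sigma(elo_j - elo_i)
        np.fill_diagonal(sig, 0.0)            # exclude self-play terms
        grad = np.sum(W.T - sig, axis=1)      # grad_i = sum_j [w_ji - sigma(...)]
        elo -= lr * grad
        elo -= elo.mean()   # ratings are shift-invariant; pin the mean to 0
    return elo

# Example: W[i, j] = observed fraction of games player i won against player j.
W = np.array([[0.0, 0.7, 0.9],
              [0.3, 0.0, 0.6],
              [0.1, 0.4, 0.0]])
print(fit_elos(W))   # highest rating for player 0, lowest for player 2
```

The mean-centering step reflects that the loss depends only on rating differences, so the overall offset is arbitrary and must be pinned to get a unique solution.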