# Bradley-Terry Model as a Loss Function
The [original Bradley-Terry formula](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model#Inference) gives the log-likelihood to maximize:
$$\sum_{ij} [w_{ij} \ln(p_i) - w_{ij} \ln(p_i + p_j)]$$
Here $w_{ij}$ is the number of times player $i$ beat player $j$, and $p_i > 0$ is player $i$'s strength parameter. Our goal is to choose the $p_i$ so that the likelihood of the observed $w_{ij}$ is maximized, following standard [maximum likelihood estimation (MLE)](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation).
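To make the quantities concrete, here is a minimal sketch of evaluating this log-likelihood in Python with NumPy (the names `bt_log_likelihood`, `p`, and `W` are illustrative, not from any library):

```python
import numpy as np

def bt_log_likelihood(p, W):
    """Bradley-Terry log-likelihood.

    p: shape (n,), positive strength parameters p_i.
    W: shape (n, n), W[i, j] = number of times i beat j (diagonal zero).
    """
    total = 0.0
    n = len(p)
    for i in range(n):
        for j in range(n):
            if i != j:
                # w_ij * [ln(p_i) - ln(p_i + p_j)]
                total += W[i, j] * (np.log(p[i]) - np.log(p[i] + p[j]))
    return total
```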
## Loss Function
Converting this formula to a loss function (negating it, so that maximizing the log-likelihood becomes minimizing the loss):
$$\mathcal{L} = -\sum_{ij} [w_{ij} \ln(p_i) - w_{ij} \ln(p_i + p_j)]$$
Now we'll convert from $p_i$ to $\text{elo}_i$, given the relationship:
$$\text{elo}_i = \ln(p_i),\;\;\; p_i = e^{\text{elo}_i}$$
Substituting $p_i = e^{\text{elo}_i}$ and $p_j = e^{\text{elo}_j}$:
$$\mathcal{L} = -\sum_{ij} [w_{ij} \ln(e^{\text{elo}_i}) - w_{ij} \ln(e^{\text{elo}_i} + e^{\text{elo}_j})]$$
$$\mathcal{L} = -\sum_{ij} [w_{ij} \cdot \text{elo}_i - w_{ij} \ln(e^{\text{elo}_i} + e^{\text{elo}_j})]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} \ln(e^{\text{elo}_i} + e^{\text{elo}_j}) - w_{ij} \cdot \text{elo}_i]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} \ln(e^{\text{elo}_i}(1 + e^{\text{elo}_j - \text{elo}_i})) - w_{ij} \cdot \text{elo}_i]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} (\text{elo}_i + \ln(1 + e^{\text{elo}_j - \text{elo}_i})) - w_{ij} \cdot \text{elo}_i]$$
$$\mathcal{L} = \sum_{ij} [w_{ij} \ln(1 + e^{\text{elo}_j - \text{elo}_i})]$$
This is our final loss function.
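As a sanity check, the final formula translates directly into code; a minimal sketch in NumPy (`elo` and `W` are hypothetical names, and `np.logaddexp(0, x)` computes $\ln(1 + e^x)$ in a numerically stable way):

```python
import numpy as np

def bt_loss(elo, W):
    """L = sum_ij w_ij * ln(1 + exp(elo_j - elo_i)).

    elo: shape (n,), log-strengths elo_i = ln(p_i).
    W: shape (n, n), W[i, j] = w_ij, with a zero diagonal.
    """
    diff = elo[None, :] - elo[:, None]   # diff[i, j] = elo_j - elo_i
    return np.sum(W * np.logaddexp(0.0, diff))
```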
## Gradient of the Loss Function
Now let's calculate the gradient with respect to $\text{elo}_i$.
To find $\frac{\partial \mathcal{L}}{\partial \text{elo}_i}$, we need to consider:
1. When $\text{elo}_i$ appears in the first position ($i$)
2. When $\text{elo}_i$ appears in the second position ($j$)
For Case 1 ($\text{elo}_i$ in position $i$):
$$\frac{\partial}{\partial \text{elo}_i}[\ln(1 + e^{\text{elo}_j - \text{elo}_i})] = \frac{1}{1 + e^{\text{elo}_j - \text{elo}_i}} \cdot (-e^{\text{elo}_j - \text{elo}_i}) = -\frac{e^{\text{elo}_j - \text{elo}_i}}{1 + e^{\text{elo}_j - \text{elo}_i}}$$
For Case 2 ($\text{elo}_i$ in position $j$, i.e., the term contributed by the pair $(j, i)$):
$$\frac{\partial}{\partial \text{elo}_i}[\ln(1 + e^{\text{elo}_i - \text{elo}_j})] = \frac{1}{1 + e^{\text{elo}_i - \text{elo}_j}} \cdot e^{\text{elo}_i - \text{elo}_j} = \frac{e^{\text{elo}_i - \text{elo}_j}}{1 + e^{\text{elo}_i - \text{elo}_j}}$$
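Both derivatives are easy to confirm numerically; a small finite-difference sketch (the values and the helper name `term` are illustrative):

```python
import numpy as np

def term(x, y):
    """Per-pair loss term ln(1 + e^(y - x)), with x in the first slot."""
    return np.log1p(np.exp(y - x))

elo_i, elo_j, h = 0.3, -0.7, 1e-6

# Case 1: elo_i in the first position.
analytic1 = -np.exp(elo_j - elo_i) / (1.0 + np.exp(elo_j - elo_i))
numeric1 = (term(elo_i + h, elo_j) - term(elo_i - h, elo_j)) / (2 * h)
assert np.isclose(analytic1, numeric1)

# Case 2: elo_i in the second position.
analytic2 = np.exp(elo_i - elo_j) / (1.0 + np.exp(elo_i - elo_j))
numeric2 = (term(elo_j, elo_i + h) - term(elo_j, elo_i - h)) / (2 * h)
assert np.isclose(analytic2, numeric2)
```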
Combining both cases:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \frac{e^{\text{elo}_j - \text{elo}_i}}{1 + e^{\text{elo}_j - \text{elo}_i}}\right] + \sum_{j} \left[w_{ji} \cdot \frac{e^{\text{elo}_i - \text{elo}_j}}{1 + e^{\text{elo}_i - \text{elo}_j}}\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \frac{e^{\text{elo}_j - \text{elo}_i}}{1 + e^{\text{elo}_j - \text{elo}_i}} \cdot \frac{e^{\text{elo}_i - \text{elo}_j}}{e^{\text{elo}_i - \text{elo}_j}}\right] + \sum_{j} \left[w_{ji} \cdot \frac{e^{\text{elo}_i - \text{elo}_j}}{1 + e^{\text{elo}_i - \text{elo}_j}} \cdot \frac{e^{\text{elo}_j - \text{elo}_i}}{e^{\text{elo}_j - \text{elo}_i}}\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \frac{1}{e^{\text{elo}_i - \text{elo}_j} + 1}\right] + \sum_{j} \left[w_{ji} \cdot \frac{1}{e^{\text{elo}_j - \text{elo}_i} + 1}\right]$$
Using the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, we can rewrite:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right] + \sum_{j} \left[w_{ji} \cdot \sigma(\text{elo}_i-\text{elo}_j)\right]$$
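In code, this sigmoid form vectorizes cleanly; a sketch in NumPy (hypothetical names, with `W` holding the $w_{ij}$ and a zero diagonal as before):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_grad(elo, W):
    """dL/d elo_i = sum_j [-w_ij * sigma(elo_j - elo_i) + w_ji * sigma(elo_i - elo_j)]."""
    diff = elo[None, :] - elo[:, None]   # diff[i, j] = elo_j - elo_i
    s = sigmoid(diff)                    # s[i, j] = sigma(elo_j - elo_i)
    # w_ji = W.T[i, j], and sigma(elo_i - elo_j) = 1 - s[i, j]
    return np.sum(-W * s + W.T * (1.0 - s), axis=1)
```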
Noting the identity $\sigma(x) + \sigma(-x) = 1$:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right] + \sum_{j} \left[w_{ji} \cdot (1 - \sigma(\text{elo}_j-\text{elo}_i))\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[-w_{ij} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right] + \sum_{j} \left[w_{ji} - w_{ji} \cdot \sigma(\text{elo}_j-\text{elo}_i)\right]$$
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} \left[w_{ji} - (w_{ij} + w_{ji}) \cdot \sigma(\text{elo}_j-\text{elo}_i)\right]$$
If we have the constraint that $w_{ji} + w_{ij} = 1$ (i.e., each $w_{ij}$ is the fraction of games between $i$ and $j$ that $i$ won, rather than a raw count), then:
$$\frac{\partial \mathcal{L}}{\partial \text{elo}_i} = \sum_{j} [w_{ji} - \sigma(\text{elo}_j - \text{elo}_i)]$$
This resolves to a fairly elegant formula: the gradient of $\mathcal{L}$ with respect to $\text{elo}_i$ is the difference between the observed win rate $w_{ji}$ and the model's current predicted probability of $j$ beating $i$, summed over all $j$.
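Putting the pieces together, here is a minimal gradient-descent sketch under the $w_{ij} + w_{ji} = 1$ constraint (so `W[i, j]` holds $i$'s win fraction against $j$; the function name, learning rate, and step count are illustrative choices, not canonical ones):

```python
import numpy as np

def fit_elos(W, lr=0.1, steps=2000):
    """Fit Bradley-Terry log-strengths by gradient descent.

    W: shape (n, n) with W[i, j] + W[j, i] = 1 for i != j (win fractions).
    """
    n = W.shape[0]
    elo = np.zeros(n)
    for _ in range(steps):
        diff = elo[None, :] - elo[:, None]    # diff[i, j] = elo_j - elo_i
        sig = 1.0 / (1.0 + np.exp(-diff))     # sigma(elo_j - elo_i)
        np.fill_diagonal(sig, 0.0)            # exclude self-play terms
        grad = np.sum(W.T - sig, axis=1)      # grad_i = sum_j [w_ji - sigma(...)]
        elo -= lr * grad
        elo -= elo.mean()   # ratings are shift-invariant; pin the mean to 0
    return elo

# Example: W[i, j] = observed fraction of games player i won against player j.
W = np.array([[0.0, 0.7, 0.9],
              [0.3, 0.0, 0.6],
              [0.1, 0.4, 0.0]])
print(fit_elos(W))   # highest rating for player 0, lowest for player 2
```

The mean-centering step reflects that the loss depends only on rating differences, so the overall offset is arbitrary and must be pinned to get a unique solution.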