# Better algorithm for calculating ELOs
Let's say you have a matrix $w$ where $w_{ij}$ is the known probability that player $i$ beats player $j$. A common way to turn this into per-player scores is ELO: assign each player a score such that $w_{ij} \approx \sigma(elo_i - elo_j)$, where $\sigma$ is the [sigmoid](https://en.wikipedia.org/wiki/Logistic_function) function. A formal explanation is given [here](https://hackmd.io/0aXg6seERL6uju31T0EYuA).
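As a quick illustration of what the model predicts (a minimal sketch I added; the function name is mine, and these are the natural-scale ELOs used throughout this post, not the chess 400-point scale):
```python
import numpy as np

def predicted_win_probability(elo_i: float, elo_j: float) -> float:
    """Model's predicted probability that player i beats player j."""
    return 1.0 / (1.0 + np.exp(-(elo_i - elo_j)))

# An ELO advantage of ln(3) ~= 1.1 corresponds to a 75% win probability,
# since sigmoid(ln 3) = 0.75.
print(predicted_win_probability(1.1, 0.0))  # ~0.75
```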
The [Wikipedia page for the Bradley–Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) (the model underlying these ELOs) presents the following iteration as the state-of-the-art algorithm for computing the scores $\vec{p}$, where $elo_i = \ln p_i$:
$$
p_i' = \frac{\sum_j w_{ij}\frac{p_j}{p_i + p_j}}{\sum_j w_{ji}\frac{1}{p_i + p_j}}
$$

It definitely converges fast, but when actually experimenting with it, there are some surprising pitfalls:
- You _must_ set $w_{ii} = \frac{1}{2} \; \forall i$ (more precisely, any positive constant will do).
- Iterating on $p_i$ directly is numerically unstable. E.g. if $p_i$ gets close to $0$, you get a divide-by-zero error or the next iterate explodes, and once a $p_i$ explodes, the remaining terms can no longer be computed accurately.

The first bullet point is easy to fix, but the second one requires some algebra.
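To make the second pitfall concrete, here is a minimal sketch (my own, not code from the Wikipedia article) of one step of the iteration above, operating directly on $\vec{p}$; the divisions here are where accuracy is lost once components of `p` drift toward $0$ or blow up:
```python
import numpy as np

def naive_bradley_terry_step(w: np.ndarray, p: np.ndarray) -> np.ndarray:
    """One step of the iteration above, computed directly on p."""
    N = len(p)
    p_next = np.empty(N)
    for i in range(N):
        pair_sums = p[i] + p  # p_i + p_j for every j
        numerator = np.sum(w[i, :] * p / pair_sums)
        denominator = np.sum(w[:, i] / pair_sums)
        p_next[i] = numerator / denominator
    return p_next
```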
## Numerically stable iteration on $\vec{e}$
For a given step $s$ in the iteration process, let's call $\vec{p}^{(s)}$ the parameter vector at that step, and $\vec{e}^{(s)} = \ln(\vec{p}^{(s)})$ the ELO vector. Ideally, we iterate over the $\vec{e}$ vector instead of the $\vec{p}$ vector, since $\vec{p}$ becomes unstable when a component is close to $0$ or $\infty$, but $\vec{e}$ will generally stay within $(-10, 10)$ and distances between ELOs are resilient to noise.
First we substitute $\vec{p} = e^{\vec{e}}$ to get an iterative formula on $\vec{e}$,
$$
\vec{p}_i^{(s+1)} = \frac{\sum_j w_{ij}\frac{\vec{p}_j^{(s)}}{\vec{p}_i^{(s)} + \vec{p}_j^{(s)}}}{\sum_j w_{ji}\frac{1}{\vec{p}_i^{(s)} + \vec{p}_j^{(s)}}}, \;\;\;\vec{p}^{(0)} = \vec{1}
$$
$$
e^{\vec{e}_i^{(s+1)}} = \frac{\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}{\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}, \;\;\;\vec{e}^{(0)} = \vec{0}
$$
$$
\vec{e}_i^{(s+1)} = \ln\left(\frac{\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}{\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}\right), \;\;\;\vec{e}^{(0)} = \vec{0}
$$
Now we can simplify the expression algebraically. First, split into separate logarithms:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}\right)$$
Factor out $e^{\vec{e}_i^{(s)}}$ from denominators:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}}(1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}})}\right) - \ln\left(\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}}(1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}})}\right)$$
Simplify the fractions in both sums:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{e^{-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Extract the $e^{-\vec{e}_i^{(s)}}$ factor from the second term:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(e^{-\vec{e}_i^{(s)}}\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Split the logarithm:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln(e^{-\vec{e}_i^{(s)}}) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Simplify $\ln(e^{-\vec{e}_i^{(s)}})$:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - (-\vec{e}_i^{(s)}) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Simplify:
$$\vec{e}_i^{(s+1)} = \vec{e}_i^{(s)} + \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Recognize that $\frac{e^x}{1+e^x} = \sigma(x)$ and $\frac{1}{1+e^x} = \sigma(-x)$:
$$\vec{e}_i^{(s+1)} = \vec{e}_i^{(s)} + \ln\left(\sum_j w_{ij}\sigma(\vec{e}_j^{(s)}-\vec{e}_i^{(s)})\right) - \ln\left(\sum_j w_{ji}\sigma(\vec{e}_i^{(s)}-\vec{e}_j^{(s)})\right)$$
And just like that we have a numerically stable algorithm that's easy to compute. To match the original algorithm's normalization that $\prod_i\vec{p}_i = 1$, we normalize the ELOs to a mean of $0$ by setting $\vec{e}^{(s+1)} \leftarrow \vec{e}^{(s+1)} - \frac{\vec{e}^{(s+1)} \cdot \vec{1}}{\dim{\vec{e}}}$.
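As a quick sanity check of this update (a worked example added here, not part of the original derivation), take two players with $w_{12} = 0.75$, $w_{21} = 0.25$, and $w_{11} = w_{22} = \frac{1}{2}$. The expected fixed point is $\vec{e}_1 - \vec{e}_2 = \ln 3$, since $\sigma(\ln 3) = 0.75 = w_{12}$. Plugging the mean-zero vector $\vec{e} = \left(\frac{\ln 3}{2}, -\frac{\ln 3}{2}\right)$ into the update for $i = 1$:
$$
\vec{e}_1^{(s+1)} = \vec{e}_1^{(s)} + \ln\!\left(\tfrac{1}{2}\sigma(0) + \tfrac{3}{4}\sigma(-\ln 3)\right) - \ln\!\left(\tfrac{1}{2}\sigma(0) + \tfrac{1}{4}\sigma(\ln 3)\right) = \vec{e}_1^{(s)} + \ln(0.4375) - \ln(0.4375) = \vec{e}_1^{(s)},
$$
so it is indeed a fixed point (and by symmetry the same holds for $i = 2$).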
## Matmuls
Of course, this formula must be computed for every single component of $\vec{e}$. Ideally, we express one full update step using matrix operations.
Let's define:
- $D^{(s)}$ as the matrix where $D_{ij}^{(s)} = \vec{e}_j^{(s)} - \vec{e}_i^{(s)}$
- $S^{(s)} = \sigma(D^{(s)})$ applied element-wise
Then:
- The first sum becomes: $\sum_j w_{ij}\sigma(\vec{e}_j^{(s)}-\vec{e}_i^{(s)}) = (W \odot S^{(s)})\vec{1}$
- The second sum becomes: $\sum_j w_{ji}\sigma(\vec{e}_i^{(s)}-\vec{e}_j^{(s)}) = (W^T \odot (S^{(s)})^T)\vec{1}$
Therefore, the matrix formula can be written as:
$$D^{(s)} = \mathbf{1}(\vec{e}^{(s)})^T - \vec{e}^{(s)}\mathbf{1}^T$$
$$S^{(s)} = \sigma(D^{(s)})$$
$$\vec{e}^{(s+1)} = \vec{e}^{(s)} + \ln\left((W \odot S^{(s)})\vec{1}\right) - \ln\left((W^T \odot (S^{(s)})^T)\vec{1}\right)$$
where:
- $\odot$ is the Hadamard (element-wise) product
- $\vec{1}$ is a vector of ones
- $\ln$ and $\sigma$ are applied element-wise
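One small observation (mine, not from the derivation above): $\left((W^T \odot (S^{(s)})^T)\vec{1}\right)_i = \sum_j w_{ji} S^{(s)}_{ji}$ is just the $i$-th column sum of $W \odot S^{(s)}$, so the Hadamard product only needs to be formed once per step. A minimal NumPy sketch of one vectorized update, with an arbitrary placeholder $w$ used purely to check the identity:
```python
import numpy as np

rng = np.random.default_rng(0)
elos = rng.normal(size=5)
w = rng.uniform(0.1, 0.9, size=(5, 5))  # arbitrary placeholder matrix

D = elos.reshape(1, -1) - elos.reshape(-1, 1)  # D[i, j] = e_j - e_i
S = 1.0 / (1.0 + np.exp(-D))                   # sigmoid, element-wise

WS = w * S
first = WS @ np.ones(5)            # (W hadamard S) times 1, i.e. row sums
second = (w.T * S.T) @ np.ones(5)  # (W^T hadamard S^T) times 1
assert np.allclose(second, WS.sum(axis=0))  # ... = column sums of W hadamard S

e_next = elos + np.log(first) - np.log(second)
e_next -= e_next.mean()
```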
## NumPy Code
In NumPy, this can be written as,
```python=
import numpy as np


def elos_loss(
    w: np.ndarray,
    elos: np.ndarray,
) -> float:
    N = len(elos)
    elos_col = elos.reshape(-1, 1)
    elos_row = elos.reshape(1, -1)
    # Stable computation of log(exp(elos_i) + exp(elos_j))
    max_elos = np.maximum(elos_col, elos_row)
    log_pairwise_sums = max_elos + np.log(np.exp(elos_col - max_elos) + np.exp(elos_row - max_elos))
    # Calculate elos_i - log(exp(elos_i) + exp(elos_j)), i.e. log sigmoid(elos_i - elos_j)
    log_diff = np.broadcast_to(elos_col, (N, N)) - log_pairwise_sums
    # We want to maximize the loglikelihood of the observed w with respect to elos
    loglikelihood = float(np.sum(w * log_diff))
    # Return the loss that we're trying to minimize
    return -loglikelihood


def calculate_elos(
    w: np.ndarray,
    *,
    epsilon: float = 1e-5,
    initial_elos: np.ndarray | None = None,
    max_iters: int = 1000,
) -> np.ndarray:
    N = len(w)
    elos = np.zeros(N) if initial_elos is None else initial_elos.astype(float)
    last_loss = None
    converged = False
    for _ in range(max_iters):
        # Create all pairwise differences elo_j - elo_i in a matrix
        # outer(ones, elos) - outer(elos, ones)
        D = elos.reshape(1, N) - elos.reshape(N, 1)  # Shape: (N, N)
        # S[i, j] = sigmoid(elo_j - elo_i)
        S = 1.0 / (1.0 + np.exp(-D))
        # Calculate the update terms
        numerator = np.sum(w * S, axis=1)  # Shape: (N,)
        denominator = np.sum(w.T * S.T, axis=1)  # Shape: (N,)
        # Apply update rule
        elos += np.log(numerator) - np.log(denominator)
        # Normalize to mean-zero ELOs
        elos -= np.mean(elos)
        # Calculate loss for this iteration and check for convergence
        loss = elos_loss(w, elos)
        if last_loss is not None and abs(loss - last_loss) < epsilon:
            converged = True
            break
        last_loss = loss
    if not converged:
        raise RuntimeError("Could not converge!")
    return elos
```
Experimentally, the loss converges (successive changes below $10^{-5}$) in about 10 to 30 steps on real-life $w$ matrices.
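For example, one way to exercise `calculate_elos` on synthetic data (the ground-truth ELOs and the construction of `w` below are made up for illustration):
```python
import numpy as np

# Hypothetical ground-truth ELOs; w is the exact matrix of model win probabilities,
# so w[i, j] = sigmoid(elo_i - elo_j) and the diagonal is automatically 1/2.
true_elos = np.array([1.5, 0.5, -0.5, -1.5])
diffs = true_elos.reshape(-1, 1) - true_elos.reshape(1, -1)
w = 1.0 / (1.0 + np.exp(-diffs))

estimated = calculate_elos(w)
print(estimated)  # should approximately recover true_elos (which are already mean-zero)
```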
