---
# System prepended metadata

title: Better Algorithm for ELO

---

# Better algorithm for calculating ELOs

Let's say you had a matrix $w$ where $w_{ij}$ is known to be the probability that player $i$ beat player $j$. A common algorithm to create "Player Scores" is ELO, where you assign scores such that $w_{ij} \approx \sigma(elo_i - elo_j)$ (Where $\sigma$ is the [sigmoid](https://en.wikipedia.org/wiki/Logistic_function) function). A formal explanation is given [here](https://hackmd.io/0aXg6seERL6uju31T0EYuA).

On [ELO's Wikipedia](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) it says that this is the SoTA algorithm for calculating ELOs:

![image](https://hackmd.io/_uploads/By5uE0sxgg.png)

It definitely converges fast, but in actually experimenting with it, there are surprising pitfalls:

- You _must_ set $w_{ii} = \frac{1}{2} \; \forall i$ (More precisely, any positive constant will do)
- $p_i$ over time is numerically unstable. E.g. if $p_i$ gets close to 0, you divide by 0 error or explode the next iteration. When $p_i$ explodes, it's hard to calculate terms accurately.

The first bullet point is easy to fix, but the second bullet point requires some algebra.

## Numerically stable iteration on $\vec{e}$

For a given step $s$ in the iteration process, let's call $\vec{p}^{(s)}$ the parameter vector at that step, and $\vec{e}^{(s)} = \ln(\vec{p}^{(s)})$ the ELO vector. Ideally, we iterate over the $\vec{e}$ vector instead of the $\vec{p}$ vector, since $\vec{p}$ becomes unstable when a component is close to $0$ or $\infty$, but $\vec{e}$ will generally stay within $(-10, 10)$ and distances between ELOs are resilient to noise.

First we substitute $\vec{e}$ for $\vec{p}$ to get an iterative formula on $\vec{e}$,

$$
\vec{p}_i^{(s+1)} = \frac{\sum_j w_{ij}\frac{\vec{p}_j^{(s)}}{\vec{p}_i^{(s)} + \vec{p}_j^{(s)}}}{\sum_j w_{ji}\frac{1}{\vec{p}_i^{(s)} + \vec{p}_j^{(s)}}}, \;\;\;\vec{p}^{(0)} = \vec{1}
$$

$$
e^{\vec{e}_i^{(s+1)}} = \frac{\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}{\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}, \;\;\;\vec{e}^{(0)} = \vec{0}
$$

$$
\vec{e}_i^{(s+1)} = \ln\left(\frac{\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}{\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}\right), \;\;\;\vec{e}^{(0)} = \vec{0}
$$

Now, we can simplify the expression algebraically,

Split into separate logarithms:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}\right)$$

Factor out $e^{\vec{e}_i^{(s)}}$ from denominators:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}}(1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}})}\right) - \ln\left(\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}}(1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}})}\right)$$

Simplify the numerator fractions:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{e^{-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$

Extract the $e^{-\vec{e}_i^{(s)}}$ factor from the second term:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(e^{-\vec{e}_i^{(s)}}\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$

Split the logarithm:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln(e^{-\vec{e}_i^{(s)}}) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$

Simplify $\ln(e^{-\vec{e}_i^{(s)}})$:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - (-\vec{e}_i^{(s)}) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$

Simplify:
$$\vec{e}_i^{(s+1)} = \vec{e}_i^{(s)} + \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$

Recognize that $\frac{e^x}{1+e^x} = \sigma(x)$ and $\frac{1}{1+e^x} = \sigma(-x)$:
$$\vec{e}_i^{(s+1)} = \vec{e}_i^{(s)} + \ln\left(\sum_j w_{ij}\sigma(\vec{e}_j^{(s)}-\vec{e}_i^{(s)})\right) - \ln\left(\sum_j w_{ji}\sigma(\vec{e}_i^{(s)}-\vec{e}_j^{(s)})\right)$$

And just like that we have a numerically stable algorithm that's easy to compute. To match the original algorithm's normalization that $\prod_i\vec{p}_i = 1$, we normalize the ELOs to a mean of $0$ by setting $\vec{e}^{(s+1)} \leftarrow \vec{e}^{(s+1)} - \frac{\vec{e}^{(s+1)} \cdot \vec{1}}{\dim{\vec{e}}}$.

## Matmuls

Of course, this algorithm provides a formula that must be computed for every single component of $\vec{e}$. Ideally, we use a matrix representation.

Let's define:
- $D^{(s)}$ as the matrix where $D_{ij}^{(s)} = \vec{e}_j^{(s)} - \vec{e}_i^{(s)}$
- $S^{(s)} = \sigma(D^{(s)})$ applied element-wise

Then:
- The first sum becomes: $\sum_j w_{ij}\sigma(\vec{e}_j^{(s)}-\vec{e}_i^{(s)}) = (W \odot S^{(s)})\vec{1}$
- The second sum becomes: $\sum_j w_{ji}\sigma(\vec{e}_i^{(s)}-\vec{e}_j^{(s)}) = (W^T \odot (S^{(s)})^T)\vec{1}$

Therefore, the matrix formula can be written as:

$$D^{(s)} = \mathbf{1}(\vec{e}^{(s)})^T - \vec{e}^{(s)}\mathbf{1}^T$$
$$S^{(s)} = \sigma(D^{(s)})$$
$$\vec{e}^{(s+1)} = \vec{e}^{(s)} + \ln\left((W \odot S^{(s)})\vec{1}\right) - \ln\left((W^T \odot (S^{(s)})^T)\vec{1}\right)$$

where:
- $\odot$ is the Hadamard (element-wise) product
- $\vec{1}$ is a vector of ones
- $\ln$ and $\sigma$ are applied element-wise

## NumPy Code

In NumPy, this can be written as,

```python=
def elos_loss(
    w: np.ndarray,
    elos: np.ndarray,
) -> float:
    N = len(elos)
    elos_col = elos.reshape(-1, 1)
    elos_row = elos.reshape(1, -1)
   
    # Stable computation of log(exp(elos_i) + exp(elos_j))
    max_elos = np.maximum(elos_col, elos_row)
    log_pairwise_sums = max_elos + np.log(np.exp(elos_col - max_elos) + np.exp(elos_row - max_elos))
   
    # Calculate elos_i - log(exp(elos_i) + exp(elos_j))
    log_diff = np.broadcast_to(elos_col, (N, N)) - log_pairwise_sums
   
    # We want to maximize the loglikelihood of the observed w with respect to elos
    loglikelihood = float(np.sum(w * log_diff))

    # Return the loss that we're trying to minimize
    return -loglikelihood
   
def calculate_elos(
    w: np.ndarray,
    *,
    epsilon: float = 1e-5,
    initial_elos: np.ndarray | None = None,
    max_iters: int = 1000,
) -> np.ndarray:
    N = len(w)
    elos = initial_elos or np.zeros(N)
    
    last_loss = None
    for iter in range(max_iters):
        # Create all pairwise differences elo_j - elo_i in a matrix
        # outer(ones, elos) - outer(elos, ones)
        D = elos.reshape(1, N) - elos.reshape(N, 1)  # Shape: (N, N)
        
        # S[i,j] = sigmoid(elo_j - elo_i)
        S = 1.0 / (1.0 + np.exp(-D))
        
        # Calculate the update terms
        numerator = np.sum(w * S, axis=1) # Shape: (N,)
        denominator = np.sum(w.T * S.T, axis=1) # Shape: (N,)
        
        # Apply update rule
        elos += np.log(numerator) - np.log(denominator)
        
        # Normalize
        elos -= np.mean(elos)
        
        # Calculate loss for this iteration
        loss = elos_loss(w, elos)
        if last_loss is not None and abs(loss-last_loss) < epsilon:
            break
        last_loss = loss
    
    # Check for convergence
    if abs(loss - last_loss) > epsilon:
        raise RuntimeError("Could not converge!")
    
    return elos
```

Experimentally, the loss converges to within $10^{-5}$ in about 10-to-30 steps on real-life $w$ matrices.

![image](https://hackmd.io/_uploads/H1ZvRl3ege.png)