# Better algorithm for calculating ELOs
Let's say you have a matrix $w$ where $w_{ij}$ is the known probability that player $i$ beats player $j$. A common way to turn this into per-player scores is ELO: assign each player a score such that $w_{ij} \approx \sigma(elo_i - elo_j)$, where $\sigma$ is the [sigmoid](https://en.wikipedia.org/wiki/Logistic_function) function. A formal explanation is given [here](https://hackmd.io/0aXg6seERL6uju31T0EYuA).
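As a quick illustration of what the model predicts (a minimal sketch I added; the function name is mine, and these are the natural-scale ELOs used throughout this post, not the chess 400-point scale):
```python
import numpy as np

def predicted_win_probability(elo_i: float, elo_j: float) -> float:
    """Model's predicted probability that player i beats player j."""
    return 1.0 / (1.0 + np.exp(-(elo_i - elo_j)))

# An ELO advantage of ln(3) ~= 1.1 corresponds to a 75% win probability,
# since sigmoid(ln 3) = 0.75.
print(predicted_win_probability(1.1, 0.0))  # ~0.75
```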
The [Wikipedia page for the Bradley–Terry model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model) (the model underlying these ELOs) presents the following iteration as the state-of-the-art algorithm for computing the scores $\vec{p}$, where $elo_i = \ln p_i$:
$$
p_i' = \frac{\sum_j w_{ij}\frac{p_j}{p_i + p_j}}{\sum_j w_{ji}\frac{1}{p_i + p_j}}
$$

It definitely converges fast, but when actually experimenting with it, there are some surprising pitfalls:
- You _must_ set $w_{ii} = \frac{1}{2} \; \forall i$ (more precisely, any positive constant will do).
- Iterating on $p_i$ directly is numerically unstable. E.g. if $p_i$ gets close to $0$, you get a divide-by-zero error or the next iterate explodes, and once a $p_i$ explodes, the remaining terms can no longer be computed accurately.

The first bullet point is easy to fix, but the second one requires some algebra.
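To make the second pitfall concrete, here is a minimal sketch (my own, not code from the Wikipedia article) of one step of the iteration above, operating directly on $\vec{p}$; the divisions here are where accuracy is lost once components of `p` drift toward $0$ or blow up:
```python
import numpy as np

def naive_bradley_terry_step(w: np.ndarray, p: np.ndarray) -> np.ndarray:
    """One step of the iteration above, computed directly on p."""
    N = len(p)
    p_next = np.empty(N)
    for i in range(N):
        pair_sums = p[i] + p  # p_i + p_j for every j
        numerator = np.sum(w[i, :] * p / pair_sums)
        denominator = np.sum(w[:, i] / pair_sums)
        p_next[i] = numerator / denominator
    return p_next
```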
## Numerically stable iteration on $\vec{e}$
For a given step $s$ in the iteration process, let's call $\vec{p}^{(s)}$ the parameter vector at that step, and $\vec{e}^{(s)} = \ln(\vec{p}^{(s)})$ the ELO vector. Ideally, we iterate over the $\vec{e}$ vector instead of the $\vec{p}$ vector, since $\vec{p}$ becomes unstable when a component is close to $0$ or $\infty$, but $\vec{e}$ will generally stay within $(-10, 10)$ and distances between ELOs are resilient to noise.
First we substitute $\vec{p} = e^{\vec{e}}$ to get an iterative formula on $\vec{e}$,
$$
\vec{p}_i^{(s+1)} = \frac{\sum_j w_{ij}\frac{\vec{p}_j^{(s)}}{\vec{p}_i^{(s)} + \vec{p}_j^{(s)}}}{\sum_j w_{ji}\frac{1}{\vec{p}_i^{(s)} + \vec{p}_j^{(s)}}}, \;\;\;\vec{p}^{(0)} = \vec{1}
$$
$$
e^{\vec{e}_i^{(s+1)}} = \frac{\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}{\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}, \;\;\;\vec{e}^{(0)} = \vec{0}
$$
$$
\vec{e}_i^{(s+1)} = \ln\left(\frac{\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}{\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}}\right), \;\;\;\vec{e}^{(0)} = \vec{0}
$$
Now we can simplify the expression algebraically. First, split into separate logarithms:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}} + e^{\vec{e}_j^{(s)}}}\right)$$
Factor out $e^{\vec{e}_i^{(s)}}$ from denominators:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}}}{e^{\vec{e}_i^{(s)}}(1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}})}\right) - \ln\left(\sum_j w_{ji}\frac{1}{e^{\vec{e}_i^{(s)}}(1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}})}\right)$$
Simplify the fractions in both sums:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{e^{-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Extract the $e^{-\vec{e}_i^{(s)}}$ factor from the second term:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(e^{-\vec{e}_i^{(s)}}\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Split the logarithm:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln(e^{-\vec{e}_i^{(s)}}) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Simplify $\ln(e^{-\vec{e}_i^{(s)}})$:
$$\vec{e}_i^{(s+1)} = \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - (-\vec{e}_i^{(s)}) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Simplify:
$$\vec{e}_i^{(s+1)} = \vec{e}_i^{(s)} + \ln\left(\sum_j w_{ij}\frac{e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right) - \ln\left(\sum_j w_{ji}\frac{1}{1 + e^{\vec{e}_j^{(s)}-\vec{e}_i^{(s)}}}\right)$$
Recognize that $\frac{e^x}{1+e^x} = \sigma(x)$ and $\frac{1}{1+e^x} = \sigma(-x)$:
$$\vec{e}_i^{(s+1)} = \vec{e}_i^{(s)} + \ln\left(\sum_j w_{ij}\sigma(\vec{e}_j^{(s)}-\vec{e}_i^{(s)})\right) - \ln\left(\sum_j w_{ji}\sigma(\vec{e}_i^{(s)}-\vec{e}_j^{(s)})\right)$$
And just like that we have a numerically stable algorithm that's easy to compute. To match the original algorithm's normalization that $\prod_i\vec{p}_i = 1$, we normalize the ELOs to a mean of $0$ by setting $\vec{e}^{(s+1)} \leftarrow \vec{e}^{(s+1)} - \frac{\vec{e}^{(s+1)} \cdot \vec{1}}{\dim{\vec{e}}}$.
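As a quick sanity check of this update (a worked example added here, not part of the original derivation), take two players with $w_{12} = 0.75$, $w_{21} = 0.25$, and $w_{11} = w_{22} = \frac{1}{2}$. The expected fixed point is $\vec{e}_1 - \vec{e}_2 = \ln 3$, since $\sigma(\ln 3) = 0.75 = w_{12}$. Plugging the mean-zero vector $\vec{e} = \left(\frac{\ln 3}{2}, -\frac{\ln 3}{2}\right)$ into the update for $i = 1$:
$$
\vec{e}_1^{(s+1)} = \vec{e}_1^{(s)} + \ln\!\left(\tfrac{1}{2}\sigma(0) + \tfrac{3}{4}\sigma(-\ln 3)\right) - \ln\!\left(\tfrac{1}{2}\sigma(0) + \tfrac{1}{4}\sigma(\ln 3)\right) = \vec{e}_1^{(s)} + \ln(0.4375) - \ln(0.4375) = \vec{e}_1^{(s)},
$$
so it is indeed a fixed point (and by symmetry the same holds for $i = 2$).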
## Matmuls
Of course, this formula must be computed for every single component of $\vec{e}$. Ideally, we express one full update step using matrix operations.
Let's define:
- $D^{(s)}$ as the matrix where $D_{ij}^{(s)} = \vec{e}_j^{(s)} - \vec{e}_i^{(s)}$
- $S^{(s)} = \sigma(D^{(s)})$ applied element-wise
Then:
- The first sum becomes: $\sum_j w_{ij}\sigma(\vec{e}_j^{(s)}-\vec{e}_i^{(s)}) = (W \odot S^{(s)})\vec{1}$
- The second sum becomes: $\sum_j w_{ji}\sigma(\vec{e}_i^{(s)}-\vec{e}_j^{(s)}) = (W^T \odot (S^{(s)})^T)\vec{1}$
Therefore, the matrix formula can be written as:
$$D^{(s)} = \mathbf{1}(\vec{e}^{(s)})^T - \vec{e}^{(s)}\mathbf{1}^T$$
$$S^{(s)} = \sigma(D^{(s)})$$
$$\vec{e}^{(s+1)} = \vec{e}^{(s)} + \ln\left((W \odot S^{(s)})\vec{1}\right) - \ln\left((W^T \odot (S^{(s)})^T)\vec{1}\right)$$
where:
- $\odot$ is the Hadamard (element-wise) product
- $\vec{1}$ is a vector of ones
- $\ln$ and $\sigma$ are applied element-wise
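One small observation (mine, not from the derivation above): $\left((W^T \odot (S^{(s)})^T)\vec{1}\right)_i = \sum_j w_{ji} S^{(s)}_{ji}$ is just the $i$-th column sum of $W \odot S^{(s)}$, so the Hadamard product only needs to be formed once per step. A minimal NumPy sketch of one vectorized update, with an arbitrary placeholder $w$ used purely to check the identity:
```python
import numpy as np

rng = np.random.default_rng(0)
elos = rng.normal(size=5)
w = rng.uniform(0.1, 0.9, size=(5, 5))  # arbitrary placeholder matrix

D = elos.reshape(1, -1) - elos.reshape(-1, 1)  # D[i, j] = e_j - e_i
S = 1.0 / (1.0 + np.exp(-D))                   # sigmoid, element-wise

WS = w * S
first = WS @ np.ones(5)            # (W hadamard S) times 1, i.e. row sums
second = (w.T * S.T) @ np.ones(5)  # (W^T hadamard S^T) times 1
assert np.allclose(second, WS.sum(axis=0))  # ... = column sums of W hadamard S

e_next = elos + np.log(first) - np.log(second)
e_next -= e_next.mean()
```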
## NumPy Code
In NumPy, this can be written as,
```python=
import numpy as np


def elos_loss(
    w: np.ndarray,
    elos: np.ndarray,
) -> float:
    N = len(elos)
    elos_col = elos.reshape(-1, 1)
    elos_row = elos.reshape(1, -1)
    # Stable computation of log(exp(elos_i) + exp(elos_j))
    max_elos = np.maximum(elos_col, elos_row)
    log_pairwise_sums = max_elos + np.log(np.exp(elos_col - max_elos) + np.exp(elos_row - max_elos))
    # Calculate elos_i - log(exp(elos_i) + exp(elos_j)), i.e. log sigmoid(elos_i - elos_j)
    log_diff = np.broadcast_to(elos_col, (N, N)) - log_pairwise_sums
    # We want to maximize the loglikelihood of the observed w with respect to elos
    loglikelihood = float(np.sum(w * log_diff))
    # Return the loss that we're trying to minimize
    return -loglikelihood


def calculate_elos(
    w: np.ndarray,
    *,
    epsilon: float = 1e-5,
    initial_elos: np.ndarray | None = None,
    max_iters: int = 1000,
) -> np.ndarray:
    N = len(w)
    elos = np.zeros(N) if initial_elos is None else initial_elos.astype(float)
    last_loss = None
    converged = False
    for _ in range(max_iters):
        # Create all pairwise differences elo_j - elo_i in a matrix
        # outer(ones, elos) - outer(elos, ones)
        D = elos.reshape(1, N) - elos.reshape(N, 1)  # Shape: (N, N)
        # S[i, j] = sigmoid(elo_j - elo_i)
        S = 1.0 / (1.0 + np.exp(-D))
        # Calculate the update terms
        numerator = np.sum(w * S, axis=1)  # Shape: (N,)
        denominator = np.sum(w.T * S.T, axis=1)  # Shape: (N,)
        # Apply update rule
        elos += np.log(numerator) - np.log(denominator)
        # Normalize to mean-zero ELOs
        elos -= np.mean(elos)
        # Calculate loss for this iteration and check for convergence
        loss = elos_loss(w, elos)
        if last_loss is not None and abs(loss - last_loss) < epsilon:
            converged = True
            break
        last_loss = loss
    if not converged:
        raise RuntimeError("Could not converge!")
    return elos
```
Experimentally, the loss converges (successive changes below $10^{-5}$) in about 10 to 30 steps on real-life $w$ matrices.
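For example, one way to exercise `calculate_elos` on synthetic data (the ground-truth ELOs and the construction of `w` below are made up for illustration):
```python
import numpy as np

# Hypothetical ground-truth ELOs; w is the exact matrix of model win probabilities,
# so w[i, j] = sigmoid(elo_i - elo_j) and the diagonal is automatically 1/2.
true_elos = np.array([1.5, 0.5, -0.5, -1.5])
diffs = true_elos.reshape(-1, 1) - true_elos.reshape(1, -1)
w = 1.0 / (1.0 + np.exp(-diffs))

estimated = calculate_elos(w)
print(estimated)  # should approximately recover true_elos (which are already mean-zero)
```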
