# Empirical ELO Calculations
There are a few open problems at ZeroEntropy related to ELO. ELO is an [MLE](https://hackmd.io/7PwDRMtXS1CIutUel3XDAg)-based model for converting an observed $N \times N$ weight matrix $W$ into a length-$N$ elo vector $\vec{e}$.
First we'll do some setup.
Let's say we have a corpus $\mathcal{C}$ of documents, and an array of documents $d_1, ..., d_N \in \mathcal{C}$ that we want to calculate ELO scores for.
Our goal is to create an $N \times N$ weight matrix $W$ such that
$$W_{ij} = \mathbf{P}(d_i \text{ beats } d_j)$$
We do this by modeling each document $d_i$ as having a latent "elo score" $\vec{e}_i$, such that
$$\mathbf{P}(d_i \text{ beats } d_j) = \sigma(\vec{e}_i-\vec{e}_j)$$
Where $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function. This modeling presumption takes us from $O(N^2)$ free parameters in $W$, to merely $O(N)$ free parameters in $\vec{e}$.
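For concreteness, here's a minimal sketch of this model in Python with numpy (the language and function names here are illustrative choices, not prescribed by anything above): given an elo vector, it builds the modeled win matrix $W$.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def modeled_win_matrix(e: np.ndarray) -> np.ndarray:
    """W[i, j] = sigma(e_i - e_j), the modeled P(d_i beats d_j).

    `e` is a length-N elo vector; the result is N x N.
    """
    return sigmoid(e[:, None] - e[None, :])
```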
Of course, we can't truly know $\mathbf{P}(d_i \text{ beats } d_j)$; instead, we sample games and use those samples to deduce the most likely elo vector. The process of doing so is described below:
## Empirical ELO
Let's say that a set of games $G = \{g_1, ..., g_{|G|}\}$ has been played, where each game $g_x$ is represented by a 2-tuple of indices $(i, j)$, indicating that $d_i$ beat $d_j$ in that game. Let's define the empirical $\tilde{W}$ such that
$$\tilde{W}_{ij} = |\{g \in G : g = (i,j)\}|$$
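As a quick sketch (again in numpy, with a hypothetical helper name), counting $\tilde{W}$ from a list of games:

```python
import numpy as np

def empirical_win_counts(games: list[tuple[int, int]], n: int) -> np.ndarray:
    """W_tilde[i, j] = |{g in G : g = (i, j)}|, i.e. # of times d_i beat d_j."""
    w_tilde = np.zeros((n, n))
    for i, j in games:
        w_tilde[i, j] += 1.0
    return w_tilde

# e.g. games = [(0, 1), (0, 1), (2, 0)]: d_0 beat d_1 twice, d_2 beat d_0 once
```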
Then, let's say we have a proposed elo vector $\hat{\vec{e}}$. Under this proposal, we have that
$$\mathbf{P}(d_i \text{ beats } d_j) = \sigma(\hat{\vec{e}}_i-\hat{\vec{e}}_j)$$
Then, we can calculate the probability of empirically observing $\tilde{W}$ under the presumption that $\hat{\vec{e}}$ is the underlying ELO (up to a multinomial coefficient that doesn't depend on $\hat{\vec{e}}$, and so doesn't affect the maximization below),
$$
\mathbf{P}(\tilde{W} \mid \hat{\vec{e}}) = \prod_{i,j} \left[ \mathbf{P}(d_i \text{ beats } d_j) \right]^{\tilde{W}_{ij}} = \prod_{i,j} \left[ \sigma(\hat{\vec{e}}_i - \hat{\vec{e}}_j) \right]^{\tilde{W}_{ij}}
$$
Taking the natural logarithm, we obtain the log-likelihood of the observed results:
$$
\log \mathbf{P}(\tilde{W} \mid \hat{\vec{e}}) = \sum_{i,j} \tilde{W}_{ij} \log \sigma(\hat{\vec{e}}_i - \hat{\vec{e}}_j)
$$
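Numerically, note that $\log \sigma(x) = -\log(1 + e^{-x})$, which can be computed stably with `np.logaddexp`. A minimal sketch of this log-likelihood, under the same numpy assumptions as above:

```python
import numpy as np

def log_likelihood(e_hat: np.ndarray, w_tilde: np.ndarray) -> float:
    """sum_{i,j} W_tilde[i, j] * log sigma(e_hat[i] - e_hat[j])."""
    diffs = e_hat[:, None] - e_hat[None, :]
    # log sigma(x) = -log(1 + e^{-x}) = -logaddexp(0, -x); stable for large |x|
    log_sigma = -np.logaddexp(0.0, -diffs)
    return float(np.sum(w_tilde * log_sigma))
```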
Then, using MLE, we declare that the best-fit $\hat{\vec{e}}$ is such that the log-likelihood is maximized,
$$
\hat{\vec{e}}_\text{best} = \underset{\hat{\vec{e}}}{\arg\max}\;\mathbf{P}(\tilde{W} \mid \hat{\vec{e}}) = \underset{\hat{\vec{e}}}{\arg\max} \log \mathbf{P}(\tilde{W} \mid \hat{\vec{e}}) = \underset{\hat{\vec{e}}}{\arg\max} \sum_{i,j} \tilde{W}_{ij} \log \sigma(\hat{\vec{e}}_i - \hat{\vec{e}}_j)
$$
This can then be solved with gradient descent, or with [this optimized algorithm for ELO](https://hackmd.io/x3_EkXGKRdeq-rNHo_RpZA). Note that the log-likelihood is invariant under adding a constant to every entry of $\hat{\vec{e}}$, so the solution is only unique up to a shift; a common convention is to center the elo vector at mean zero.
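As a sketch of the gradient route (plain full-batch gradient ascent on the log-likelihood; the learning rate and step count are arbitrary illustrative choices, not tuned values): differentiating the log-likelihood gives $\frac{\partial}{\partial \hat{\vec{e}}_k} = \sum_j \tilde{W}_{kj}\,(1 - \sigma(\hat{\vec{e}}_k - \hat{\vec{e}}_j)) - \sum_i \tilde{W}_{ik}\,(1 - \sigma(\hat{\vec{e}}_i - \hat{\vec{e}}_k))$, which the code below implements.

```python
import numpy as np

def fit_elo(w_tilde: np.ndarray, lr: float = 0.1, n_steps: int = 2000) -> np.ndarray:
    """Maximize sum_{i,j} W_tilde[i, j] * log sigma(e_i - e_j) by gradient ascent."""
    n = w_tilde.shape[0]
    e = np.zeros(n)
    for _ in range(n_steps):
        s = 1.0 / (1.0 + np.exp(-(e[:, None] - e[None, :])))  # s[i, j] = sigma(e_i - e_j)
        g = w_tilde * (1.0 - s)  # each game (i, j) pushes e_i up and e_j down by this much
        grad = g.sum(axis=1) - g.sum(axis=0)  # dL/de_k: wins as i, minus losses as j
        e += lr * grad
        e -= e.mean()  # fix the shift degeneracy by centering at mean zero
    return e
```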