# Learning Pairwise Comparisons
Let's say that we have an Oracle pairwise comparison function $C_O: Q \times \mathcal{C} \times \mathcal{C} \rightarrow [0, 1]$, that maps a query and two documents in the corpus $\mathcal{C}$ to a value between $0$ and $1$, where $0$ implies that the first document is better and $1$ implies the second document is better ("better" as in, more relevant to the query). This pairwise comparison function can immediately be used to ELO-score an array of documents into absolute scores.
But there's an issue: the Oracle is very expensive to query. In practice, we define $C_O$ as sampling an Ensemble of large LLMs (OpenAI, Gemini, Anthropic) and taking the average. Each LLM is prompted to output a score between -1.0 and 1.0, which we transform and clamp to one of $\{0.0, 0.5, 1.0\}$ (any negative goes to $0.0$, any positive goes to $1.0$, an error or zero goes to $0.5$). When taking the average of 3, we get $\{0.00, 0.33, 0.50, 0.66, 1.0\}$, based on either strong consensus, a weak 2/3 majority, or $0.5$ if there was no majority. This costs \$20/1000 comparisons. To make this more cost-effective, we train a model $M$ to create a comparison function $C_M: Q \times \mathcal{C} \times \mathcal{C} \rightarrow [0, 1]$. The model can be trained using standard [binary cross-entropy (BCE) classification loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) against the Ensemble's annotations. When using an 8B model, this costs <\$0.10/1000 comparisons at inference, over two orders of magnitude cheaper.
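For concreteness, here is a minimal sketch of that clamp-and-average step (the LLM calls and prompt parsing are omitted, and the function names are illustrative, not from an existing codebase):

```python=
def clamp_vote(raw: float | None) -> float:
    """Map one LLM's raw score in [-1.0, 1.0] to {0.0, 0.5, 1.0}.

    Negative -> 0.0 (first document better), positive -> 1.0 (second better),
    zero or a parsing error (None) -> 0.5.
    """
    if raw is None or raw == 0.0:
        return 0.5
    return 1.0 if raw > 0.0 else 0.0

def oracle_score(raw_votes: list[float | None]) -> float:
    """C_O: average of the three clamped ensemble votes."""
    clamped = [clamp_vote(v) for v in raw_votes]
    return sum(clamped) / len(clamped)

# e.g. two models prefer the second document, one prefers the first -> 0.66...
print(oracle_score([0.8, 0.4, -0.9]))
```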
## Experimental Results
We ran this experiment in practice; the results are shown below.
First, a dataset is generated where $\mathcal{C}$ is the set of 29M PubMed documents, $Q = \{q_1, ..., q_{5300}\}$ is the set of $5300$ queries generated by [BioASQ](https://bioasq.org/), and $\{D_1, ..., D_{5300}\}$ is a set of per-question candidate document arrays, where $|D_i| = 100\; \forall i$. The document arrays were generated using [ZeroEntropy's hybrid search](https://docs.zeroentropy.dev/architecture) for each question.
From this, a dataset of $10000$ pairs was generated, where queries are sampled evenly ($4700$ questions sampled twice, $600$ questions sampled once), and two distinct random documents $1 \le i, j \le 100$, $i \ne j$, are used for each pair. Additionally, pairs are sampled such that the same document is never sampled twice for a given question. Then, we split the $10000$ pairs into $8000$ train and $2000$ test. Of the $2000$ test pairs, $1441$ are full-consensus.
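A sketch of this sampling procedure, under the assumption that "never sampled twice" means no document index is reused within the same question, might look like:

```python=
import random

def sample_pairs(question_ids: list[str]) -> list[tuple[str, int, int]]:
    """Sample (question, doc_i, doc_j) index pairs over each question's 100 candidates."""
    pairs = []
    for rank, q in enumerate(question_ids):
        n_pairs = 2 if rank < 4700 else 1      # 4700 questions twice, 600 once -> 10000 pairs
        available = list(range(100))
        for _ in range(n_pairs):
            i, j = random.sample(available, 2)  # two distinct documents
            available = [d for d in available if d not in (i, j)]  # never reuse a document
            pairs.append((q, i, j))
    return pairs

pairs = sample_pairs([f"q{k}" for k in range(5300)])
random.shuffle(pairs)
train, test = pairs[:8000], pairs[8000:]
```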

> Figure 1. Training results, where the y-axis is the fraction of the $1441$ consensus pairs on which the 8B Llama model preferred the same document that the Ensemble reached consensus on.
If we train on only $0/1$ consensus pairs, the model will never learn "confidence", and will always pick a random $0/1$ because it was never penalized for guessing. So, we train on the ternary classification "A"/"B"/"Tie", where $\mathbf{P}(\text{"Tie"}) = 1 - 2\,|\text{Ensemble Score} - 0.5|$. Note that $0 \le \mathbf{P}_{\text{Llama 8b}}(\text{"A"}) + \mathbf{P}_{\text{Llama 8b}}(\text{"B"}) \le 1$ in this case. Now, instead of testing on only the $1441$ consensus pairs, we can test on all $2000$, where the "ensemble score" takes the values $\{0.00, 0.33, 0.50, 0.66, 1.0\}$ we discussed earlier.
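Here is one way to build those ternary soft labels from the ensemble score (a sketch; the split of the non-"Tie" mass is not spelled out above, so assigning it all to the majority side, which keeps $\mathbf{P}(\text{"B"}) + \mathbf{P}(\text{"Tie"})/2$ equal to the ensemble score, is an assumption):

```python=
def ternary_labels(ensemble_score: float) -> dict[str, float]:
    """Soft targets for "A"/"B"/"Tie" given an ensemble score in [0, 1].

    P("Tie") = 1 - 2 * |score - 0.5|; the remaining mass goes to the majority
    side, so that P("B") + P("Tie") / 2 equals the ensemble score.
    """
    p_tie = 1.0 - 2.0 * abs(ensemble_score - 0.5)
    p_b = max(2.0 * ensemble_score - 1.0, 0.0)
    p_a = max(1.0 - 2.0 * ensemble_score, 0.0)
    return {"A": p_a, "B": p_b, "Tie": p_tie}

# e.g. ternary_labels(0.75) -> {"A": 0.0, "B": 0.5, "Tie": 0.5}
```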

> Figure 2. We create a 2000-point scatterplot of the test set, where the color is the ensemble score and $(x, y) = (\mathbf{P}_{\text{Llama 8b}}(\text{"A"}), \mathbf{P}_{\text{Llama 8b}}(\text{"B"}))$, and watch the scatterplot evolve as the model trains (the title shows the training step count).
>
> It's of interest that the "shape" of the triangle never really converged, but kept oscillating with each checkpoint.
Now we can "fold" the scatterplot along $y = x$: any point with $y > x$ is reflected across the line $y = x$ and has its color flipped (making "Red" the dominant color).
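Concretely, the fold is just a reflection plus a label flip; a minimal sketch over the $(\mathbf{P}(\text{"A"}), \mathbf{P}(\text{"B"}))$ points and their ensemble scores:

```python=
import numpy as np

def fold(points: np.ndarray, scores: np.ndarray):
    """Reflect points with y > x across y = x, and flip their ensemble-score color."""
    points, scores = points.copy(), scores.copy()
    flip = points[:, 1] > points[:, 0]
    points[flip] = points[flip][:, ::-1]  # reflect across y = x (swap the two coordinates)
    scores[flip] = 1.0 - scores[flip]     # flip the color (blue <-> red)
    return points, scores
```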

> Figure 3. The region marked with purple arrows is where we "want" all of the points to be. We also "want" all points to be marked as red (Never blue).
In this case, we can visually see that most of the misclassified blue points also have low confidence, which is good. But the region within the purple arrows is dense and hard to interpret, so we create a new graph by drawing a purple circle with center $(1, 0)$ and radius $r$, and then, for a given radius, we consider the distribution of ensemble scores among all points within that radius.
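A sketch of that measurement (the distance of each folded point from $(1, 0)$, plus the score distribution inside a given radius; `points`/`scores` are the folded arrays from the sketch above):

```python=
import numpy as np
from collections import Counter

def scores_within_radius(points: np.ndarray, scores: np.ndarray, r: float):
    """Ensemble-score distribution among folded points within distance r of (1, 0)."""
    radii = np.hypot(points[:, 0] - 1.0, points[:, 1])
    inside = radii <= r
    yield_fraction = inside.mean()  # fraction of the 2000 test points inside the circle
    return Counter(scores[inside].tolist()), yield_fraction
```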

> Figure 4. We range the radius $r$ from $0 \rightarrow 1$.

> Figure 5. Here the x-axis is the purple circle radius $r$. While we could use the blue/red spectrum here, we now swap to one discrete color per ensemble score in $\{0.00, 0.33, 0.50, 0.66, 1.0\}$: "Red" maps to Green, "Blue" maps to Red, and "Gray" maps to Yellow. "Yield $\in [0, 1]$" is the fraction of points within the purple circle (i.e., the number of points within the purple circle, divided by 2000).
But note that this graph is cumulative, so we want to differentiate it. To do this, we split the points into twenty "radius-range" buckets, where each bucket contains an equal number of points. Then, we graph the distribution of ensemble scores within each bucket.
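A sketch of this bucketing, using equal-population (quantile) buckets over the radius:

```python=
import numpy as np

def radius_buckets(points: np.ndarray, scores: np.ndarray, n_buckets: int = 20):
    """Split folded points into equal-population buckets by radius from (1, 0)."""
    radii = np.hypot(points[:, 0] - 1.0, points[:, 1])
    order = np.argsort(radii)                   # rank points by radius
    buckets = np.array_split(order, n_buckets)  # twenty buckets of ~equal size
    # Per-bucket distribution of ensemble scores (the colors of the bucketed graph).
    return [np.unique(scores[idx], return_counts=True) for idx in buckets]
```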
In preparation, let's swap the x-axis and Yield curve,

> Figure 6. Now, the x-axis is rank, and the purple line is the radius $r$ of the purple circle.
Then, we swap from cumulative with respect to rank, to rank buckets,

> Figure 7. The purple line is Llama's predicted $\mu$, i.e., $\mathbf{P}_{\text{Llama 8b}}(\text{"B"}) + \frac{\mathbf{P}_{\text{Llama 8b}}(\text{"Tie"})}{2}$. The blue line is the observed $\mu_\text{Ensemble}$ for all points in that $\mu_\text{Llama 8b}$ bucket.
Somewhat surprisingly, Llama's self-measured confidence turns out to be well calibrated. In other words, when Llama predicts $\mu = 0.7$, the mean of the Ensemble's predictions within that bucket really is $0.7$. Theoretically, if we fed Llama random, unpredictable garbage, we would expect our training process to produce a flat purple line $y = 0.5$ (rather than a purple line $y = 1 - x$, i.e., predictions uniformly distributed over $[0, 1]$: exactly the same classification accuracy, but no concept of "self-measured confidence").
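This calibration check is easy to reproduce given the per-pair predictions: bucket the test pairs by Llama's predicted $\mu$ and compare each bucket's mean prediction with its mean ensemble score (a sketch; `p_b`, `p_tie`, and `ensemble` stand for per-pair arrays that are not shown in this writeup):

```python=
import numpy as np

def calibration_curve(p_b: np.ndarray, p_tie: np.ndarray, ensemble: np.ndarray, n_buckets: int = 20):
    """Compare Llama's predicted mu = P("B") + P("Tie")/2 with the mean ensemble score per bucket."""
    mu = p_b + p_tie / 2.0
    order = np.argsort(mu)
    buckets = np.array_split(order, n_buckets)
    predicted = [mu[idx].mean() for idx in buckets]       # analogous to the purple line in Figure 7
    observed = [ensemble[idx].mean() for idx in buckets]  # analogous to the blue line in Figure 7
    return predicted, observed
```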
## Predicting Ensemble's Confidence
Also, interestingly, because we projected the points from the 2D scatterplot onto the radius $r$, there's a simplification: we can drop $\mathbf{P}_{\text{Llama 8b}}(\text{"Tie"})$ and go back to binary classification. With binary classification, we only train a single logit to output $\mathbf{P}_{\text{Llama 8b}}(\text{"B"})$, and we set $\mathbf{P}_{\text{Llama 8b}}(\text{"A"}) = 1 - \mathbf{P}_{\text{Llama 8b}}(\text{"B"})$ and $\mathbf{P}_{\text{Llama 8b}}(\text{"Tie"}) = 0$. Training a model in this way has no effect on the graph in Figure 7. _However_, we _do_ lose information. If you look back at Figure 4, notice how the misclassified blue/red points are close to the line $y = 1 - x$, while the gray "tie" points are close to the origin $(0, 0)$.
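Under that simplification, the training target is a single number per pair, which can be trained with a soft-target BCE loss. A minimal PyTorch sketch (not the actual training code used here):

```python=
import torch
import torch.nn.functional as F

def binary_pairwise_loss(logits: torch.Tensor, ensemble: torch.Tensor) -> torch.Tensor:
    """Soft-target BCE: sigmoid(logit) models P("B"), with P("A") = 1 - P("B") and P("Tie") = 0."""
    return F.binary_cross_entropy_with_logits(logits, ensemble)

# Example: a batch of 4 pairs with ensemble scores as soft targets.
loss = binary_pairwise_loss(torch.randn(4), torch.tensor([0.0, 1 / 3, 0.5, 1.0]))
```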

> Figure 8. The same as Figure 4, but looking for patterns. Note how even within the same $\mu_\text{Llama 8b} = 0.55$ bucket, $\mathbf{P}_{\text{Llama 8b}}(\text{"Tie"})$ is a useful "confidence" measure. As in, within the $\mu_{\text{Llama 8b}} = 0.55$ bucket, it could be the case that the two documents are extremely equal (gray), OR it could be the case that the two documents have a large relevancy delta but the classifier is unable to figure out which document is better (colorful "hard-to-classify" region).
Proposal: I propose that, for Figure 4, using a purple circle centered at $(1, 0)$ was not ideal for measuring distance. Instead, for a given $(x, y)$, we should take the [perpendicular distance](https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line#:~:text=The%20distance%20\(or%20perpendicular%20distance,is%20perpendicular%20to%20the%20line.) from $(x, y)$ to each of the lines $y = x - 1$ and $y = 1 - x$. These distances are intended to measure $\mathbb{E}[C_O]$ and $\mathbb{E}[|C_O|]$, respectively (recall that $C_O$ is the Ensemble-annotated score).
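For reference, those two perpendicular distances are simple to compute from a point $(x, y) = (\mathbf{P}(\text{"A"}), \mathbf{P}(\text{"B"}))$; a sketch of the proposed features:

```python=
import numpy as np

def proposed_features(x: np.ndarray, y: np.ndarray):
    """Perpendicular distances from (x, y) = (P("A"), P("B")) to y = x - 1 and y = 1 - x."""
    d_signed = np.abs(x - y - 1.0) / np.sqrt(2.0)     # distance to y = x - 1 (paired with E[C_O])
    d_magnitude = np.abs(x + y - 1.0) / np.sqrt(2.0)  # distance to y = 1 - x (paired with E[|C_O|])
    return d_signed, d_magnitude
```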

> Figure 9. Notice how misclassified red/blue points are closer to $(0.35, 0.2)$, but gray points are nearer to $(0, 0)$. Even the colored points near $(0.2, 0.05)$ have a lighter hue, versus the vibrant colors near $(0.35, 0.2)$.
In order to break it down into equal-population segments in both dimensions, we can split the triangle like so:

> Figure 10. The triangle being segmented based on $\mathbb{E}[C_O]$ and $\mathbb{E}[|C_O|]$, where each segment has an approximately equal number of points.
But..., before I continue analyzing this deeper, let's take a step back and figure out where we can use $\mathbb{E}[|C_O|]$ in practice.
## Using $\mathbb{E}[|C_O|]$ when training a Pointwise Model
Our goal isn't to stay in pairwise comparison; we want to use ELO to train a model $M_\text{llama-8b-pointwise}$ on an absolute scoring task $A_{M_\text{llama-8b-pointwise}}: Q \times \mathcal{C} \rightarrow [0, 1]$. The way we do this is by following our [Empirical ELO calculations](https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/SyYxFaBWex), which can convert a pairwise dataset into an absolute-scored, synthetically generated SFT dataset. For a given question index $q$ and document array $D_q$ where $|D_q| = 100$, if we run inference with $M_\text{llama-8b-pairwise}$ on $\approx 100\log_2(100)$ pairs, we can assign absolute ELO scores to the documents, using the following algorithm:
```python=
import numpy as np

# N is the number of documents for this query (here, 100); pairs holds the
# model's pairwise predictions, where pair.score = C_M(q, d_i, d_j) in [0, 1].
w = np.zeros((N, N))
np.fill_diagonal(w, 0.5)
for pair in pairs:
    w[pair.i][pair.j] += pair.score
    w[pair.j][pair.i] += 1.0 - pair.score
# calculate_elos: the log-likelihood ELO fit from the Empirical ELO writeup.
elos = calculate_elos(w)
```
However, this doesn't take $\mathbb{E}[|C_O|]$ into account. In ELO, $W_{ij} + W_{ji}$ is intended to equal the "number of games" that players $i$ and $j$ played together. In the [log-likelihood optimization of ELO](https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/SyYxFaBWex#Empirical-ELO), the weight of the cross-entropy term is higher when more games have been played together. So, while the ratio $W_{ij} : W_{ji}$ is the only thing that matters for the target relative ELO between two players, the _influence_ that this $i, j$ relationship has on the final ELO scores depends on the _strength_ $W_{ij} + W_{ji}$ (relative to the rest of the matrix, that is; of course, scaling $W$ by a constant has no effect). I propose that $\mathbb{E}[|C_O|]$ should influence the strength $W_{ij} + W_{ji}$, but we should mathematically derive exactly what that relationship should be (linear? squared? square root?). Once we have such a formula, we should train our pairwise model to predict both $\mathbb{E}[C_O]$ and $\mathbb{E}[|C_O|]$. Until then, we'll just predict a single logit for $\mathbb{E}[C_O]$.
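As a placeholder until that relationship is derived, a purely illustrative linear weighting of the win matrix might look like the following (`pair.abs_confidence` is a hypothetical field standing in for the model's predicted $\mathbb{E}[|C_O|]$, and linear scaling is just one of the candidate relationships above):

```python=
import numpy as np

def weighted_win_matrix(pairs, n_docs: int) -> np.ndarray:
    """Like the win matrix above, but each pair's total weight W_ij + W_ji is scaled by E[|C_O|]."""
    w = np.zeros((n_docs, n_docs))
    np.fill_diagonal(w, 0.5)
    for pair in pairs:
        strength = pair.abs_confidence  # hypothetical field: predicted E[|C_O|] in [0, 1]
        w[pair.i][pair.j] += strength * pair.score
        w[pair.j][pair.i] += strength * (1.0 - pair.score)
    return w  # then, as before: elos = calculate_elos(w)
```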
We might also want to think of [ELO Adjustments](https://en.chessbase.com/post/the-elo-rating-system-correcting-the-expectancy-tables).