# WB-Elo Computation

## Summary

We aim to combine WildBench results and LMSYS Elo to better estimate the performance of a new model that is not on the LMSYS Chatbot Arena. First, we convert WildBench's individual scoring into virtual pairwise comparisons. Then, we use the LMSYS Elo as the starting Elo for models that are on Chatbot Arena, and 1000 for the others. Finally, we use WB's virtual pairwise results to simulate Arena-style competition and update the Elos. The final Elos are the WB-Elo ratings.

## Code links:
- https://huggingface.co/spaces/allenai/WildBench/blob/main/analysis_scripts/wb_elo_imitation.py

## Pipeline

### Step 0: Virtual Pairwise Comparisons

Ideally, we would use our pairwise evaluation template to evaluate each pair of models on WB's 1024 tasks, but that would be very expensive. With N models, there would be N*(N-1)/2 pairs, and each pair costs about $15 USD (e.g., roughly $18,000 USD for 50 models). To reduce costs and increase efficiency, we instead use the WB-Score results and convert them to pairwise comparisons. WB-Scores are computed individually for each model, at about $5 USD per model (e.g., $250 USD for 50 models).

For each pair of models, we go through the 1024 examples. For each example x, if Model A's score on x is higher than Model B's, we consider A the winner of this virtual pairwise comparison. To reduce noise, we set a margin M (M=3 for now): only when the score difference is at least M do we declare a winner; otherwise, the comparison is a tie. In this way, we can create up to N*(N-1)/2*1024 pairwise comparisons for computing WB-Elo.

### Step 1: Elo Initialization

We treat WB-Elo as a continuation of LMSYS Elo. For models that have an LMSYS Elo, we use that value as the initial rating. All other models start with an Elo of 1000.

### Step 2: WB-Elo Computation

Following the standard Elo update rule, we then run the Elo computation over WB's virtual pairwise comparisons.
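
To make the pipeline concrete, here is a minimal end-to-end sketch in Python. It is not the reference implementation (see `wb_elo_imitation.py` linked above): the K-factor of 4, the random shuffling of battles, and the input shapes (`scores` as per-model lists of per-example WB-Scores, `lmsys_elo` as a dict of initial ratings) are assumptions made for illustration.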
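```python
import random

K_FACTOR = 4     # assumed: a small K damps noise across many virtual battles
MARGIN = 3       # M from Step 0
INIT_ELO = 1000  # starting Elo for models not on Chatbot Arena (Step 1)

def expected_score(r_a, r_b):
    """Standard Elo expectation: P(A beats B) = 1 / (1 + 10^((R_B - R_A)/400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(elo, a, b, outcome, k=K_FACTOR):
    """One Elo update. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    e_a = expected_score(elo[a], elo[b])
    elo[a] += k * (outcome - e_a)
    elo[b] += k * ((1.0 - outcome) - (1.0 - e_a))  # zero-sum counterpart

def compute_wb_elo(scores, lmsys_elo, margin=MARGIN, seed=42):
    """
    scores:    {model_name: [per-example WB-Score floats]} (same example order)
    lmsys_elo: {model_name: float} for models already on Chatbot Arena
    """
    models = list(scores)
    # Step 1: LMSYS Elo where available, otherwise 1000.
    elo = {m: lmsys_elo.get(m, INIT_ELO) for m in models}

    # Step 0: build virtual pairwise battles from per-example score gaps.
    battles = []
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            for s_a, s_b in zip(scores[a], scores[b]):
                diff = s_a - s_b
                if diff >= margin:
                    outcome = 1.0      # A wins this virtual comparison
                elif diff <= -margin:
                    outcome = 0.0      # B wins
                else:
                    outcome = 0.5      # within the margin: tie
                battles.append((a, b, outcome))

    # Step 2: replay battles in a random order and apply Elo updates.
    random.Random(seed).shuffle(battles)
    for a, b, outcome in battles:
        update(elo, a, b, outcome)
    return elo
```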
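
Because the online Elo update is order-sensitive, one reasonable refinement is to run `compute_wb_elo` with several different `seed` values and average the resulting ratings; the reference script may handle ordering differently.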