This bonus material implements two different ways to construct LM Arena (formerly Chatbot Arena) style leaderboards from pairwise comparisons.
Both scripts read a list of pairwise preferences (left: winner, right: loser) from a JSON file passed via the `--path` argument. Here's an excerpt of the provided votes.json file:
```json
[
  ["GPT-5", "Claude-3"],
  ["GPT-5", "Llama-4"],
  ["Claude-3", "Llama-3"],
  ["Llama-4", "Llama-3"],
  ...
]
```

Note: If you are not a uv user, replace `uv run ...py` with `python ...py` in the examples below.
- Implements the popular Elo rating method (originally developed for chess rankings) that LM Arena initially used
- See the main notebook for details
```
➜ 03_leaderboards git:(main) ✗ uv run 1_elo_leaderboard.py --path votes.json

Leaderboard (Elo)
-----------------------
1. GPT-5      1095.9
2. Claude-3   1058.7
3. Llama-4     958.2
4. Llama-3     887.2
```
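The core of any Elo leaderboard is the sequential rating update: after each vote, the winner gains and the loser loses `K * (1 - expected score)`. The sketch below is a minimal, hypothetical illustration of that update rule (function names and the `K=32` / 1000-point starting values are assumptions for illustration, not necessarily what `1_elo_leaderboard.py` uses):

```python
def elo_update(r_winner, r_loser, k=32):
    """Return updated (winner, loser) ratings after one pairwise vote."""
    # Expected score of the winner under the Elo logistic model
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

# Replay the votes in order; unseen models start at 1000 points
votes = [["GPT-5", "Claude-3"], ["GPT-5", "Llama-4"],
         ["Claude-3", "Llama-3"], ["Llama-4", "Llama-3"]]
ratings = {}
for winner, loser in votes:
    rw = ratings.get(winner, 1000.0)
    rl = ratings.get(loser, 1000.0)
    ratings[winner], ratings[loser] = elo_update(rw, rl)
```

Because the update is sequential, the final ratings depend on the order of the votes, which is one reason LM Arena later moved to the Bradley-Terry model below.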
- Implements a Bradley-Terry model, similar to the newer LM Arena leaderboard, as described in the official paper (*Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference*)
- As on the LM Arena leaderboard, the fitted scores are rescaled to resemble the original Elo scores
- The code here fits the model with PyTorch's Adam optimizer (chosen for code familiarity and readability)
```
➜ 03_leaderboards git:(main) ✗ uv run 2_bradley_terry_leaderboard.py --path votes.json

Leaderboard (Bradley-Terry)
-----------------------------
1. GPT-5      1140.6
2. Claude-3   1058.7
3. Llama-4     950.3
4. Llama-3     850.4
```
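Unlike the sequential Elo updates, the Bradley-Terry model assigns each model a latent strength `s_i` with `P(i beats j) = sigmoid(s_i - s_j)` and fits all strengths jointly by maximizing the likelihood of the observed votes. The sketch below is a hypothetical, simplified version of such a fit using PyTorch's Adam optimizer (all variable names, the learning rate, and the step count are illustrative assumptions, not the script's actual values):

```python
import torch

votes = [["GPT-5", "Claude-3"], ["GPT-5", "Llama-4"],
         ["Claude-3", "Llama-3"], ["Llama-4", "Llama-3"]]
models = sorted({m for pair in votes for m in pair})
idx = {m: i for i, m in enumerate(models)}

winners = torch.tensor([idx[w] for w, _ in votes])
losers = torch.tensor([idx[l] for _, l in votes])

# One latent strength per model; all start at 0
strengths = torch.zeros(len(models), requires_grad=True)
optimizer = torch.optim.Adam([strengths], lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    margin = strengths[winners] - strengths[losers]
    # Negative log-likelihood of the observed wins under P(win) = sigmoid(margin)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    loss.backward()
    optimizer.step()

# Rescale to Elo-like numbers centered at 1000 (scale factor 400 / ln 10)
with torch.no_grad():
    scale = 400 / torch.log(torch.tensor(10.0))
    scores = 1000 + scale * (strengths - strengths.mean())
```

Because the strengths are fit jointly over all votes, the result is independent of vote order, which is the main practical advantage over the sequential Elo updates.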