# Leaderboard Rankings

This bonus material implements two different ways to construct LM Arena (formerly Chatbot Arena) style leaderboards from pairwise comparisons.

Both implementations take a list of pairwise preferences (left: winner, right: loser) from a JSON file via the `--path` argument. Here's an excerpt of the provided `votes.json` file:

```json
[
  ["GPT-5", "Claude-3"],
  ["GPT-5", "Llama-4"],
  ["Claude-3", "Llama-3"],
  ["Llama-4", "Llama-3"],
  ...
]
```
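Before computing any ratings, it can help to load the file and tally raw win counts as a sanity check. A minimal sketch (not one of the provided scripts; the file path and inline excerpt are illustrative):

```python
# Hypothetical sketch: load the votes and tally raw win counts
# as a sanity check before fitting any rating model.
import json
from collections import Counter

# Inline excerpt of votes.json; in practice you would use
# json.load(open(path)) with the path from the --path argument.
votes = json.loads("""
[["GPT-5", "Claude-3"], ["GPT-5", "Llama-4"],
 ["Claude-3", "Llama-3"], ["Llama-4", "Llama-3"]]
""")

# The left entry of each pair is the winner
wins = Counter(winner for winner, loser in votes)
for model, n in wins.most_common():
    print(f"{model}: {n} wins")
```

Raw win counts ignore opponent strength, which is exactly what the two rating methods below account for.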


Note: If you are not a uv user, replace `uv run ...py` with `python ...py` in the examples below.



## Method 1: Elo ratings

  • Implements the popular Elo rating method (inspired by chess rankings) that was originally used by LM Arena
  • See the main notebook for details
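The core Elo update can be sketched in a few lines. The constants below (initial rating 1000, K-factor 32, logistic scale 400) are common defaults and may differ from what `1_elo_leaderboard.py` actually uses:

```python
# Minimal Elo sketch. Assumptions: every model starts at 1000,
# K-factor 32, logistic scale 400 -- common defaults that may
# differ from the constants in 1_elo_leaderboard.py.

def elo_ratings(votes, init=1000.0, k=32.0):
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, init)
        r_l = ratings.setdefault(loser, init)
        # Expected score of the winner under the logistic Elo model
        expected_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400))
        # The winner gains what the loser loses, scaled by K
        ratings[winner] = r_w + k * (1.0 - expected_w)
        ratings[loser] = r_l - k * (1.0 - expected_w)
    return ratings

votes = [["GPT-5", "Claude-3"], ["GPT-5", "Llama-4"],
         ["Claude-3", "Llama-3"], ["Llama-4", "Llama-3"]]
ratings = elo_ratings(votes)
for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model:10s} {r:7.1f}")
```

Note that Elo is an online scheme: ratings are updated one vote at a time, so the final numbers depend on the order of the votes.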
```bash
uv run 1_elo_leaderboard.py --path votes.json
```

```
Leaderboard (Elo)
-----------------
 1. GPT-5       1095.9
 2. Claude-3    1058.7
 3. Llama-4      958.2
 4. Llama-3      887.2
```


## Method 2: Bradley-Terry model

  • Fits a Bradley-Terry model to the full set of pairwise preferences at once, the approach LM Arena later adopted in place of online Elo updates

```bash
uv run 2_bradley_terry_leaderboard.py --path votes.json
```

```
Leaderboard (Bradley-Terry)
---------------------------
 1. GPT-5       1140.6
 2. Claude-3    1058.7
 3. Llama-4      950.3
 4. Llama-3      850.4
```
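Under the Bradley-Terry model, model *i* beats model *j* with probability π_i / (π_i + π_j), and the strengths π are fit to all votes jointly. One classic way to fit them is Zermelo's iterative (minorization-maximization) scheme, sketched below. The pseudo-count `eps` and the Elo-like display scale are assumptions for this sketch; `2_bradley_terry_leaderboard.py` may use a different fitting procedure (e.g. logistic regression):

```python
# Bradley-Terry sketch via the classic MM (Zermelo) iteration.
# Assumption: eps pseudo-wins per model keep zero-win models finite.
import math
from collections import defaultdict

def bradley_terry(votes, iters=500, eps=0.5):
    models = sorted({m for pair in votes for m in pair})
    wins = {m: 0 for m in models}
    games = defaultdict(int)          # games[(i, j)]: matches between i and j
    for a, b in votes:
        wins[a] += 1
        games[(a, b)] += 1
        games[(b, a)] += 1
    p = {m: 1.0 for m in models}      # strengths pi, initialized uniformly
    for _ in range(iters):
        new_p = {}
        for i in models:
            # MM update: pi_i <- w_i / sum_j n_ij / (pi_i + pi_j)
            denom = sum(games[(i, j)] / (p[i] + p[j])
                        for j in models if games[(i, j)])
            new_p[i] = (wins[i] + eps) / denom if denom else p[i]
        # Fix the scale: geometric mean of the strengths = 1
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / g for m, v in new_p.items()}
    return p

votes = [["GPT-5", "Claude-3"], ["GPT-5", "Llama-4"],
         ["Claude-3", "Llama-3"], ["Llama-4", "Llama-3"]]
strengths = bradley_terry(votes)
# Elo-like display scale, 400 * log10(pi) + 1000 (a convention, not unique)
for m in sorted(strengths, key=strengths.get, reverse=True):
    print(f"{m:10s} {400 * math.log10(strengths[m]) + 1000:7.1f}")
```

Because the fit uses all votes at once, the result does not depend on vote order, unlike the online Elo updates of Method 1.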