Chatbot Arena Shenanigans?

AI leaderboards are not all they are cracked up to be — research shows some may favor large corporations and be subject to being gamed.

Nick Bild
AI leaderboards are not always what they appear to be (📷: S. Singh et al.)

Whenever some type of reward is at stake, you will find that lots of people try to take a shortcut to get the reward without all of the hard work that would normally be involved. This starts early, with students looking for loopholes that allow them to earn good grades without doing any more work than is absolutely necessary. Later on, many of those same people will be scouring the tax code to look for loopholes that help to keep more of their money in their own pockets.

In some cases, the individuals who come up with these types of life hacks are guilty of nothing more than being clever or efficient. After all, they are working within the bounds of the established rules. But in other cases, gaming the system is just plain old cheating. Unfortunately, the latter situation appears to be happening on the leaderboards that rank large language models (LLMs), where a high ranking can make the difference between being the next big thing and instant obsolescence.

What is truth?

A recent study conducted by researchers at Cohere Labs, Princeton, Stanford, and MIT has raised serious concerns about Chatbot Arena, a popular platform used to rank the performance of AI systems, particularly LLMs. Created in 2023, Chatbot Arena allows users to compare two anonymous model responses to a prompt and vote for the better one. While this format aims to reflect real-world use cases, researchers now say the leaderboard may be deeply flawed.
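Under the hood, arena-style leaderboards aggregate those head-to-head votes into a single rating per model. A minimal sketch of an Elo-style update, the scheme Chatbot Arena originally used (it has since moved to a Bradley-Terry model; the K-factor and starting rating below are illustrative assumptions, not the platform's actual parameters):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool,
           k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two models start at 1000; model A wins a single user vote.
a, b = update(1000.0, 1000.0, a_won=True)
# a rises to 1016.0, b falls to 984.0 — the update is zero-sum.
```

One consequence of this kind of scheme is that every extra battle a model appears in refines its rating, which is why unequal sampling rates across models matter so much.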

According to the authors of the study — some of whom have submitted open-weight models themselves — Chatbot Arena's evaluation process appears to favor a small group of major AI providers like Meta, OpenAI, Google, and Amazon. These companies are reportedly allowed to test multiple private versions of their models before choosing the best-performing one to present publicly. This selective disclosure gives them a significant advantage, allowing them to optimize for the leaderboard without demonstrating actual improvements in general model quality.

The study uncovered that Meta tested 27 separate LLM variants in the lead-up to its Llama 4 release, benefiting from a behind-the-scenes process that smaller or open-source developers do not have access to. Compounding the issue, proprietary models are sampled more frequently in battles and are less likely to be silently removed from the leaderboard. The result is that these providers collect far more user feedback data with which to tune and improve their models.

Unfair advantages have got to go

Estimates from the report show that OpenAI and Google have received roughly 20% each of all Arena feedback data, while 83 open-weight models together received less than 30% of the total. This data imbalance leads to noticeable performance differences — the team found that simply increasing access to Arena data from 0% to 70% more than doubled a model's win rate on a standardized test set.

As it stands, Chatbot Arena may no longer be a level playing field — and in a rapidly evolving industry, that could skew not just rankings, but the future of AI research itself. Despite the criticism, the researchers acknowledge the immense effort involved in running Chatbot Arena and believe the problems stem from gradual shifts rather than malicious intent. As such, they would like to help right the ship, and toward that goal they have shared specific recommendations with the organizers to restore fairness and scientific accuracy to their benchmarks.
