
LMArena says it is updating its leaderboard policies after a Llama 4 Maverick version, which Meta said in fine print is not public, secured the number two spot

With Llama 4, Meta fudged benchmarks to make its new AI model appear better than the competition.

The Verge · Kylie Robison

Discussion

  • @viticci.macstories.net Federico Viticci on bluesky
    Surprising absolutely no one, Meta - a company that doesn't even know how to spell “taste” - got caught fudging their Llama 4 benchmarks.  —  😂  —  www.theverge.com/meta/645012/ ...
  • @prietschka Paul Rietschka on bluesky
    Meta knowingly trains on proprietary data and games benchmarks.  —  Great actor in the sector.
  • @kylierobison.com Kylie Robison on bluesky
    this is what we in the biz like to call an “uh oh” www.theverge.com/meta/645012/ ...  [image]
  • @lmarena_ai @lmarena_ai on x
    We've seen questions from the community about the latest release of Llama-4 on Arena.  To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review.  This includes user prompts, model responses, and user preferences.  (link in next tweet)  [For how pairwise battle records like these turn into a ranking, see the scoring sketch after this discussion.]
  • @mikeisaac Rat King on x
    why is meta being stupid about this release? fumbling goodwill of the past and doing some sneaky marketing around how they tested it. again, the average human isn't thinking about this right now but the ones who are in the AI community find this very dumb
  • @kyliebytes Kylie Robison on x
    this is what we in the biz like to call an “uh oh” [image]
  • @alexeheath Alex Heath on x
    “Meta's interpretation of our policy did not match what we expect from model providers”
  • @ph_singer Philipp Singer on x
    I have been saying for months that it is quite trivial to optimize specifically for lmsys leaderboard. There is tons of data out there, there are kaggle competitions and there are other ways to get feedback data. The same thing happens for all benchmarks that are out long enough.
  • @nrehiew_ @nrehiew_ on x
    These examples are extremely damning on the utility of Chatbot arena as a serious benchmark. Look through all the examples that Maverick won, and it's slop after slop after slop. This is the nonsense you are optimizing for if you are trying to goodhart lmsys. Let's be serious [im…
  • @kyliebytes Kylie Robison on x
    i chatted with @simonw about whats going on with meta's llama 4 release - specifically, whats going on with its performance in lmarena https://www.theverge.com/...
  • @tomwarren Tom Warren on x
    Meta is so desperate to be seen as a leader in AI that it's been caught cheating in AI benchmarks. Reminds me of the time Facebook artificially inflated its video views to con advertisers https://www.theverge.com/...
  • @casper_hansen_ Casper Hansen on x
    Meta finetuned a model specifically for LM Arena and didn't tell them?!😱 - Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preferences
  • @vikhyatk Vik on x
    This is the clearest evidence that no one should take these rankings seriously. In this example it's super yappy and factually inaccurate, and yet the user voted for Llama 4. The rest aren't any better. [image]
  • r/singularity r on reddit
    Meta got caught gaming AI benchmarks
  • r/artificial r on reddit
    Meta got caught gaming AI benchmarks
  • r/LocalLLaMA r on reddit
    LM Arena confirm that the version of Llama-4 Maverick listed on the arena is a “customized model to optimize for human preference”
  • r/OpenAI r on reddit
    Meta got caught gaming AI benchmarks for Llama 4
  • @quinnypig.com Corey Quinn on bluesky
    Yes, when I think “paragon of business ethics,” I think of Facebook.  [embedded post]
  • @ahmad_al_dahle Ahmad Al-Dahle on x
    ...That said, we're also hearing some reports of mixed quality across different services.  Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in.  We'll keep working through our bug fixes and …
  • @thexeophon @thexeophon on x
    Llama 4 on LMsys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself. META did not do a specific deployment / system prompt just for LMsys, did they? 👀
  • @paulgauthier Paul Gauthier on x
    Llama 4 Maverick scored 16% on the aider polyglot coding benchmark. https://aider.chat/... [image]
  • @artificialanlys @artificialanlys on x
    Llama 4 independent evals: Maverick (402B total, 17B active) beats Claude 3.7 Sonnet, trails DeepSeek V3 but more efficient; Scout (109B total, 17B active) in-line with GPT-4o mini, ahead of Mistral Small 3.1 We have independently benchmarked Scout and Maverick as scoring 36 and …
  • @zimmskal Markus Zimmermann on x
    Preliminary results for Meta's Llama v4 for DevQualityEval v1.0 DO NOT LOOK GOOD 😱😿 It seems that both models TANKED in Java, which is a big part of the eval. Good in Go and Ruby but not TOP10 good. Meta: Llama v4 Scout 109B - 🏁 Overall score 62.53% mid-range - 🐕‍🦺 With [image]
  • @chasebrowe32432 Chase Brower on x
    This is the first time for any major LLM that I'm genuinely thinking they just straight up trained on the benchmark answers for the mainline benchmarks Llama 4 is failing spectacularly on like every 3rd party bench i've seen
  • @suchenzang Susan Zhang on x
    4D chess move 🧐: use llama4 experimental to hack lmsys, expose the slop preference, and finally discredit the entire ranking system (lather rinse repeat above for academic benchmark maxing too)
  • @yuchenj_uw Yuchen Jin on x
    If Meta actually did this for Llama 4 training to maximize benchmark scores, it's fucked. [image]
  • @burkov Andriy Burkov on x
    “We've also heard claims that we trained on test sets — that's simply not true, and we would never do that.”  No one said you trained on the test set.  What they said is that you seem to have finetuned to benchmarks.  It's especially obvious when you look at the Elo rating and th…
  • @minchoi Min Choi on x
    Yikes. Llama 4 benchmarks looked insane but something feels off. Reddit leaks claim Meta cooked it. Here's what people are saying: [image]
  • @yacinemtb Kache on x
    BREAKING: pseudonymous internet user believes an anonymous internet user. I mean, who would go on the internet and tell lies?
  • @kimmonismus @kimmonismus on x
    Doesn't look all too good for Llama 4. [image]
  • @suchenzang Susan Zhang on x
    > Company leadership suggested blending test sets from various benchmarks during the post-training process If this is actually true for Llama-4, I hope they remember to cite previous work from FAIR (Llama-1 and https://arxiv.org/...) for this unique approach! 🙏 [image]
  • @abcampbell @abcampbell on x
    tech bros are learning something finance bros have known for millennia: quants are going to train on the test. they can't help it. obvious to anyone actually using these models
  • @kimmonismus @kimmonismus on x
    If it is true that Llama, i.e. Meta, cheated on the benchmarks, it would be an unprecedented reputational loss. Currently, after the first tests, the mood seems rather mediocre anyway. [image]
  • @hu_yifei Yifei Hu on x
    @Yuchenj_UW Another person said: “I participated in the data mix part for sft and rl. I am not aware of such cases.” Let the bullets fly for a while. [image]
  • @vibagor44145276 @vibagor44145276 on x
    The linked post is not true. There are indeed issues with Llama 4, from both the partner side (inference partners barely had time to prep. We sent out a few transformers wheels/vllm wheels mere days before release) and the model side. But there was no such training on test set.
  • @natolambert Nathan Lambert on x
    Seems like Llama 4's reputation is maybe irreparably tarnished by having a separate unreleased model that was overfit to LMArena. Actual model is good, but shows again how crucial messaging and details are.
  • @joshclemm Josh Clemm on x
    Trying to parse Meta's Llama 4 release this weekend? I felt this was a great writeup. In short: Meta's Llama 4 release was very different than past releases, with some odd timing and a different strategy than before. - They added three Mixture-of-Experts models: Scout
  • @natolambert Nathan Lambert on x
    Llama 4 was a messy release: unreleased finetunes boosting scores, rumors of training on test, released on a weekend, etc As (open) models are commoditized / competition grows, what is the role of Meta's Llama efforts in the future? Should they continue? https://www.interconnects…
  • @chris_j_paxton Chris Paxton on x
    All this discourse is making me want a survey on who the hell lmsys raters actually are. I thought the lmsys version was unbearable and almost never rated it positively...
  • @eliza_luth Lu Liu on x
    I have been only trusting the public's choices, my own tests and the judgement of trusted researchers
  • @natolambert Nathan Lambert on x
    Okay Llama 4 is def a little cooked lol, what is this yap city [image]
  • @wzhao_nlp Wenting Zhao on x
    Time to revisit our paper: Open community-driven evaluation platforms could be corrupted from a few sources of bad annotations, making their results not as trustworthy as we'd like. https://arxiv.org/... [image]
  • @suchenzang Susan Zhang on x
    how did this llama4 score so high on lmsys?? i'm still buckling up to understand qkv through family reunions and weighted values for loving cats... [image]
  • @techdevnotes @techdevnotes on x
    for some reason, the Llama 4 model in Arena uses a lot more emojis; on together.ai, it seems better: [image]
  • @emollick Ethan Mollick on x
    Hopefully the Llama 4 models improve rapidly, as they did in the Llama 3 generation. The initial launch got pretty mixed feedback (including from me) but a good open weights model from Meta would be very useful for many people.
  • @ylecun Yann LeCun on x
    Some clarifications about Llama-4.
  • @omarsar0 Elvis on x
    Thanks for clarifying this.  Maybe some official docs/guide (prompting/usage tips, recommended settings, error expectations, areas/use cases to apply and how to apply, etc) would be helpful here.  I am aware of model cards, prompting guides but I think a lot of folks are running …
  • @natolambert Nathan Lambert on x
    > be me > be zuck > need llama 4 to land > send a model/prompt to LMSYS to get a top1 score, cringe be damned > release a different model as “open source” > think people won't find out even with weights
  • r/LocalLLaMA r on reddit
    “Serious issues in Llama 4 training.  I Have Submitted My Resignation to GenAI”
  • r/LocalLLaMA r on reddit
    “...we're also hearing some reports of mixed quality across different services.  Since we dropped the models as soon as they were ready …
  • r/LocalLLaMA r on reddit
    Meta Leaker refutes the training on test set claim