LMArena says it is updating its leaderboard policies after a Llama 4 Maverick version, which Meta said in fine print is not public, secured the number two spot
With Llama 4, Meta fudged benchmarks to make its new AI model appear better than the competition.
The Verge Kylie Robison
Related Coverage
- Llama 4's Rocky Debut The Information
- Meta scrambling to defend its AI after Llama 4 benchmark bungle Sherwood News · Jon Keegan
- Meta Denies Any Wrongdoing in Llama 4 Benchmarks Analytics India Magazine · Siddharth Jindal
- The Sequence Knowledge #527: What Types of AI Benchmarks Should You Care About? TheSequence · Jesus Rodriguez
- Meta pushes back on Llama 4 benchmark cheating allegations Neowin · David Uzondu
- Meta's Llama drama Platformer · Casey Newton
- Meta defends Llama 4 release against ‘reports of mixed quality,’ blames bugs VentureBeat · Carl Franzen
- Not honest ol' Meta! (falls on fainting couch) — https://www.theverge.com/... @kyleford@hachyderm.io · Kyle Ford
- Meta got caught gaming AI benchmarks Hacker News
- Meta Got Caught Gaming AI Benchmarks Slashdot · Msmash
- Meta's benchmarks for its new AI models are a bit misleading TechCrunch · Kyle Wiggers
- Welcome to Llama-4-Maverick-03-26-Experimental battles Hugging Face
- Deep Learning, Deep Scandal Marcus on AI · Gary Marcus
- Meta Unleashes New Llama 4 AI Models AIwire · Ali Azhar
- Meta's Llama 4 Models Now Available on Krutrim Cloud Analytics India Magazine · Siddharth Jindal
- Meta launches AI family Llama 4 — but the EU doesn't get everything Computerworld · Mikael Markander
- From a political shift to a more powerful AI: Everything to know about Meta's Llama 4 models Euronews · Pascale Davies
- Meta is coming for Google and OpenAI with its fresh Llama 4 models Android Central · Jay Bonggolto
- Meet Meta's Llama 4: Bigger brains, sharper vision, more modalities Capacity Media · Ben Wodecki
- Meta Dropped Llama 4: What to Know About the Two New AI Models CNET · Katelyn Chedraoui
- A Chinese whistleblower from Meta's AI team on Llama 4: After repeated training … Tony Peng on Substack
- Meta's Llama 4 models show promise on standard tests, but struggle with long-context tasks The Decoder · Matthias Bastian
- Llama 4: Did Meta just push the panic button? Interconnects · Nathan Lambert
- “Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI” Hacker News
Discussion
-
@viticci.macstories.net
Federico Viticci
on bluesky
Surprising absolutely no one, Meta - a company that doesn't even know how to spell “taste” - got caught fudging their Llama 4 benchmarks. — 😂 — www.theverge.com/meta/645012/ ...
-
@prietschka
Paul Rietschka
on bluesky
Meta knowingly trains on proprietary data and games benchmarks. — Great actor in the sector.
-
@kylierobison.com
Kylie Robison
on bluesky
this is what we in the biz like to call an “uh oh” www.theverge.com/meta/645012/ ... [image]
-
@lmarena_ai
@lmarena_ai
on x
We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences. (link in next tweet)
-
@mikeisaac
Rat King
on x
why is meta being stupid about this release? fumbling goodwill of the past and doing some sneaky marketing around how they tested it. again, the average human isn't thinking about this right now but the ones who are in the AI community find this very dumb
-
@kyliebytes
Kylie Robison
on x
this is what we in the biz like to call an “uh oh” [image]
-
@alexeheath
Alex Heath
on x
“Meta's interpretation of our policy did not match what we expect from model providers”
-
@ph_singer
Philipp Singer
on x
I have been saying for months that it is quite trivial to optimize specifically for lmsys leaderboard. There is tons of data out there, there are kaggle competitions and there are other ways to get feedback data. The same thing happens for all benchmarks that are out long enough.
-
@nrehiew_
@nrehiew_
on x
These examples are extremely damning on the utility of Chatbot arena as a serious benchmark. Look through all the examples that Maverick won, and it's slop after slop after slop. This is the nonsense you are optimizing for if you are trying to goodhart lmsys. Let's be serious [im…
-
@kyliebytes
Kylie Robison
on x
i chatted with @simonw about whats going on with meta's llama 4 release - specifically, whats going on with its performance in lmarena https://www.theverge.com/...
-
@tomwarren
Tom Warren
on x
Meta is so desperate to be seen as a leader in AI that it's been caught cheating in AI benchmarks. Reminds me of the time Facebook artificially inflated its video views to con advertisers https://www.theverge.com/...
-
@casper_hansen_
Casper Hansen
on x
Meta finetuned a model specifically for LM Arena and didn't tell them?!😱 - Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preferences
-
@vikhyatk
Vik
on x
This is the clearest evidence that no one should take these rankings seriously. In this example it's super yappy and factually inaccurate, and yet the user voted for Llama 4. The rest aren't any better. [image]
-
r/singularity
r
on reddit
Meta got caught gaming AI benchmarks
-
r/artificial
r
on reddit
Meta got caught gaming AI benchmarks
-
r/LocalLLaMA
r
on reddit
LM Arena confirm that the version of Llama-4 Maverick listed on the arena is a “customized model to optimize for human preference”
-
r/OpenAI
r
on reddit
Meta got caught gaming AI benchmarks for Llama 4
-
@quinnypig.com
Corey Quinn
on bluesky
Yes, when I think “paragon of business ethics,” I think of Facebook. [embedded post]
-
@ahmad_al_dahle
Ahmad Al-Dahle
on x
...That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in. We'll keep working through our bug fixes and …
-
@thexeophon
@thexeophon
on x
Llama 4 on LMsys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself. META did not do a specific deployment / system prompt just for LMsys, did they? 👀
-
@paulgauthier
Paul Gauthier
on x
Llama 4 Maverick scored 16% on the aider polyglot coding benchmark. https://aider.chat/... [image]
-
@artificialanlys
@artificialanlys
on x
Llama 4 independent evals: Maverick (402B total, 17B active) beats Claude 3.7 Sonnet, trails DeepSeek V3 but more efficient; Scout (109B total, 17B active) in-line with GPT-4o mini, ahead of Mistral Small 3.1 We have independently benchmarked Scout and Maverick as scoring 36 and …
-
@zimmskal
Markus Zimmermann
on x
Preliminary results for Meta's Llama v4 for DevQualityEval v1.0 DO NOT LOOK GOOD 😱😿 It seems that both models TANKED in Java, which is a big part of the eval. Good in Go and Ruby but not TOP10 good. Meta: Llama v4 Scout 109B - 🏁 Overall score 62.53% mid-range - 🐕🦺 With [image]
-
@chasebrowe32432
Chase Brower
on x
This is the first time for any major LLM that I'm genuinely thinking they just straight up trained on the benchmark answers for the mainline benchmarks Llama 4 is failing spectacularly on like every 3rd party bench i've seen
-
@suchenzang
Susan Zhang
on x
4D chess move 🧐: use llama4 experimental to hack lmsys, expose the slop preference, and finally discredit the entire ranking system (lather rinse repeat above for academic benchmark maxing too)
-
@yuchenj_uw
Yuchen Jin
on x
If Meta actually did this for Llama 4 training to maximize benchmark scores, it's fucked. [image]
-
@burkov
Andriy Burkov
on x
“We've also heard claims that we trained on test sets — that's simply not true, and we would never do that.” No one said you trained on the test set. What they said is that you seem to have finetuned to benchmarks. It's especially obvious when you look at the Elo rating and th…
-
@minchoi
Min Choi
on x
Yikes. Llama 4 benchmarks looked insane but something feels off. Reddit leaks claim Meta cooked it. Here's what people are saying: [image]
-
@yacinemtb
Kache
on x
BREAKING: pseudonymous internet user believes an anonymous internet user. I mean, who would go on the internet and tell lies?
-
@kimmonismus
@kimmonismus
on x
Doesn't look all too good for Llama 4. [image]
-
@suchenzang
Susan Zhang
on x
> Company leadership suggested blending test sets from various benchmarks during the post-training process If this is actually true for Llama-4, I hope they remember to cite previous work from FAIR (Llama-1 and https://arxiv.org/...) for this unique approach! 🙏 [image]
-
@abcampbell
@abcampbell
on x
tech bros are learning something finance bros have known for millennia: quants are going to train on the test. they can't help it. obvious to anyone actually using these models
-
@kimmonismus
@kimmonismus
on x
If it is true that Llama, i.e. Meta, cheated on the benchmarks, it would be an unprecedented blow to its image. Currently, it seems the mood after the first tests is rather mediocre anyway. [image]
-
@hu_yifei
Yifei Hu
on x
@Yuchenj_UW Another person said: “I participated in the data mix part for sft and rl. I am not aware of such cases.” Let the bullet fly a while longer (i.e., let's wait and see). [image]
-
@vibagor44145276
@vibagor44145276
on x
The linked post is not true. There are indeed issues with Llama 4, from both the partner side (inference partners barely had time to prep. We sent out a few transformers wheels/vllm wheels mere days before release) and the model side. But there was no such training on test set.
-
@natolambert
Nathan Lambert
on x
Seems like Llama 4's reputation is maybe irreparably tarnished by having a separate unreleased model that was overfit to LMArena. Actual model is good, but shows again how crucial messaging and details are.
-
@joshclemm
Josh Clemm
on x
Trying to parse Meta's Llama 4 release this weekend? I felt this was a great writeup. In short: Meta's Llama 4 release was very different than past releases, with some odd timing and a different strategy than before. - They added three Mixture-of-Experts models: Scout
-
@natolambert
Nathan Lambert
on x
Llama 4 was a messy release: unreleased finetunes boosting scores, rumors of training on test, released on a weekend, etc As (open) models are commoditized / competition grows, what is the role of Meta's Llama efforts in the future? Should they continue? https://www.interconnects…
-
@chris_j_paxton
Chris Paxton
on x
All this discourse is making me want a survey on who the hell lmsys raters actually are. I thought the lmsys version was unbearable and almost never rated it positively...
-
@eliza_luth
Lu Liu
on x
I have been only trusting the public's choices, my own tests and the judgement of trusted researchers
-
@natolambert
Nathan Lambert
on x
Okay Llama 4 is def a little cooked lol, what is this yap city [image]
-
@wzhao_nlp
Wenting Zhao
on x
Time to revisit our paper: Open community-driven evaluation platforms could be corrupted from a few sources of bad annotations, making their results not as trustworthy as we'd like. https://arxiv.org/... [image]
-
@suchenzang
Susan Zhang
on x
how did this llama4 score so high on lmsys?? i'm still buckling up to understand qkv through family reunions and weighted values for loving cats... [image]
-
@techdevnotes
@techdevnotes
on x
for some reason, the Llama 4 model in Arena uses a lot more emojis. on together.ai, it seems better: [image]
-
@emollick
Ethan Mollick
on x
Hopefully the Llama 4 models improve rapidly, as they did in the Llama 3 generation. The initial launch got pretty mixed feedback (including from me) but a good open weights model from Meta would be very useful for many people.
-
@ylecun
Yann LeCun
on x
Some clarifications about Llama-4.
-
@omarsar0
Elvis
on x
Thanks for clarifying this. Maybe some official docs/guide (prompting/usage tips, recommended settings, error expectations, areas/use cases to apply and how to apply, etc) would be helpful here. I am aware of model cards, prompting guides but I think a lot of folks are running …
-
@natolambert
Nathan Lambert
on x
> be me > be zuck > need llama 4 to land > send a model/prompt to LMSYS to get a top1 score, cringe be damned > release a different model as “open source” > think people won't find out even with weights
-
r/LocalLLaMA
r
on reddit
“Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI”
-
r/LocalLLaMA
r
on reddit
“...we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready …
-
r/LocalLLaMA
r
on reddit
Meta Leaker refutes the training on test set claim