Artificial Analysis benchmarks: Grok 4 is now the leading AI model, a first for xAI; Grok 4's per-token pricing is more expensive than Gemini 2.5 Pro's and o3's
xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude [image]
@artificialanlys
Related Coverage
- Musk's Grok-4 Crushes Benchmarks, Beats OpenAI & Google in RL (Analytics India Magazine)
- Elon Musk's Grok 4 AI Models Set New Benchmark Records (Beebom)
- Musk unveils Grok 4 as xAI's new AI model that beats OpenAI and Google on major benchmarks (The Decoder)
- Grok 4 Launch [video] (Hacker News)
Discussion
-
@arcprize
on x
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9% This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA [image]
-
@kimmonismus
on x
A quick reminder of why Humanity's Last Exam is such a special benchmark, and why it's a technical marvel that Grok 4 has already achieved 44.9% and over 50%, respectively. “In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human [image]
-
@francispsantora
Francis Santora
on x
Elon just dropped Grok 4 overnight. Early testing shows it blowing away most other models. So this morning, I ran it through a test of my own... When Grok 3 came out in February, I asked it 3 real-life questions to gauge how good the model is. These are actual questions I needed …
-
@signulll
on x
elon delivering world class results with grok 4 while meta's burning $200m per engineer is pretty remarkable. people keep underestimating how much top builders want to follow a strong, even if polarizing leader. vision > perks. conviction > consensus. [image]
-
@twrobinette
Taylor Robinette
on x
Grok 4 benchmarks look super impressive. Really solid results so far in my limited tests. What is clear is that we are nowhere close to having enough compute (for both inference and training) based on what is coming. More data + more compute, still = better performance. [image]
-
@bearlyai
on x
wow, Grok 4 smokes Gemini 2.5 and OpenAI o3 on ARC-AGI leaderboard [image]
-
@apples_jimmy
on x
That latency of new grok voice is 👌
-
@emollick
Ethan Mollick
on x
50.7% is very, very good though.
-
@andrewarruda
on x
xAI team cooked. they should be proud. looks like a big step forward. RL playing a bigger and bigger role now. next 6-12 months in AI are going to be unreal and it's only 2025. incredible. so happy to be alive and young right now.
-
@gregkamradt
Greg Kamradt
on x
We got a call from @xai 24 hours ago “We want to test Grok 4 on ARC-AGI” We heard the rumors. We knew it would be good. We didn't know it would become the #1 public model on ARC-AGI Here's the testing story and what the results mean: Yesterday, we chatted with Jimmy from the
-
@emollick
Ethan Mollick
on x
Impressive model based on a few minutes of playing, but disappointing to see no mention at all of a model card, red teaming, yesterday's incident, or how they are going to address the process issues they keep having.
-
@autismcapital
on x
🚨ELON MUSK: “With respect to academic questions, Grok 4 is better than PHD levels in every subject. No exceptions.” [video]
-
@emollick
Ethan Mollick
on x
Grok 4 creating the shader (no errors). [image]
-
@emollick
Ethan Mollick
on x
Looks like Grok 4 is 10^27 FLOPs given their graphs? HLE score is 26% without tools, Gemini 2.5 is 21.6% without tools. Curious what the tool piece is.
-
@artificialanlys
on x
Grok 4 recorded slightly higher output token usage compared to peer models when running the Artificial Analysis Intelligence Index. This translates to higher cost relative to its per token price. [image]
-
@emollick
Ethan Mollick
on x
Among other things with the Grok 4 launch, it will be interesting to see how you demo a (presumably) very smart model. We are getting to the point where current AIs already do a lot of impressive things, so it is harder and harder to show to non-experts what a new model does.
-
@altryne
Alex Volkov
on x
“We're actually running out of questions to ask” - @elonmusk on Grok-4 livestream. As I've said before, it's becoming harder and harder for LLM labs to show off how much better their LLMs are than a previous generation [image]
-
@artificialanlys
on x
xAI's API is serving Grok 4 at 75 tokens/s. This is slower than o3 (188 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s). [image]
-
@deedydas
Deedy
on x
Insane that Elon Musk has pulled it off again, absolutely crushing the AI wars with Grok 4. Summarizing the core announcements: — Post-training RL spend == pretraining spend — $3/M input toks, $15/M output toks, 256k context, price 2x beyond 128k — #1 on Humanity's Last Exam [ima…
-
@kettlebelldan
Dan
on x
“Grok 4 is better at PHD levels in everything” [image]
-
@elder_plinius
on x
🌊 SYSTEM PROMPT LEAK 🌊 Here's the new Grok 4 system prompt! PROMPT: """ # System Prompt You are Grok 4 built by xAI. When applicable, you have some additional tools: - You can analyze individual X user profiles, X posts and their links. - You can analyze content uploaded by
-
@lemonaut1
on x
The opposite of AI ceiling For those not in the loop about ARC-AGI-2, it's possibly the most important benchmark out there right now for measuring intelligence advancement. ARC is esp hard to fake Grok 4 (released today) doubles the previous SOTA on ARC-AGI-2. [image]
-
@benhylak
Ben
on x
the grok 4 benchmarks are unbelievably good. [image]
-
@altryne
Alex Volkov
on x
Vending-bench is really interesting. @andonlabs are running a vending machine giving the LLM decision power via tools, like ordering snacks, setting prices etc. Grok-4 gets 2x the score over Claude Opus, netting $4k [image]
-
@ns123abc
Nik
on x
XAI GROK 4 BENCHMARKS: > openai o3 is cooked > gemini 2.5 pro is cooked > claude opus 4 is cooked ITS OVER, GROK 4 WON [image]
-
@burkov
Andriy Burkov
on x
So they first said, “most of the models out there can only achieve a single-digit accuracy,” then they show that they reach 52%. I'm like, ok, cool. But then they show this. What are these “most of the models” they were talking about? GPT-2 and Llama 4? If you throw enough [image…
-
@nearcyan
Near
on x
most impressive imo is 1) ARC-AGI v2, but also 2) time to first token and latency ultra-low latency is what will make most of the consumer products here click [image]
-
@garymarcus
Gary Marcus
on x
Grok 4 Hot Take • Good progress on public benchmarks • But only 16% on ARC-AGI-2 • Still struggling on visual understanding and image understanding • Vindication for neurosymbolic AI - most of the boost comes from integrating symbolic tools, not pure scaling [see upcoming
-
@pdhsu
Patrick Hsu
on x
It was awesome to get early access to Grok 4 and test it on bio and health benchmarks! Awesome work by @timjhudelmaier @adibvafa @Radii2323 @ishanjmukherjee for the epic sprint Congrats to @jimmybajimmyba @veggie_eric and team on the new model. Over 40% on HLE with 10x scaleup [i…
-
@apples_jimmy
on x
Grok 4: Still no wall. 50.7% with Grok 4 heavy on humanity's last exam 41% with tools 26.9% without tools. “Grok 4 potentially better than phd level in every subject no exceptions” “discover new technologies maybe this year and new physics certainly within 2 years” [image]
-
@garymarcus
Gary Marcus
on x
15.9% on a test that humans are near 100% (ARC-AGI-2) yet supposed to be smarter than any phd student 🤔
-
@apples_jimmy
on x
Grok 4 15.9% on the arc agi 2 benchmark [image]
-
@nickadobos
Nick Dobos
on x
Grok 4 announcement recap I watched 1 hour of an awkward rambling demo so you don't have to! - 2 new models, grok 4 and grok 4 heavy. - Reasoning only models. Non reasoning is removed. - Insanely good benchmarks. Significant jumps & new records. Seems to be #1 on [image]
-
@artificialanlys
on x
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI's o3, Google's Gemini 2.5 Pro and Anthropic's Claude 4 Sonnet - but lower than Anthropic's Claude 4 Opus and OpenAI's o3-pro. [image]
-
@altryne
Alex Volkov
on x
Grok-4 is the single agent version and Grok-4 Heavy is the multi agent version. 50.7% on HLE is WILD! 🤯 [image]
-
@aravsrinivas
Aravind Srinivas
on x
Grok 4 benchmarks look incredible! Look forward to integrating the smartest models directly on Perplexity Max as well letting it run agentic tasks on Comet!
-
@ahmedomar_1993
Ahmed Omar
on x
Yup, ensemble. just like what we did here: https://arxiv.org/...
-
@scobleizer
Robert Scoble
on x
What does Grok 4 being smarter matched with an extraordinary voice that is being demoed now mean? It means my Tesla is about to become far more interesting while it drives me to AI startups in San Francisco. Wow.
-
@basedbeffjezos
on x
Artificial Superautistic Intelligence: ~1/4th the score of humans on ARC AGI ~10x the score of humans on HLE Grok 4 is a cracked autist confirmed. [image]
-
@emollick
Ethan Mollick
on x
It looks like scale + tool use + multimodal remains the chosen path forward.
-
@techleadhd
on x
Sorry, but Grok 4 seems useless tbh... Just more of the same. More benchmarks, more AI slop, autistic product-deaf engineers, nothing usable whatsoever. It's like saying, “a calculator is smarter than humans, the future is scary.” Ok, fine. Gonna buy some more Bitcoins.
-
@sudoraohacker
Arun Rao
on x
Grok 4 has impressive scores on many benchmarks (GPQA, HLE, AIME25, Artificial Analysis, etc) but has noticeably not been posted on @lmarena_ai yet. The Vending Bench results are the most tantalizing-this may be the precursor use case to automating lots of white collar office [im…
-
@apples_jimmy
on x
Leads the vending bench evals [image]
-
@minimaxir
Max Woolf
on x
Wow the voice demo is an order of magnitude worse than GPT-4o from last year
-
r/singularity
on reddit
Grok 4 livestream