Artificial Analysis benchmarks: Grok 4 is now the leading AI model, a first for xAI; Grok 4's per-token price is higher than Gemini 2.5 Pro's and o3's

xAI gave us early access to Grok 4 - and the results are in. Grok 4 is now the leading AI model. We have run our full suite of benchmarks and Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 at 70, Google Gemini 2.5 Pro at 70, Anthropic Claude [image]

@artificialanlys 2025-07-10

Discussion

@arcprize @arcprize on x
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9% This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA [image]
@kimmonismus @kimmonismus on x
A quick reminder of why Humanity's Last Exam is such a special benchmark, and why it's a technical marvel that Grok 4 has already achieved 44.9% and over 50%, respectively. “In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human [image]
@francispsantora Francis Santora on x
Elon just dropped Grok 4 overnight. Early testing shows it blowing away most other models. So this morning, I ran it through a test of my own... When Grok 3 came out in February, I asked it 3 real-life questions to gauge how good the model is. These are actual questions I needed …
@signulll @signulll on x
elon delivering world class results with grok 4 while meta's burning $200m per engineer is pretty remarkable. people keep underestimating how much top builders want to follow a strong, even if polarizing leader. vision > perks. conviction > consensus. [image]
@twrobinette Taylor Robinette on x
Grok 4 benchmarks look super impressive. Really solid results so far in my limited tests. What is clear is that we are nowhere close to having enough compute (for both inference and training) based on what is coming. More data + more compute, still = better performance. [image]
@bearlyai @bearlyai on x
wow, Grok 4 smokes Gemini 2.5 and OpenAI o3 on ARC-AGI leaderboard [image]
@apples_jimmy @apples_jimmy on x
That latency of new grok voice is 👌
@emollick Ethan Mollick on x
50.7% is very, very good though.
@andrewarruda @andrewarruda on x
xAI team cooked. they should be proud. looks like a big step forward. RL playing a bigger and bigger role now. next 6-12 months in AI are going to be unreal and it's only 2025. incredible. so happy to be alive and young right now.
@gregkamradt Greg Kamradt on x
We got a call from @xai 24 hours ago “We want to test Grok 4 on ARC-AGI” We heard the rumors. We knew it would be good. We didn't know it would become the #1 public model on ARC-AGI Here's the testing story and what the results mean: Yesterday, we chatted with Jimmy from the
@emollick Ethan Mollick on x
Impressive model based on a few minutes of playing, but disappointing to see no mention at all of a model card, red teaming, yesterday's incident, or how they are going to address the process issues they keep having.
@autismcapital @autismcapital on x
🚨ELON MUSK: “With respect to academic questions, Grok 4 is better than PHD levels in every subject. No exceptions.” [video]
@emollick Ethan Mollick on x
Grok 4 creating the shader (no errors). [image]
@emollick Ethan Mollick on x
Looks like Grok 4 is 10^27 FLOPs given their graphs? HLE score is 26% without tools, Gemini 2.5 is 21.6% without tools. Curious what the tool piece is.
@artificialanlys @artificialanlys on x
Grok 4 recorded slightly higher output token usage compared to peer models when running the Artificial Analysis Intelligence Index. This translates to higher cost relative to its per token price. [image]
@emollick Ethan Mollick on x
Among other things with the Grok 4 launch, it will be interesting to see how you demo a (presumably) very smart model. We are getting to the point where current AIs already do a lot of impressive things, so it is harder and harder to show to non-experts what a new model does.
@altryne Alex Volkov on x
“We're actually running out of questions to ask” - @elonmusk on Grok-4 livestream. As I've said before, it's becoming harder and harder for LLM labs to show off how much better their LLMs are than a previous generation [image]
@artificialanlys @artificialanlys on x
xAI's API is serving Grok 4 at 75 tokens/s. This is slower than o3 (188 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s). [image]
@deedydas Deedy on x
Insane that Elon Musk has pulled it off again, absolutely crushing the AI wars with Grok 4. Summarizing the core announcements: - Post-training RL spend == pretraining spend - $3/M input toks, $15/M output toks, 256k context, price 2x beyond 128k - #1 on Humanity's Last Exam [ima…
@kettlebelldan Dan on x
“Grok 4 is better at PHD levels in everything” [image]
@elder_plinius @elder_plinius on x
🌊 SYSTEM PROMPT LEAK 🌊 Here's the new Grok 4 system prompt! PROMPT: """ # System Prompt You are Grok 4 built by xAI. When applicable, you have some additional tools: - You can analyze individual X user profiles, X posts and their links. - You can analyze content uploaded by
@lemonaut1 @lemonaut1 on x
The opposite of AI ceiling For those not in the loop about ARC-AGI-2, it's possibly the most important benchmark out there right now for measuring intelligence advancement. ARC is esp hard to fake Grok 4 (released today) doubles the previous SOTA on ARC-AGI-2. [image]
@benhylak Ben on x
the grok 4 benchmarks are unbelievably good. [image]
@altryne Alex Volkov on x
Vending-bench is really interesting. @andonlabs are running a vending machine giving the LLM decision power via tools, like ordering snacks, setting prices etc. Grok-4 gets 2x the score over Claude Opus, netting $4k [image]
@ns123abc Nik on x
XAI GROK 4 BENCHMARKS: > openai o3 is cooked > gemini 2.5 pro is cooked > claude opus 4 is cooked ITS OVER, GROK 4 WON [image]
@burkov Andriy Burkov on x
So they first said, “most of the models out there can only achieve a single-digit accuracy,” then they show that they reach 52%. I'm like, ok, cool. But then they show this. What are these “most of the models” they were talking about? GPT-2 and Llama 4? If you throw enough [image…
@nearcyan Near on x
most impressive imo is 1) ARC-AGI v2, but also 2) time to first token and latency ultra-low latency is what will make most of the consumer products here click [image]
@garymarcus Gary Marcus on x
Grok 4 Hot Take • Good progress on public benchmarks • But only 16% on ARC-AGI-2 • Still struggling on visual understanding and image understanding • Vindication for neurosymbolic AI - most of the boost comes from integrating symbolic tools, not pure scaling [see upcoming
@pdhsu Patrick Hsu on x
It was awesome to get early access to Grok 4 and test it on bio and health benchmarks! Awesome work by @timjhudelmaier @adibvafa @Radii2323 @ishanjmukherjee for the epic sprint Congrats to @jimmybajimmyba @veggie_eric and team on the new model. Over 40% on HLE with 10x scaleup [i…
@apples_jimmy @apples_jimmy on x
Grok 4: Still no wall. 50.7% with Grok 4 heavy on humanity's last exam 41% with tools 26.9% without tools. “ Grok 4 potentially better than phd level in every subject no exceptions ” “ discover new technologies maybe this year and new physics certainly within 2 years ” [image]
@garymarcus Gary Marcus on x
15.9% on a test that humans are near 100% (ARC-AGI-2) yet supposed to be smarter than any phd student 🤔
@apples_jimmy @apples_jimmy on x
Grok 4 15.9% on the arc agi 2 benchmark [image]
@nickadobos Nick Dobos on x
Grok 4 announcement recap I watched 1 hour of an awkward rambling demo so you don't have to! - 2 new models, grok 4 and grok 4 heavy. - Reasoning only models. Non reasoning is removed. - Insanely good benchmarks. Significant jumps & new records. Seems to be #1 on [image]
@artificialanlys @artificialanlys on x
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI's o3, Google's Gemini 2.5 Pro and Anthropic's Claude 4 Sonnet - but lower than Anthropic's Claude 4 Opus and OpenAI's o3-pro. [image]
@altryne Alex Volkov on x
Grok-4 is the single agent version and Grok-4 Heavy is the multi agent version. 50.7% on HLE is WILD! 🤯 [image]
@aravsrinivas Aravind Srinivas on x
Grok 4 benchmarks look incredible! Look forward to integrating the smartest models directly on Perplexity Max as well letting it run agentic tasks on Comet!
@ahmedomar_1993 Ahmed Omar on x
Yup, ensemble. just like what we did here: https://arxiv.org/...
@scobleizer Robert Scoble on x
What does Grok 4 being smarter matched with an extraordinary voice that is being demoed now mean? It means my Tesla is about to become far more interesting while it drives me to AI startups in San Francisco. Wow.
@basedbeffjezos @basedbeffjezos on x
Artificial Superautistic Intelligence: ~1/4th the score of humans on ARC AGI ~10x the score of humans on HLE Grok 4 is a cracked autist confirmed. [image]
@emollick Ethan Mollick on x
It looks like scale + tool use + multimodal remains the chosen path forward.
@techleadhd @techleadhd on x
Sorry, but Grok 4 seems useless tbh... Just more of the same. More benchmarks, more AI slop, autistic product-deaf engineers, nothing usable whatsoever. It's like saying, “a calculator is smarter than humans, the future is scary.” Ok, fine. Gonna buy some more Bitcoins.
@sudoraohacker Arun Rao on x
Grok 4 has impressive scores on many benchmarks (GPQA, HLE, AIME25, Artificial Analysis, etc) but has noticeably not been posted on @lmarena_ai yet. The Vending Bench results are the most tantalizing-this may be the precursor use case to automating lots of white collar office [im…
@apples_jimmy @apples_jimmy on x
Leads the vending bench evals [image]
@minimaxir Max Woolf on x
Wow the voice demo is an order of magnitude worse than GPT-4o from last year
r/singularity on reddit
Grok 4 livestream
