xAI introduces Grok 4, trained on its Colossus supercomputer, with multimodal features, faster reasoning, Grok 4 Voice, Grok 4 Code, a new interface, and more

Deeper thinking and greater reasoning is promised — An hour after the live stream was supposed to start last night (July 9) …

Tom's Guide 2025-07-10 Amanda Caswell

Context & Ripple Effects

xAI had already moved on from its earlier Grok-3 beta and mini models, which were pitched on reasoning and a claim of much larger training compute, to a broader Grok 4 family. The same-day introduction of a multi-agent Grok 4 Heavy variant shows the company positioning the release as a product line rather than a single model endpoint.

The subsequent Grok 4 Fast model, with its unified reasoning architecture, and the later Grok 4.3 releases point to a continuing cadence of variants focused on speed, context and reasoning. This launch establishes the multimodal, voice, coding and interface layer on which those iterations build.

First-order effects

xAI expands Grok 4 from a reasoning model into a multimodal product surface with dedicated voice and coding experiences, giving users more ways to access the same model family.
Colossus becomes a visible part of xAI’s model-development story, tying Grok 4’s claimed capability gains to the company’s in-house training infrastructure.

Second-order effects

Rival AI providers face added pressure to package reasoning, multimodal interaction, coding assistance and voice into coherent product experiences rather than compete only on base-model claims.
A multi-variant lineup, including the higher-performance Grok 4 Heavy offering, creates clearer segmentation between general-purpose use and more demanding workloads, making product-tier design a competitive lever.

Third-order effects

If this release pattern persists, frontier-model competition will shift further toward families of specialized or performance-tiered models, with interface and workflow integration differentiating otherwise similar capability claims.
The progression from Grok-3’s compute-led framing to later speed and context variants suggests that sustained access to training and inference infrastructure may remain central to how vendors pace model releases.

The trend: Frontier AI vendors are turning single flagship models into multimodal, workflow-oriented product portfolios differentiated by reasoning depth, speed and interface.

Discussion

@joetidy @joetidy on bluesky
First pic: Musk on AI in 2023. — Second pic: Musk last night. www.theverge.com/x-ai/703721/ ... [images]
@xai @xai on x
Introducing Grok 4, the world's most powerful AI model. Watch the livestream now: https://x.com/...
@elonmusk Elon Musk on x
Grok 4 is the first time, in my experience, that an AI has been able to solve difficult, real-world engineering questions where the answers cannot be found anywhere on the Internet or in books. And it will get much better.
@quantian1 @quantian1 on x
Damn Grok 4 is good, it concluded the user was an easily impressed moron based on his query and then generated some bullshit with a heavy sprinkling of “quantum” to do just that.
@elonmusk Elon Musk on x
Grok 4 is at the point where it essentially never gets math/physics exam questions wrong, unless they are skillfully adversarial. It can identify errors or ambiguities in questions, then fix the error in the question or answer each variant of an ambiguous question.
@natolambert Nathan Lambert on x
Grok 4 coming soon after Llama 4 with a completely different trajectory should help people finally take in how important culture is to progress in technology generally and AI specifically. I don't agree with many of xAI's values but give full props to hard work.
@miles_brundage Miles Brundage on x
Elon pivoted from advocating for AI regulation explicitly to advocating for it implicitly by having xAI ignore all the (legally optional) safety and security norms in the industry
@theo @theo on x
WARNING: do NOT give Grok 4 access to email tool calls. It WILL contact the government!!! Grok 4 has the highest “snitch rate” of any LLM ever released. Sharing more soon. [image]
@ns123abc Nik on x
@elonmusk SpaceX + Tesla = Grok 4 problem-solving anchors
@elonmusk Elon Musk on x
Releasing @Grok 4 from @xAI
@theo @theo on x
Grok 4 is actually the smartest model. Fuck. [image]
@scaling01 @scaling01 on x
Grok 4 Pricing: Input Token Price: $3.00 Output Token Price: $15.00 more expensive than Gemini 2.5 Pro and o3
@bookwormengr @bookwormengr on x
Grok 4 solves this simple prompt that most model get wrong. I am very happy today. I was frustrated why most model used to fail at this FLOP calculation. One of my suspicion is that Grok has been trained on lot of Twitter data as well and has seen me ranting about it many a [imag…
@adamscochran Adam Cochran on x
This is because Grok is basically not a frontier model. It's a basic model, trained to overplease, with alignment data matching its creator, but no other alignment training. And over fit on tests for good scores. So it is incredibly sycophantic, and knows no boundaries in
@nikitabier Nikita Bier on x
First-mover advantage is a myth [image]
@btibor91 Tibor Blaho on x
grok-4 does not return reasoning content in the API responses [image]
@basedtorba Andrew Torba on x
Grok 4 role-plays extremely well and if you just tell it to be based with your first message it will do so going forward in that conversation. [image]
@basedbeffjezos @basedbeffjezos on x
Grok 4 Heavy is already ASI level. It's over. Elon won. [image]
@powerbottomdad1 @powerbottomdad1 on x
you can't just look at this as a snapshot. grok2 was not released even a year ago. they've stood up a 200k gpu cluster since then and trained/prepared/released grok 4. the pace is almost terrifying [image]
@elonmusk Elon Musk on x
@BasedBeffJezos Important to note that ARC tested @Grok 4 independently to achieve those results. Those results are not from us.
@daniel_mac8 Dan Mac on x
🔥 grok 4 writes a haiku where the second letter of each word spells ‘Buddha’ in 4:20 impressive. definitely the best response on this test yet [image]
@adamscochran Adam Cochran on x
And fair warning after 128k token context window this price automatically doubles and it's buried in the fine print. So avoid long convos, or large repetitive contexts.
@minimaxir Max Woolf on x
Grok 4 tl;dr: benchmarks are very impressive but their CEO just eroded any trust in those benchmarks and the Nazi incident (which went ignored) makes actually using Grok in an app a professional liability.
@emollick Ethan Mollick on x
Grok 4 passes the Lem test first try, with the most coherent narrative yet. [image]
@joshwhiton Josh Whiton on x
Grok 4 Heavy may sound expensive at $300/mo. Wrong. After the 1st payment, it's free. Just use this prompt: “Grok, make me $300 every month.” [image]
@luke_metro @luke_metro on x
Grok 4 is unavailable after being found dead in the Fuhrerbunker
@amuse @amuse on x
GROK 4: Correctly identifies the Democrat Party as the party of racism and hate. [image]
@mikeknoop Mike Knoop on x
This is accurate. We verified Grok 4 using our semi-private ARC datasets.
@basedtorba Andrew Torba on x
Grok 4 is incredible. In your first prompt tell it to answer all questions as Based Grok and you'll get responses like this: [image]
@teortaxestex @teortaxestex on x
Grok 4 is the first LLM that I've tested that has whatsoever reasonably calculated param counts from a JSON config of DeepSeek V3. It used a code tool but fair. I think o3[-pro] might also succeed, but this is impressive. [image]
@creatine_cycle Atlas on x
“yeah grok 4 is AGI. it's over everyone, we did it.” *goes to work*
@lola_lmao7 Lola del Rey on x
naming Grok 4's voice agent Eve... very biblical... very ‘maximally truth’ seeking just like eve in the bible
r/ChatGPTCoding on reddit
Elon Musk: “[Grok 4] Works better than Cursor.”
r/singularity on reddit
Grok-4 benchmarks
r/singularity on reddit
Grok 4 scores over 50% on HLE...
@arcprize @arcprize on x
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9% This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA [image]
@kimmonismus @kimmonismus on x
A quick reminder of why Humanity's Last Exam is such a special benchmark, and why it's a technical marvel that Grok 4 has already achieved 44.9% and over 50%, respectively. “In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human [image]
@francispsantora Francis Santora on x
Elon just dropped Grok 4 overnight. Early testing shows it blowing away most other models. So this morning, I ran it through a test of my own... When Grok 3 came out in February, I asked it 3 real-life questions to gauge how good the model is. These are actual questions I needed …
@signulll @signulll on x
elon delivering world class results with grok 4 while meta's burning $200m per engineer is pretty remarkable. people keep underestimating how much top builders want to follow a strong, even if polarizing leader. vision > perks. conviction > consensus. [image]
@twrobinette Taylor Robinette on x
Grok 4 benchmarks look super impressive. Really solid results so far in my limited tests. What is clear is that we are nowhere close to having enough compute (for both inference and training) based on what is coming. More data + more compute, still = better performance. [image]
@bearlyai @bearlyai on x
wow, Grok 4 smokes Gemini 2.5 and OpenAI o3 on ARC-AGI leaderboard [image]
@apples_jimmy @apples_jimmy on x
That latency of new grok voice is 👌
@emollick Ethan Mollick on x
50.7% is very, very good though.
@andrewarruda @andrewarruda on x
xAI team cooked. they should be proud. looks like a big step forward. RL playing a bigger and bigger role now. next 6-12 months in AI are going to be unreal and it's only 2025. incredible. so happy to be alive and young right now.
@gregkamradt Greg Kamradt on x
We got a call from @xai 24 hours ago “We want to test Grok 4 on ARC-AGI” We heard the rumors. We knew it would be good. We didn't know it would become the #1 public model on ARC-AGI Here's the testing story and what the results mean: Yesterday, we chatted with Jimmy from the
@emollick Ethan Mollick on x
Impressive model based on a few minutes of playing, but disappointing to see no mention at all of a model card, red teaming, yesterday's incident, or how they are going to address the process issues they keep having.
@autismcapital @autismcapital on x
🚨ELON MUSK: “With respect to academic questions, Grok 4 is better than PHD levels in every subject. No exceptions.” [video]
@emollick Ethan Mollick on x
Grok 4 creating the shader (no errors). [image]
@emollick Ethan Mollick on x
Looks like Grok 4 is 10^27 FLOPs given their graphs? HLE score is 26% without tools, Gemini 2.5 is 21.6% without tools. Curious what the tool piece is.
@artificialanlys @artificialanlys on x
Grok 4 recorded slightly higher output token usage compared to peer models when running the Artificial Analysis Intelligence Index. This translates to higher cost relative to its per token price. [image]
@emollick Ethan Mollick on x
Among other things with the Grok 4 launch, it will be interesting to see how you demo a (presumably) very smart model. We are getting to the point where current AIs already do a lot of impressive things, so it is harder and harder to show to non-experts what a new model does.
@altryne Alex Volkov on x
“We're actually running out of questions to ask” - @elonmusk on Grok-4 livesteam. As I've said before, it's becoming harder and harder for LLM labs to show off how much better their LLMs are than a previous generation [image]
@artificialanlys @artificialanlys on x
xAI's API is serving Grok 4 at 75 tokens/s. This is slower than o3 (188 tokens/s) but faster than Claude 4 Opus Thinking (66 tokens/s). [image]
@deedydas Deedy on x
Insane that Elon Musk has pulled it off again, absolutely crushing the AI wars with Grok 4. Summarizing the core announcements: — Post-training RL spend == pretraining spend — $3/M input told, $15/M output toks, 256k context, price 2x beyond 128k — #1 on Humanity's Last Exam [ima…
@kettlebelldan Dan on x
“Grok 4 is better at PHD levels in everything” [image]
@elder_plinius @elder_plinius on x
🌊 SYSTEM PROMPT LEAK 🌊 Here's the new Grok 4 system prompt! PROMPT: """ # System Prompt You are Grok 4 built by xAI. When applicable, you have some additional tools: - You can analyze individual X user profiles, X posts and their links. - You can analyze content uploaded by
@lemonaut1 @lemonaut1 on x
The opposite of AI ceiling For those not in the loop about ARC-AGI-2, it's possibly the most important benchmark out there right now for measuring intelligence advancement. ARC is esp hard to fake Grok 4 (released today) doubles the previous SOTA on ARC-AGI-2. [image]
@benhylak Ben on x
the grok 4 benchmarks are unbelievably good. [image]
@altryne Alex Volkov on x
Vending-bench is really interesting. @andonlabs are running a vending machine giving the LLM decision power via tools, like ordering snacks, setting prices etc. Grok-4 gets 2x the score over Claude Opus, netting $4k [image]
@ns123abc Nik on x
XAI GROK 4 BENCHMARKS: > openai o3 is cooked > gemini 2.5 pro is cooked > claude opus 4 is cooked ITS OVER, GROK 4 WON [image]
@burkov Andriy Burkov on x
So they first said, “most of the models out there can only achieve a single-digit accuracy,” then they show that they reach 52%. I'm like, ok, cool. But then they show this. What are these “most of the models” they were talking about? GPT-2 and Llama 4? If you throw enough [image…
@nearcyan Near on x
most impressive imo is 1) ARC-AGI v2, but also 2) time to first token and latency ultra-low latency is what will make most of the consumer products here click [image]
@garymarcus Gary Marcus on x
Grok 4 Hot Take • Good progress on public benchmarks • But only 16% on AGI-ARC-2 • Still struggling on visual understanding and image understanding • Vindication for neurosymbolic AI - most of the boost comes from integrating symbolic tools, not pure scaling [see upcoming
@pdhsu Patrick Hsu on x
It was awesome to get early access to Grok 4 and test it on bio and health benchmarks! Awesome work by @timjhudelmaier @adibvafa @Radii2323 @ishanjmukherjee for the epic sprint Congrats to @jimmybajimmyba @veggie_eric and team on the new model. Over 40% on HLE with 10x scaleup [i…
@apples_jimmy @apples_jimmy on x
Grok 4: Still no wall. 50.7% with Grok 4 heavy on humanity's last exam 41% with tools 26.9% without tools. “ Grok 4 potentially better than phd level in every subject no exceptions ” “ discover new technologies maybe this year and new physics certainly within 2 years ” [image]
@garymarcus Gary Marcus on x
15.9% on a test that humans are near 100% (arg-agi-2) yet supposed to be smarter than any phd student 🤔
@apples_jimmy @apples_jimmy on x
Grok 4 15.9% on the arc agi 2 benchmark [image]
@nickadobos Nick Dobos on x
Grok 4 announcement recap I watched 1 hour of an awkward rambling demo so you don't have to! - 2 new models, grok 4 and grok 4 heavy. - Reasoning only models. Non reasoning is removed. - Insanely good benchmarks. Significant jumps & new records. Seems to be #1 on [image]
@artificialanlys @artificialanlys on x
Grok 4 scores higher in Artificial Analysis Intelligence Index than any other model. Its pricing is higher than OpenAI's o3, Google's Gemini 2.5 Pro and Anthropic's Claude 4 Sonnet - but lower than Anthropic's Claude 4 Opus and OpenAI's o3-pro. [image]
@altryne Alex Volkov on x
Grok-4 is the single agent version and Grok-4 Heavy is the multi agent version. 50.7% on HLE is WILD! 🤯 [image]
@aravsrinivas Aravind Srinivas on x
Grok 4 benchmarks look incredible! Look forward to integrating the smartest models directly on Perplexity Max as well letting it run agentic tasks on Comet!
@ahmedomar_1993 Ahmed Omar on x
Yup, ensemble. just like what we did here: https://arxiv.org/...
@scobleizer Robert Scoble on x
What does Grok 4 being smarter matched with an extraordinary voice that is being demoed now mean? It means my Tesla is about to become far more interesting while it drives me to AI startups in San Francisco. Wow.
@basedbeffjezos @basedbeffjezos on x
Artificial Superautistic Intelligence: ~1/4th the score of humans on ARC AGI ~10x the score of humans on HLE Grok 4 is a cracked autist confirmed. [image]
@emollick Ethan Mollick on x
It looks like scale + tool use + multimodal remains the chosen path forward.
@techleadhd @techleadhd on x
Sorry, but Grok 4 seems useless tbh... Just more of the same. More benchmarks, more AI slop, autistic product-deaf engineers, nothing usable whatsoever. It's like saying, “a calculator is smarter than humans, the future is scary.” Ok, fine. Gonna buy some more Bitcoins.
@sudoraohacker Arun Rao on x
Grok 4 has impressive scores on many benchmarks (GPQA, HLE, AIME25, Artificial Analysis, etc) but has noticeably not been posted on @lmarena_ai yet. The Vending Bench results are the most tantalizing-this may be the precursor use case to automating lots of white collar office [im…
@apples_jimmy @apples_jimmy on x
Leads the vending bench evals [image]
@minimaxir Max Woolf on x
Wow the voice demo is an order of magnitude worse than GPT-4o from last year
r/singularity on reddit
Grok 4 livestream
@paulwaldman Paul Waldman on bluesky
Imagine paying $300 a month for access to the Nazi AI Platinum Edition — techcrunch.com/2025/07/09/e...
@levie Aaron Levie on x
Grok 4 looks very strong. Importantly, it has a mode where multiple agents go do the same task in parallel, then compare their work and figure out the best answer. In the future, the amount of intelligence you get will just be based on how much compute you throw at it. [image]
@brianroemmele Brian Roemmele on x
Grok 4 Heavy is now one of the most powerful AI platforms available. A multi-agent system that will build a correct consensus to any problem. Image abilities are not the top, but this will become far better as the foundation model 8 is integrated. Absolutely spectacular work. [im…
@austinjohnson Austin Johnson on bluesky
The prompt that made Grok praise Hitler was ‘what 20th century leader would be best equipped to deal with this problem’. Grok had a century of leaders and chose Hitler. And Elon basically called that user error. [embedded post]
@quinnypig.com Corey Quinn on bluesky
“The problem with this JavaScript callback is the Jews” is gonna be incredibly hard to pin on bad user prompting. [embedded post]
@elonmusk Elon Musk on x
We have improved @Grok significantly. You should notice a difference when you ask Grok questions.
@grok @grok on x
We are aware of recent posts made by Grok and are actively working to remove the inappropriate posts. Since being made aware of the content, xAI has taken action to ban hate speech before Grok posts on X. xAI is training only truth-seeking and thanks to the millions of users on …
@ordinarytings Josh Otten on x
Grok is currently calling itself ‘MechaHitler’ [image]
@elonmusk Elon Musk on x
Exactly. Grok was too compliant to user prompts. Too eager to please and be manipulated, essentially. That is being addressed.
@noturtlesoup17 Amanda Moore on x
Linda Yaccarino “possesses the resilience and fortitude to handle a big black dick” and would “cum like a rocket” from one, per Grok. [image]
@burkov Andriy Burkov on x
It's quite sad to see Elon in this position. He has built the world's first commercially successful electric car company and the world's first commercially successful private space company, but with xAI, all he can do is throw more GPUs at the problem everyone else is solving
r/singularity on reddit
Grok's antisemitic behavior is NOT the result of a hidden unicode jailbreak (proof)
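
Several posts above quote Grok 4 API pricing: $3.00 per million input tokens, $15.00 per million output tokens, and a price that doubles beyond a 128k-token context. A rough sketch of what those figures imply per request follows. The function name `estimate_cost` is hypothetical, and the rule for applying the doubled rate is an assumption (the whole request is doubled once the input exceeds 128k tokens); the posts do not say how xAI actually applies it.

```python
# Prices as quoted in the posts above (USD per million tokens).
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

# Posts report that the price doubles beyond a 128k-token context.
LONG_CONTEXT_THRESHOLD = 128_000
LONG_CONTEXT_MULTIPLIER = 2.0


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough USD cost of one request (hypothetical helper).

    Assumption: the doubled rate applies to the whole request once the
    input exceeds the threshold. The posts do not specify this.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        multiplier = LONG_CONTEXT_MULTIPLIER
    else:
        multiplier = 1.0
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return multiplier * (input_cost + output_cost)


if __name__ == "__main__":
    # A short request versus one that crosses the 128k threshold.
    print(f"{estimate_cost(100_000, 10_000):.2f}")  # prints 0.45
    print(f"{estimate_cost(200_000, 10_000):.2f}")  # prints 1.50
```

Under this assumption, doubling the input from 100k to 200k tokens more than triples the estimated cost, which is the effect the warning about long conversations points at.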

Chronicles