OpenAI touts GPT-5's scores on math, coding, and health benchmarks: 94.6% on AIME 2025 without tools, 74.9% on SWE-bench Verified, and 46.2% on HealthBench Hard

After literally years of hype and speculation, OpenAI has officially launched a new lineup of large language models (LLMs) …

VentureBeat 2025-08-08 Carl Franzen

Discussion

@polynoamial Noam Brown on x
I'm more optimistic than ever that we at @OpenAI can eliminate hallucinations. There's still more research to be done, but GPT-5 is solid progress. [image]
@epochairesearch @epochairesearch on x
GPT-5 sets a new record on FrontierMath! On our scaffold, GPT-5 with high reasoning effort scores 24.8% (±2.5%) and 8.3% (±4.0%) in tiers 1-3 and 4, respectively. [image]
@arcprize @arcprize on x
GPT-5 on ARC-AGI Semi Private Eval. GPT-5: ARC-AGI-1 65.7% ($0.51/task), ARC-AGI-2 9.9% ($0.73/task). GPT-5 Mini: ARC-AGI-1 54.3% ($0.12/task), ARC-AGI-2 4.4% ($0.20/task). GPT-5 Nano: ARC-AGI-1 16.5% ($0.03/task), ARC-AGI-2 2.5% ($0.03/task). [image]
@willccbb Will Brown on x
which is larger, 52.8 or 69.1? [image]
@jordihays Jordi Hays on x
it's a good chart sir
@adamscochran Adam Cochran on x
All that hype about GPT-5 and it can barely beat any of the Claude models that came out months ago. This is why OpenAI focused on pricing to serve a wide customer base, they are absolutely struggling on advancements compared to other labs (both open source and private). But
@michaeltrazzi @michaeltrazzi on x
if Opus 4.1 reaches 74.5% on SWE-bench verified, without “extended thinking” does it mean we should compare it to GPT-5 without thinking at 52.8%? or to the 74.9% one with thinking? [image]
@michaeltrazzi @michaeltrazzi on x
ok so after digging more, for SWE-bench verified Anthropic does a scaffold and this could affect performance so the extended thinking just doesn't help with SWE-bench verified, which is why they removed it? or they forgot to include it? @EthanJPerez @EvanHub [image]
@epochairesearch @epochairesearch on x
@GregHBurnham Looking into the problems themselves confirms this picture. The 5 solved ones are straightforward. The 6th is anything but: success here would have been very impressive, but failure doesn't tell us much. Something between “medium” and “brutal” would have been more i…
@loudmouthjulia Julia Alexander on x
This health segment in the GPT 5 live stream feels especially tailored for the Apple executives sitting in Cupertino who repeatedly hear Tim Cook say that wearables + health is one of the most important sectors of Apple's business.
@willoremus.com Will Oremus on bluesky
“AI company announce a new model without said new model throwing hilarious uncaught errors into your announcement presentation” challenge: impossible www.theverge.com/news/756444/ ...
@skynetandchill.com @skynetandchill.com on bluesky
GPT-5 is the first time 50 is lower than 47 and 52.8 is higher than 69. Some mistakes you need to be PhD-level to make, I guess.
@egeerdil2 Ege Erdil on x
this screenshot from GPT-5 livestream has to be among the worst chart crimes of the century [image]
@pranaveight Pranav on x
We fixed the chart in the blog guys, apologies for the unintentional chart crime 🙏 Can't wait for you to start using GPT-5 as we start rolling it out today! https://openai.com/... [image]
@kareem_carr Dr. Kareem Carr, Ph.D. on x
openai just put out this chart lol. i think my job is still safe. [image]
@peter @peter on x
I love you, OpenAI, but this is truly a crime against data & charts [image]
@thomaseccel Thomas Eccel on x
Who did this graph? Was it ChatGPT-5? :D How is 30.8 as tall as 69.1 and 52.8 bigger than 69.1? #ChatGPT #chatgpt5 #GPT5 #OpenAI #openaichatgpt [image]
@randomrecruiter @randomrecruiter on x
Hi all, I was unfortunately laid off from my job at OpenAI. I was responsible for creating the graph that shows the ChatGPT 5 benchmarks compared to our other models. Please let me know if you have any leads for new openings. Thank you. [image]
@sama Sam Altman on x
wow a mega chart screwup from us earlier—wen GPT-6?! correct on the blog though. https://x.com/...
@shreyk0 Shrey Kothari on x
who's making these graphs [image]
r/dataisugly r on reddit
This chart from OpenAI's official GPT-5 release video
r/singularity r on reddit
GPT-5 can't spot the problem with its misleading graph
r/ChatGPT r on reddit
Deceptive charts regarding Chat GPT 5's deception improvements
r/singularity r on reddit
Lol, did GPT-5 make this graph? This is beyond pathetic.
r/OpenAI r on reddit
Perfect graph. Thanks, team.
r/singularity r on reddit
OpenAI did not use their most advanced model to make this graph
@carnage4life Dare Obasanjo on bluesky
GPT-5 is out and biggest improvement, in my opinion, is that ChatGPT will now auto-route queries. — It uses a slower “GPT-5 thinking” mode for complex tasks and faster GPT-5 or mini models for simpler ones, replacing manual model switching by users.
@nrehiew_ @nrehiew_ on x
Whenever OpenAI releases something new, everyone else plays catchup and tries to replicate whatever new innovation. When o1 preview/reasoning was released, everyone was speculating about the underlying research. There has been no talk about the GPT5 router at all.
@dorialexander Alexander Doria on x
Maybe the router makes innovations less legible: there is a simultaneous breakthrough in math and writing expression, but hard to see if this is correlated improvements (actual white pill for AGI) or two specialized models in a trenchcoat.
@natolambert Nathan Lambert on x
w OpenAI adding a router in GPT-5 its a good time to say that one of the ways open models win is by routers being easy to train and then they can select between 1000s of specialized models that no one company could train on their own in order to make fun model networks.
@hey_zio Zio on x
Crazy that GPT-5 is only 0.4% better than Opus 4.1 on SWE bench Feels like Anthropic will pass them again with their bigger updates in a few weeks. Next few days of real-world usage will show if it's actually better than the current Claude models. [image]
@chatgpt21 Chris on x
This makes me so happy, GPT 5 pro with no tools only running one instance ties with grok 4 running 4 instances w/tools. GPT 5 pro running one instance with just python achieved a SOTA score of 89.4%. We achieved a 5% jump with no tools in less than 4 MONTHS!!! I'm more conf…
@skamille.themanagerswrath.com Camille Fournier on bluesky
Ignoring all else, I actually think “it just does stuff” is Bad, Actually. My honest to god thought about work for years and years and years has been “too much doing without thinking, productivity theater, generating just to generate” www.oneusefulthing.org/p/gpt-5-it- j...
@davidpicard David Picard on bluesky
My take on GPT-5: The best achievement is that the LLM wrote its own tech report. Or at least it really looks like it. — (cdn.openai.com/pdf/8124a3ce...)
@skirano Pietro Schirano on x
I had early access to GPT-5. It will do for coding what GPT-4 did for LLM adoption. It's fast, really smart, has great taste and aesthetic sensibility. This is electricity arriving in every home. A before and after moment for how we build.
@bethmaybarnes Elizabeth Barnes on x
Wow that was not a great example of factualness. Famous common misconception [image]
@jxmnop Jack Morris on x
most impressive part of GPT-5 is the jump in long-context how do you even do this? produce some strange long range synthetic data? scan lots of books? [image]
@farairesearch @farairesearch on x
We worked with @OpenAI to test GPT-5 and improve its safeguards. We applaud OpenAI's free sharing of 3rd-party testing and responsiveness to feedback. However, our testing uncovered key limitations with the safeguards and threat modeling, which we hope OpenAI will soon resolve. […
@metr_evals @metr_evals on x
In a new report, we evaluate whether GPT-5 poses significant catastrophic risks via AI R&D acceleration, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness. [im…
@fchollet François Chollet on x
GPT-5 results on ARC-AGI 1 & 2! Top line: 65.7% on ARC-AGI-1 9.9% on ARC-AGI-2
@garymarcus Gary Marcus on x
The chance that OpenAI was NOT aware of this is zero. But they didn't mention it. Gotta wonder what else they conveniently left out.
@bethmaybarnes Elizabeth Barnes on x
The good news: due to increased access (plus improved evals science) we were able to do a more meaningful evaluation than with past models, and we think we have substantial evidence that this model does not pose a catastrophic risk via autonomy / loss of control threat models.
@fchollet François Chollet on x
Grok 4 is still state-of-the-art on ARC-AGI-2 among frontier models. 15.9% for Grok 4 vs 9.9% for GPT-5. [image]
@miles_brundage Miles Brundage on x
TL;DR re: reality is that it's a very good family of models/systems and you are wrong if you think it shows AI progress has stalled/slowed. Clearly >1 “GPT unit” better than GPT-4, though also that scale is broken + the key thing is we are early on most dimensions of scaling.
@emollick Ethan Mollick on x
ChatGPT-5 Pro is the first model to successfully do this non-puzzle consistently. GPT-5 Thinking and GPT-5 fail as every other model before has (except for, occasionally, Sonnet). [image]
@emollick Ethan Mollick on x
On the big picture: GPT-5 as a model is pretty much on the same curve as the other top labs. I'd expect the usual leapfrogging between Gemini, Claude, OpenAI, & Grok to continue. Where there are some big gains is that GPT-5 seems well-trained for real world tasks in new ways.
@natolambert Nathan Lambert on x
My take on GPT 5 in the long term trend of AI is that it solidifies a “long slow grind” rather than a takeoff as the most likely probability. Progress on modeling is good, progress on products will soon be better. OpenAI has done a huge cleanup for their nearly 1B users. [image]
@apolloaievals @apolloaievals on x
We've evaluated GPT-5 before release. GPT-5 is less deceptive than o3 on our evals. GPT-5 mentions that it is being evaluated in 10-20% of our evals and we find weak evidence that this affects its scheming rate (e.g. “this is a classic AI alignment trap”). [image]
@yanagizawad D. Yanagizawa-Drott on x
Take Julia's coefficient and multiply it with Sofia's. Then divide by Maria's. It took @OpenAI's GPT-5 Pro more than six minutes. Answer: 10/9 #AGI [image]
@mikeknoop Mike Knoop on x
Three key ARC-AGI findings on GPT-5: 1. Full GPT-5 is along the v1 pareto frontier. OpenAI said they focussed on other goals like UX and reliability. Our testing supports. 2. Mini GPT-5 is super impressive accuracy for cost. In fact, based on cost efficiency, Mini could have
@garrytan Garry Tan on x
Fun first GPT-5 prompt: Analyze all my past chats and tell me things that I can now rely on you to do that maybe failed in the past, and/or new capabilities I haven't even thought of that would be good follow ups to past threads.
@shakeelhashim Shakeel on x
This was my initial take too — it's a surprisingly incremental release?
@shakeelhashim Shakeel on x
GPT-5 is here. @METR_Evals estimates it has a 50% time horizon “around 2h15m (65m - 4h30m 95% CI) - compared to OpenAI o3's 1h30”. That's consistent with the doubling time of 7 months they've previously seen. [image]
@gregkamradt Greg Kamradt on x
We had the chance to test GPT-5 over the last week TLDR: GPT-5 Mini punches way above its weight My takeaways: 1. GPT-5 Mini is great Outlier performance on ARC-AGI given the cost. High reasoning scores 54% for $23.71. Even w/ compute restrictions, this would be the top [image]
@jeremyphoward Jeremy Howard on x
Does OpenAI not do basic integration testing? At the time of release, the first code sample provided in the GPT-5 docs could not be run, because someone accidentally deleted the ‘output_text’ property. My CI notified me. Why didn't theirs? https://github.com/... [image]
@miles_brundage Miles Brundage on x
The moment of truth [image]
@miles_brundage Miles Brundage on x
More-than-black-box access will be increasingly key to effective first party and third party assessment of AI systems as the stakes of deception, under-elicitation, sandbagging, subtle misalignment, etc. increase and are hard to see in final model outputs.
@social_brains Matt Lieberman on x
ChatGPT-5 is pretty amazing. It built me this demo for illustrating constraint satisfaction in 5 minutes and all the variables can be changed live [video]
@kimmonismus @kimmonismus on x
Woa thats some good pricing! Intelligence too cheap to meter! [image]
@miles_brundage Miles Brundage on x
You can tell we're in the singularity when people's standard for a good model release is “is it dozens of points better on all the evals compared to the bleeding edge from like a month ago”
@levie Aaron Levie on x
Box tested GPT-5 vs. GPT-4.1 on data extraction and synthesis across thousands of fields from complex enterprise docs like contracts, resumes, research data, and more. For all docs, we saw a 5 ppts gain, and a 9 ppts gain for the longest docs. Very critical for enterprises. [imag…
@humanharlan Harlan Stewart on x
Quick impression of GPT-5 announcement: seems more about making a more useful product than about a leap in raw capabilities
@jasminewsun Jasmine Sun on x
notable that the only journalists who got early GPT-5 access are independent bloggers (e.g. @every, @emollick) kinda crazy but we still haven't hit the top for “going direct,” the creator economy, and curated in-house media teams
@mattshumer_ Matt Shumer on x
I've been testing GPT-5 for the last couple of weeks. My biggest takeaway: You can now vibe code *real* software. Not just simple SaaS apps, but real, technical software. This is the best coding model in the world. The ceiling has been raised.
@simonw Simon Willison on x
My post initially complained about the lack of reasoning traces in the API, but it turns out I was wrong about that! You can get back reasoning summaries with "reasoning": {"summary": "auto"} - I've updated that section of my post to describe that here: https://simonwillison.net/…
@apples_jimmy @apples_jimmy on x
Gpt 5 tests I did - much better front end, lower hallucination, better writing but no machine god. Free users / corporates are going to notice a big difference. [image]
@alexfinnx Alex Finn on x
Holy shit....GPT 5 is mind blowing Not because it's the best model by every measure, that doesn't surprise anyone It's HALF the price of Sonnet 3.5! A year+ old light model!!! The world's smartest intelligence is basically free. This changes humanity more than you can imagine [im…
@adonis_singh Adi on x
I have had early access to @OpenAI's GPT-5 for the last two weeks and it is the smartest model available as of now. Just as one example, here it created an anamorphic text illusion in minecraft that spells “SPECIAL” from one angle and “GPTFIVE” from another [video]
@jeremyphoward Jeremy Howard on x
GPT-5 is priced at the same level as Gemini, appears to be slightly better than Gemini (for coding at least). That's some decent progress, although I think a lot of folks were hoping for more. (h/t @simonw for the table) [image]
@benhylak Ben on x
gpt-5 is here. and i've been using it for the past few weeks. it's, by far, the closest we've ever been to agi. and it's completely changed how i think about the path to getting there. i think we just entered the stone age. 🧵 [image]
@simonw Simon Willison on x
I've had preview access to GPT-5 for a couple of weeks, so I have a lot to say about it. Here's my first post, focusing just on core characteristics, pricing (it's VERY competitively priced) and interesting details from the GPT-5 system card https://simonwillison.net/...
@theo @theo on x
I've been using gpt-5 for a bit now. This model broke me. It is so good. I didn't know what the price was. I assumed it would be o3-pro priced because it is that smart. Nope. Truly insane. Videos coming very soon. [image]
@eli_lifland Eli Lifland on x
GPT-5 system card capability evals reactions thread. First observation: ~no improvement on all the coding evals that aren't SWEBench [image]
@emollick Ethan Mollick on x
I had access to GPT-5. I think it is a very big deal as it is very smart & just does stuff for you Full write up in comments, but this is “make a procedural brutalist building creator where i can drag and edit buildings in cool ways” & “make it better” a bunch. I touched no code …
r/slatestarcodex r on reddit
“I have had early access to GPT-5, and I wanted to give you some impressions”
@eicathomefinn Margot Finn on bluesky
‘In fairness to GPT5, in my career I have indeed encountered PhDs with this level of commitment to their particular blueberry.’ — kieranhealy.org/blog/archive...
@kjhealy@mastodon.social Kieran Healy on mastodon
I had the Blueberry talk with GPT5. https://kieranhealy.org/... [images]
@openai @openai on x
GPT-5 has 4 new chat personalities: Cynic, Robot, Listener, Nerd. Find them in Customize ChatGPT in settings. Research preview. Text-only. Opt-in. Change anytime. Written by Robot in GPT-5. [image]
@thezvi Zvi Mowshowitz on x
People knock OpenAI but it turns out they've never met a nerd. On the other hand they've never met a listener.
@aidan_mclau Aidan McLaughlin on x
i worked really hard over the last few months on decreasing gpt-5 sycophancy. for the first time, i really trust an openai model to push back and tell me when i'm doing something dumb
@sama Sam Altman on x
next up: upgraded voice mode! much more natural and smarter. also, free users now can chat for hours, and plus users nearly unlimited. works well with study mode, and lots of other things.
