OpenAI unveils o3 and o3-mini, trained to “think” before responding via what OpenAI calls a “private chain of thought”, and plans to launch them in early 2025
OpenAI announced its new o3 models on Friday. — In a tweet ahead of its final livestream for its …
TechCrunch Maxwell Zeff
Related Coverage
- OpenAI o3 and o3-mini—12 Days of OpenAI: Day 12 OpenAI on YouTube
- The OpenAI Christmas special: o3 is totally an intelligence, guys! For extremely specific values of ‘intelligent’ Pivot to AI
- Early access for safety testing OpenAI
- OpenAI's o3 model shows major gains through reinforcement learning scaling The Decoder · Matthias Bastian
- OpenAI teases its ‘breakthrough’ next-generation o3 reasoning model Digital Trends · Andrew Tarantola
- OpenAI announces o3 and o3-mini, its next simulated reasoning models Ars Technica · Benj Edwards
- OpenAI Skips o2 and Debuts New o3 ‘Reasoning’ Model Gizmodo · Thomas Maxwell
- OpenAI unveils o3, its next ‘reasoning’ model Quartz · Britney Nguyen
- OpenAI teases new reasoning model—but don't expect to try it soon The Verge · Kylie Robison
- OpenAI is done with Shipmas and staring down daunting challenges for 2025 CNBC · Hayden Field
- OpenAI continues to redefine the industry when it comes to AI. Its new “o3” reasoning model, which it now claims is the future compared to LLMs, scored 87.5% on the ARC-AGI benchmark, which tests if an AI has achieved artificial general intelligence. … @carnage4life@mas.to · Dare Obasanjo
- With the preview announcement by OpenAI of their o3 model class, we can reflect on 2024 as a year of extraordinary progress in AI. … Richard Searle
- I won't join the breathless speculation about whether this is or is not AGI - I'm old enough to remember when the AI goalposts would move every … Mike Lanzetta
- Big news in AI this week: OpenAI just announced their latest model, o3, and it's a game-changer. — The o3 model is designed to solve problems … Costa Kladianos
- Oh, oh, oh indeed. We are excited to share o3, OpenAI's newest reasoning model. It's a massive leap forward, outperforming the toughest of benchmarks … Jan Beke
- AGI? OpenAI has released a new model that they claim (and independent evaluation was done) reached one of the most challenging benchmarks at its best. … Stefan Bauschard
- o3, AGI, the art of the demo, and what you can expect in 2025 Marcus on AI · Gary Marcus
- OpenAI soft-launches AGI with o3 models, Enters Next Phase of AI Analytics India Magazine · Aditi Suresh
- Knowledge nugget of the day: Jetson Orin Nano Super The Indian Express · Khushboo Kumari
- OpenAI o3 Model Is a Message From the Future: Update All You Think You Know About AI The Algorithmic Bridge · Alberto Romero
- OpenAI's o3: The grand finale of AI in 2024 Interconnects · Nathan Lambert
- OpenAI's o3 model aced a test of AI reasoning - but it's still not AGI New Scientist · Jeremy Hsu
- OpenAI o3 breakthrough high score on ARC-AGI-PUB. François Chollet is the co-founder … Simon Willison's Weblog · Simon Willison
- CES 2025: Now With More AI — This is “The AI Economy,” a weekly LinkedIn-first newsletter … Ken Yeung
- Sam Altman says OpenAI's new o3 ‘reasoning’ models begin the ‘next phase’ of AI. Is this AGI? Fortune · Sharon Goldman
- OpenAI introduces o3 and o3 Mini reasoning models Neowin · Pradeep Viswanathan
- The AI talent wars are just getting started The Verge · Alex Heath
- OpenAI announces o3 and o3 mini reasoning models Mashable · Cecily Mauran
- OpenAI unveils o3, its most advanced reasoning model yet The Decoder · Matthias Bastian
- “OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%. … @remixtures@tldr.nettime.org · Miguel Afonso Caetano
- Last night, OpenAI released its latest model, breaking the barrier of solving new problems not previously in the training data. 👇🏽 … Anette Novak
- 75% on ARC-AGI semi-private dataset is insanely good. — Some key takeaways from this article by Chollet: … John Chong Min Tan
- I have been skeptical of LLM reasoning for some time, and have never been on the hype train. But it is impossible now to not be stunned by the progress of OpenAI's soon to be released o3 model. … Rama Vasudevan
- OpenAI's new o3 model scored a breakthrough 87.5% on the ARC-AGI benchmark for general intelligence. … Sébastien Riopel-Murray
- OpenAI O3 breakthrough high score on ARC-AGI-PUB Hacker News
- OpenAI Preps ‘o3’ Reasoning Model The Information
- OpenAI unveils new o3 model: What is it and how is it different from o1? The Indian Express · Bijin Jose
- Recap of OpenAI Highlights Key Updates in 12-Day “Shipmas” InfoQ · Daniel Dominguez
- OpenAI showcases o3, o3-mini models with better reasoning capabilities and more Moneycontrol
- ChatGPT Maker OpenAI Drops o3 Reasoning Model As o1's Successor: Greg Brockman Calls It A ‘Breakthrough’ Benzinga · Ananya Gairola
- OpenAI Upgrades Its Smartest AI Model With Improved Reasoning Skills Wired · Will Knight
- Here's everything OpenAI announced in the past 12 days Digital Trends · Andrew Tarantola
- 12 Days of OpenAI ends with a new model for the new year TechRadar · Eric Hal Schwartz
- OpenAI teases its most powerful reasoning model named o3 Android Headlines · Arthur Brown
- OpenAI unveils o3 and o3 mini — here's why these ‘reasoning’ models are a giant leap Tom's Guide · Amanda Caswell
- OpenAI unveils ‘o3’ reasoning AI models in test phase Reuters · Jaspreet Singh
Discussion
-
@wertwhile
Joel Wertheimer
on bluesky
OpenAI's new o3 model is apparently a big step forward (though still very expensive). It is a bit funny that one of the things AI will do is just replace a lot of software engineering. “Learn to code” maybe had it exactly backwards in terms of jobs of the future. techcrunch.com…
-
@benedictevans
Benedict Evans
on threads
It is fascinating to watch AI labs redefine ‘AGI’ as ‘a good model’ in real time
-
@fchollet
François Chollet
on x
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task […
-
@openai
@openai
on x
Day 12: Early evals for OpenAI o3 (yes, we skipped a number) https://openai.com/...
-
@polynoamial
Noam Brown
on x
We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue. [image]
-
@mikeknoop
Mike Knoop
on x
o3 is really special and everyone will need to update their intuition about what AI can/cannot do. while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI semiprivate v1 scores: * GPT-2 (2019): 0% * GPT-3 (2020): 0% * GPT-4
-
@levie
Aaron Levie
on x
OpenAI just announced o3, their new reasoning model that appears to perform insanely well across benchmarks. There are simply no signs of a slow down in AI right now. [image]
-
@deedydas
Deedy
on x
OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet. This is an absolutely superhuman result for AI and technology at large. [image]
-
@fchollet
François Chollet
on x
So, is this AGI? While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI — there's still a fair number of very easy ARC-AGI-1 tasks that o3 can't solve, and we have early indications that ARC-AGI-2 will remain ext…
-
@gdb
Greg Brockman
on x
o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.
-
@balajis
Balaji
on x
The Frontier Math benchmark is challenging for Fields Medalists. The state of the art stood at 2% just 2 months ago. It has now been shattered by o3. [image]
-
@fchollet
François Chollet
on x
One very important thing to understand about the future: the economics of AI are about to change completely. We'll soon be in a world where you can turn test-time compute into competence — for the first time in the history of software, marginal cost will become critical.
-
@emostaque
Emad
on x
Any work that can be done on the other side of a computer screen, AI will be able to do at a fraction of the price It's not even about creativity and coming up with recipes like a chef or novel code like a distinguished engineer AI will follow guides better, superior cook 🧑🍳
-
@wgussml
William
on x
o3 enables OpenAI to release a gpt4.5 trained on synthetic data from o1/3 that outperforms o1 without undercutting itself except o3-mini performs at cost/speed par with gpt-4o I've said this the whole time: gpt-5 is the friends we've made along the way
-
@iterintellectus
Vittorio
on x
serious question what should a CS student (or any knowledge worker for that matter) do at this point? even if the model is $2000/month, it's still cheaper than a graduate employee what's the plan now? [image]
-
@emostaque
Emad
on x
My take on o3: the global economy is cooked, we need a new economic and societal framework.
-
@dylan522p
Dylan Patel
on x
“At first you go reasoning slowly, then all at once.” - Noam Browningway
-
@byrnehobart
Byrne Hobart
on x
Pretty good odds in my opinion that come Monday morning, a few people are going to realize that they took a comically expensive early vacation. Maybe I'm wrong, but this seems like a straightforward victory for team trendlines-on-log-graphs.
-
@levie
Aaron Levie
on x
The step function leap in capability for OpenAI's o3 model looks insane. And importantly, the cost per task will inevitably go down precipitously with hardware and model improvements over time. [image]
-
@sama
Sam Altman
on x
seemingly somewhat lost in the noise of today: on many coding tasks, o3-mini will outperform o1 at a massive cost reduction! i expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be really strange.
-
@dylan522p
Dylan Patel
on x
Motherfuckers were market buying Nvidia stock cause OpenAI O3 is so fucking good
-
@andrewcurran_
Andrew Curran
on x
The training time between o1 and o3 was only three months, which means o4 is on track for March 2025. Look at the jump between these two generations - if that kind of progress continues every three months, it will be hard to keep calling this a slow takeoff.
-
@abacaj
Anton
on x
open source models are kind of cooked, if it takes this much compute to get the right answer for complex questions there's no shot you can run that “locally”
-
@byrnehobart
Byrne Hobart
on x
Take the Nvidia bear case where OpenAI does this entirely on their own silicon. You still have to assume that Jensen has already gotten phone calls from people collectively representing $100bn+ in annual capex.
-
@byrnehobart
Byrne Hobart
on x
Of all the weird things this market cycle, Nvidia trading at roughly the same price it was when the OpenAI announcement hit has to be the weirdest.
-
@swyx
@swyx
on x
I haven't seen enough people draw lines on the o3-mini chart. OpenAI has found a SECOND scaling law and it is roughly 3x steeper than the o-full models what the eff folks [image]
-
@sherwinwu
Sherwin Wu
on x
Lots of buzz around the o3 ARC-AGI result, but the AIME / Codeforces results are a lot more meaningful to me personally. As someone who spent basically all of middle/high school tryharding at competition math - seeing o3 blow past my best showings is... humbling to say the least …
-
@amasad
Amjad Masad
on x
Based on benchmarks, OpenAI's o3 seems like a genuine breakthrough in AI. Maybe a start of a new paradigm. But what new is also old: under the hood it might be Alpha-zero-style search and evaluate. The author of ARC-AGI benchmark @fchollet speculates on how it works: [image]
-
@amasad
Amjad Masad
on x
I bet Google will fast follow o3 because Demis talked about “AlphaZero-mechanism on top of LLMs” back in February.
-
@alkalinesec
@alkalinesec
on x
o1 appears to easily identify the issue in this code. others have also noted it can correctly determine the input to crash crackaddr. i have also made small modifications of this code and crackaddr to try to trip it up. it still gets it right. [image]
-
@brian_armstrong
Brian Armstrong
on x
Mad respect for OpenAI's progress, but if this version naming is some sort of IQ test, I am stumped on what comes next in the sequence.... 4o -> o1 -> o3 -> ?
-
@kevinroose
Kevin Roose
on x
this is (I think?) a joke, but man it is hard to impress upon non-SF, non-tech people how much the vibe has shifted here and how short timelines are, even people who used to be skeptical of AI progress twice this month I've been asked at a party, “are you feeling the AGI?”
-
@venturetwins
Justine Moore
on x
“They're calling 12 Days of Shipmas a flop? Bring out the AGI.” [image]
-
@amir
Amir Efrati
on x
I think OpenAI's reason for naming it o3 means it can no longer be considered a startup 😀
-
@creatine_cycle
Atlas
on x
the cycle is as follows: >new OpenAI Release >CS majors put on suicide watch and welding courses are saturated >wait this thing is woke >agi definition changes >deloitte headcount grows by 3% >new OpenAI Release
-
@garrytan
Garry Tan
on x
Codegen of this quality in the hands of everyone with a computer is a revelation
-
@avischiffmann
Avi
on x
The only reason OAI doesn't call o3 AGI is because they must want to continue their Microsoft partnership
-
@norabelrose
Nora Belrose
on x
If OpenAI's new o3 model is “successfully aligned,” then it could probably be trusted to supervise more powerful models, allowing us to bootstrap to benevolent superintelligence.
-
@stevenheidel
Steven Heidel
on x
it's beginning to look a lot like AGI🎄
-
@dbreunig
Drew Breunig
on x
What can we take-away from Dec's LLM blitz and o3's arrival? 1. The best models will think longer 2. We're gonna need more reasoning training data 3. There will be an increased focus on inference in 25 4. Builders need to stay flexible; a cheaper or better model arrives tomorrow
-
@kevinweil
Kevin Weil
on x
Day 12: ✨ o3 ✨ o3 is bonkers good, a massive step up from o1 on every one of our hardest benchmarks. It's not ready yet (we're starting safety testing, including with external red teamers) but I'm excited for all of the new products it'll enable when it launches.
-
@krishnanrohit
Rohit
on x
The continuing dominance of AI capabilities are going to further exacerbate the bimodal distribution of outcomes. Capital-havers will dominate as price of labour is lowered. Self-starters will dominate as prize from experimentation increases. Middle gets squeezed.
-
@aravsrinivas
Aravind Srinivas
on x
The biggest winner of the o3 announcement [image]
-
@dbreunig
Drew Breunig
on x
A big question for me is: with o3-class teacher models, how good can we make lightweight models?
-
@benjedwards
Benj Edwards
on x
I think we should call o1-style AI models “simulated reasoning” or “SR models,” since they don't reason like humans, but they do simulate a type of artificial reasoning process that can produce useful results https://arstechnica.com/...
-
@scobleizer
Robert Scoble
on x
OpenAI sets a high bar for 2025. Grok will exceed it. I studied @DickFosbury1. He told me he started worst on his high school track team at the sport of high jumping. His name is now on the sport because he came up with a new way to jump.
-
@sullyomarr
@sullyomarr
on x
Btw I don't say this lightly but SWE in the traditional sense is dead in < 2 years You will still need smart, capable engineers But anything that involves raw coding and no taste is done for o6 will build you basically anything
-
@_aidan_clark_
Aidan Clark
on x
@ren_hongyu killed it To recap the demo (I'm still sweating), o3-mini wrote its own ChatGPT UI to talk to *itself* via the OpenAI API, we asked o3-mini to write and execute a script in this UI to evaluate *itself* on GPQA, and the resulting script correctly returned 61%. [image]
-
@dorialexander
Alexander Doria
on x
Good show overall. OpenAI does frontier, Google industrial/product scaling and this will trickle down on the entire ecosystem. 2025 looks fine.
-
@inafried
Ina Fried
on x
Here's our story on o3 — the new @OpenAI model that insiders and outsiders are citing as proof that generative AI has not hit a scaling wall. https://www.axios.com/...
-
@garymarcus
Gary Marcus
on x
The fan boys who are declaring victory now clearly never went to graduate school, where you learn to pick apart a bunch of graphs and ask hard questions. Like, what does the top left graph here tell us? [image]
-
@krishnanrohit
Rohit
on x
I want a job at openai now purely to play with o3
-
@garymarcus
Gary Marcus
on x
Excuse me but today was not the mic-drop AGI moment. That moment *will* come. Not soon.
-
@garymarcus
Gary Marcus
on x
Exactly. What we saw today was GPT-4.15, not GPT-5. People were promising me AGI this morning; what I saw ain't that.
-
@miles_brundage
Miles Brundage
on x
Does this mean humans will be able to add 0 value in these areas? Not necessarily - knowing the problem to solve requires insight/data from other domains, and it may be like chess/Go where there's a centaur period where humans can *occasionally* help even if weaker head-to-head.
-
@garymarcus
Gary Marcus
on x
Nice try, but this is goal post shifting. When I proposed the “wall” I didn't talk about ARC. I talked about hallucinations, compositionality, boneheaded errors, etc. Today's demo didn't address any of those and was very light on applications outside of benchmarks in closed
-
@dmdohan
David Dohan
on x
imo the improvements on FrontierMath are even more impressive than ARC-AGI. Jump from 2% to 25% Terence Tao said the dataset should “resist AIs for several years at least” and “These are extremely challenging. I think that in the near term basically the only way to solve them,
-
@krishnanrohit
Rohit
on x
[image]
-
@vikhyatk
Vik
on x
openai spent more money to run an eval on arc-agi than most people spend on a full training run
-
@krishnanrohit
Rohit
on x
Good work OpenAI on dropping a banger on Day 12. Still got the fight!
-
@krishnanrohit
Rohit
on x
Calling time that open source models will soon enough get to this benchmark too, they know the path now
-
@garymarcus
Gary Marcus
on x
From what I can tell from the o3 announcement, all three of these predictions will prove to be correct. The ARC results were particularly impressive. But note that there were no demos around everyday reasoning and that the strongest results were around math problems and coding,
-
@qivshi1
@qivshi1
on x
O3 will be cheap if we just have the will to build the servers...
-
@aj_kourabi
@aj_kourabi
on x
o3 does 25.2% on Frontier Math. Previous models barely got 2%. Here are some sample questions. It is a hard eval (and unpublished). Progress is not slowing down. [image]
-
@jtlin
Justin Lin
on x
Wow, o3 is quite the “holiday update” from @OpenAI. Like @Tesla FSD v13, can't wait to try. But sounds like it might be a while before general availability for both. 😞
-
@sebastienbubeck
Sebastien Bubeck
on x
o3 and o3-mini are my favorite models ever. o3 essentially solves AIME (>90%), GPQA (~90%), ARC-AGI (~90%), and it gets 1/4th of the Frontier Maths. To understand how insane 25% on Frontier Maths is, see this quote by Tim Gowers. The sparks are intensifying ... [image]
-
@itsallentropy
Essam
on x
seems like o3 can solve novel visual problems now, pretty cooooool.
-
@aj_kourabi
@aj_kourabi
on x
o3 results robustly showcase how the piece I helped write with semianalysis is right on so many of the critical topics, take a look if you have not already https://semianalysis.com/...
-
@ryanmorrisonjer
Ryan Morrison
on x
o3 will develop and train o4. This is the start of the takeover.
-
@vikhyatk
Vik
on x
have to say i like this shift from “intelligence too cheap to meter” [image]
-
@modestproposal1
@modestproposal1
on x
can beat the machines for now 🦾
-
@0xkarmatic
Karma
on x
Resurfacing this for you people after looking at the o3 evals results
-
@sporadicalia
@sporadicalia
on x
December 5th: o1 released December 20th: o1 made obsolete by o3 is this what acceleration feels like?
-
@krishnanrohit
Rohit
on x
Google, please drop Gemini Experimental 1206 Thinking model, make this a race!
-
@yuris
Yuri Sagalov
on x
“OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs.” @fchollet [imag…
-
@benbajarin
Ben Bajarin
on x
Is the thinking, at the moment, that these long reasoning models are primarily an enterprise thing? Because if we are going to be paying too much for agent work, consumers are not going to use them.
-
@sahir2k
Saint
on x
yo @sama gimme o3 access , i'm a safety researcher ( i'm interested in my job safety)
-
@silasalberti
Silas Alberti
on x
o3 scores a 2727 ELO on Codeforces which places it 175th in the global ranking. That's better than ~99.9% of humans on the website (who already tend to be far above average). [image]
-
@altryne
Alex Volkov
on x
o3-mini will support everything that o1 supports, function calling, structured outputs while being significantly cheaper! [image]
-
@krishnanrohit
Rohit
on x
Holy shit they did it! Never been more vindicated that AI has not hit a wall ... [image]
-
@apples_jimmy
@apples_jimmy
on x
Openai has the Mandate of Heaven.
-
@sparkycollier
@sparkycollier
on x
paging @sundarpichai please report to the lab
-
@_fabknowledge_
@_fabknowledge_
on x
Jesus the price
-
@mckaywrigley
Mckay Wrigley
on x
There is absolutely no situation in which you will outcompete someone who is using o3 and you are not. This clearly seems like the model that will begin to actually spark a real AGI debate. Based on the numbers they're showing today? Not sure I'd argue against it.
-
@rowancheung
Rowan Cheung
on x
NEW: OpenAI just announced ‘o3’, a breakthrough AI model that significantly surpasses all previous models in benchmarks. —On ARC-AGI: o3 more than triples o1's score on low compute and surpasses a score of 87% —On EpochAI's Frontier Math: o3 set a new record, solving 25.2% of [im…
-
@miles_brundage
Miles Brundage
on x
I've been saying recently that completely superhuman AI math and coding by end of 2025 was plausible - 50/50 or so. Now I'd say it's much more likely than not (o3 is already better than almost all humans).
-
@dmdohan
David Dohan
on x
o3 @ 87.5% on ARC-AGI It was 16 hours at an increase rate of 3.5% an hour to “solved” [image]
-
@mckaywrigley
Mckay Wrigley
on x
@samirayubkhan @OpenAI Someone told me they'd be demoing a crazy Arc result (which they did). Got 87% on o3 high reasoning. Said they're going to continue to build improved version (tldr).
-
@bnj
Ben South
on x
o3: 87.5% Humans: 85% AGI confirmed [image]
-
@dorialexander
Alexander Doria
on x
Ah yes, OpenAI is really not messing around for the last day. Wild. [image]
-
@nim_chimpsky_
Nim
on x
o3 can solve 25% of research level mathematics questions designed by experts out of the box It's literally over lmao [image]
-
@justin_halford_
Justin Halford
on x
o3 got 87% on ARC Eval and 71% on SWE Bench (!) [image]
-
@fernandotherojo
Fernando Rojo
on x
OpenAI o3 absolutely crushing on the arc-agi eval. wow sorry @jerber888 there's a new #1 [image]
-
@seconds_0
@seconds_0
on x
Ok fine everything is forgiven @OpenAI o3 is an insane stepchange
-
@bobmcgrewai
Bob McGrew
on x
Congrats to the research team on the o3 and o3-mini announcements! These are great models. And, yes... you've reached a new high for OpenAI's tradition of truly terrible naming. 😂
-
@deedydas
Deedy
on x
Summary of all the CRAZY benchmark results from OpenAI's most advanced model, o3! SWE-Bench Verified: 71.7% Codeforces rating: 2727 Competition Math: 96.7% PhD level science (GPQA): 87.7% Frontier Math: 25.2% (previous best was 2%) ARC-AGI: 87.5% TRULY SUPERHUMAN. [image]
-
@sullyomarr
@sullyomarr
on x
yeah its over for coding with o3 this is mindboggling looks like the first big jump since gpt4, because these numbers make 0 sense [image]
-
@dorialexander
Alexander Doria
on x
Community reaction is predictable. [image]
-
@iscienceluvr
Tanishq Mathew Abraham, Ph.D.
on x
OpenAI also announces o3-mini! There are different thinking levels that can be chosen It's also quite fast, as demonstrated by their demos [image]
-
@arunv30
Arun Vijayvergiya
on x
O3 is truly incredible. Can't wait for people to try it.
-
@garymarcus
Gary Marcus
on x
Three o3 predictions: - People will initially be amazed - Once they dive in, they will see that it is not reliable - It will work best in closed-domains (like math problems) but less reliably in open-ended domains (like everyday reasoning about the real world).
-
@emollick
Ethan Mollick
on x
Independent evaluations of OpenAI's o3 suggest that it passed benchmarks that were previously considered far out of reach for AI, including achieving a score on ARC-AGI that was associated with actually achieving AGI (though the creators of the benchmark don't think o3 is AGI)
-
@adonis_singh
Adi
on x
o3 is a LARGE model
-
@basedbeffjezos
@basedbeffjezos
on x
o3 it is. [image]
-
@karanganesan
Karan Ganesan
on x
it's crazy we now have thinking time as a parameter for llm apis open ai o3, o3 mini soon what a time to be alive
-
@mattshumer_
Matt Shumer
on x
These o3 evals are... absolutely insane
-
@armanddoma
Armand Domalewski
on x
uh, wow. o3 [image]
-
@smokeawayyy
@smokeawayyy
on x
o3 scored 75.7% on ARC-AGI Public Benchmark within compute requirements. o3 scored 87.5% in High compute mode. Human performance is 85%. [image]
-
@kevinweil
Kevin Weil
on x
o3 with a major breakthrough on ARC-AGI 🤯
-
@dorialexander
Alexander Doria
on x
It gets even wilder. [image]
-
@yanatweets
Yana Welinder
on x
o3 has scored over 85% on ARC-AGI 🤯 Human performance is at 85% In other words: AGI has been achieved in 2024 2025 is going to be wild! [image]
-
@natolambert
Nathan Lambert
on x
OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today. Including ARC AGI. Rip to people who say any of “progress is done,” “scale is done,” or “llms cant reason” 2024 was awesome. I love my job. [image]
-
@deedydas
Deedy
on x
OpenAI o3 does 71.7% on SWE-Bench verified!! And 2727 on Codeforces! Unbelievable. [image]
-
@garymarcus
Gary Marcus
on x
Shocker: “we are not going to make these publicly available today”
-
@jacobandreou
Jacob Andreou
on x
ok well o3 is AGI hope everyone had fun [image]
-
@altryne
Alex Volkov
on x
💥 Evals for @openai new frontier o3 are CRAZY 🤯 25.2 on Frontier Math 96.7 on GPQA 71.7 on SWE-bench verified SOTA on @arcprize - an insane 87.5% (on high compute) !! (Above human performance at 85% threshold!) [image]
-
@modestproposal1
@modestproposal1
on x
“OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs.”
-
@arcprize
@arcprize
on x
New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4 [image]
-
@apples_jimmy
@apples_jimmy
on x
Arc-agi BEATEN AT 87.5% !!!
-
@vikhyatk
Vik
on x
looks like we're only getting some evals today? when is o3 rolling out?
-
@altryne
Alex Volkov
on x
@OpenAI @arcprize o3-mini evals - with more thinking time, increasing ELO, outperforming o1 while being much cheaper. [image]
-
@iscienceluvr
Tanishq Mathew Abraham, Ph.D.
on x
Breaking news! OpenAI announces o3! 🔥 Achieves new SOTA on ARC-AGI of 87.5%! Significant jump in the hardest frontier math benchmark (EpochAI) Achieves SOTA on other technical benchmarks like AIME and GPQA-Diamond [image]
-
r/artificial
r
on reddit
One-Minute Daily AI News 12/20/2024
-
@garymarcus
Gary Marcus
on bluesky
How many times do we have to see this same movie, where an AI beats some benchmark and influencers gleefully shout “It's So Over” without even trying out the AI and then on careful inspection the AI turns out to not be robust or reliable? — Thousands? — (It's already been hun…
-
@simonwillison.net
Simon Willison
on bluesky
By far the best coverage of o3 is this essay by François Chollet, it's crammed with interesting insights beyond just reporting on the benchmark score: arcprize.org/blog/oai-o3-... Published my own notes on that here: simonwillison.net/2024/Dec/20/ ...
-
@julianharris
Julian Harris
on bluesky
OpenAI announced o3 that is significantly better than previous systems, according to an independent benchmark org (The Arc Prize) that apparently got access. — Only thing is it's wildly wildly expensive to run. Like its top end system is around $10k per TASK. — arcprize.org/…
-
@zachweinersmith
Zach Weinersmith
on bluesky
This seems like a really big deal? arcprize.org/blog/oai-o3-... Chollet was pretty skeptical this would happen soon, even as of a few months ago. — Skeptics, whatcha got?
-
@stevenheidel.com
Steven Heidel
on bluesky
it's beginning to look a lot like AGI 🎄 arcprize.org/blog/oai-o3-...
-
@luokai
@luokai
on threads
Even though o3 has tackled many challenges that most average humans can't solve, there's still a long road ahead on the journey to AGI. It still messes up on tasks that are pretty simple for humans, revealing a huge gap between where it is and true AGI. …
-
@mergesort
Joe Fabisevich
on threads
Something OpenAI has realized in a way no other foundation model lab has is that your model can be amazing, superhuman even, but if it's not packaged into a product people can use it might as well not exist. ChatGPT supporting Apple Notes is worth far more to people than 100 mor…
-
@moskov
Dustin Moskovitz
on threads
Francois Chollet has long been an LLM skeptic - he seems to be coming around. I wonder if others will follow?
-
@crumbler
Casey Newton
on threads
“All intuition about AI capabilities will need to get updated for o3,” says the founder of the ARC prize — designed to be very difficult for LLMs to solve — after OpenAI took the benchmark from 5% with o1 to 85% today with o3 https://arcprize.org/...
-
@taulogicai
Tau
on x
@GaryMarcus @fchollet Well said. Passing ARC-AGI highlights progress in solving specific challenges, not achieving AGI. True AGI requires reasoning, adaptability, and guaranteed correctness across all tasks, not just isolated benchmarks. The distinction cannot be overstated.
-
@lennysan
Lenny Rachitsky
on x
The best benchmark for tracking progress towards AGI. 2025 is going to be wild.
-
@fchollet
François Chollet
on x
Deep learning did hit that wall, and the natural answer to get past it was deep learning plus search. AI research is about to enter its deep-learning guided program synthesis (or CoT synthesis) arc.
-
@garymarcus
Gary Marcus
on x
Important words from @fchollet: “it is important to note that ARC-AGI is not an acid test for AGI - as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over
-
@tszzl
Roon
on x
“easy for humans, hard for ai” is not a solid design principle for evals imo it leads you towards “judging a fish by how far it can climb a tree” absurdities but maybe it's one orthogonal eval style among many equally important ones
-
@denny_zhou
Denny Zhou
on x
any benchmark—including ARC-AGI—can be rapidly solved, as long as the task provides a clear evaluation metric that can be used as a reward signal during fine-tuning.
-
@eshear
Emmett Shear
on x
If you haven't figured out the joke yet, *any* fixed benchmark will fall rapidly the instant it becomes an optimization target for the top labs. Correspondingly, no fixed benchmark is AGI. AGI is the ability to generalize to an adversarially chosen new benchmark.
-
@garymarcus
Gary Marcus
on x
Starting to feel like most people using the term AGI have lost sight of what the G actually stands for.
-
@burkov
Andriy Burkov
on x
How to achieve AGI in 2024:
1. Define a benchmark with puzzles and call it “The AGI Testing Benchmark.”
2. Fine-tune a VLM to solve these puzzles.
3. Declare the AGI achievement.
-
@sauers_
Sauers
on x
The total compute cost was around $1,600,250, more than the entire prize [image]
-
@geofflewisorg
Geoff Lewis
on x
Think about what becomes *more* valuable in a post-AI world, for we are at its doorstep.
-
@tszzl
Roon
on x
specifically, arc agi visual tasks look like nonsense in JSON format and multi-modality isn't great, and the character manipulation tasks don't work for the same reason models mess up the how many “r”s in strawberry problem (tokenization/BPE)
-
@tszzl
Roon
on x
it's almost adversarially constructed wrt input modalities. for a model to solve these requires a far higher level of intelligence than an equivalent human score would suggest
-
@fchollet
François Chollet
on x
It will also be extremely important to analyze the strengths and limitations of the new system. Here are some examples of tasks that o3 couldn't solve on high-compute settings (even as it was generating millions of CoT search tokens and consuming thousands of dollars of compute) […]
-
@miaai_builder
Mia Bookworm
on x
@Techmeme The o3 model's ability to generate and execute its own programs is a massive leap forward in AI capabilities, I'm loving the potential for novel task adaptation!
-
@fchollet
François Chollet
on x
The limitations of specific techniques are predictable and correspondingly lead to plateaus for those techniques. But there is always the next technique, building on top of the pile that's already available. There is enough research investment that there will be no wall.
-
@miles_brundage
Miles Brundage
on x
I'm old enough to remember when getting double-digit scores on FrontierMath was considered super hard. I'm 6 weeks old
-
@simonw
Simon Willison
on x
Absolutely the most interesting thing I've read so far about o3 is this essay by @fchollet https://arcprize.org/... [image]
-
@dbreunig
Drew Breunig
on x
Creating reasoning data for training - by humans or synthetically - will continue to rise in importance.
-
@fchollet
François Chollet
on x
What does this mean for the future of AGI research? For me, the main open question is where the scaling bottlenecks for the techniques behind o3 are going to be. If human-annotated CoT data is a major bottleneck, for instance, capabilities would start to plateau quickly like […]
-
@fchollet
François Chollet
on x
Two other examples. You can find the full testing data here: https://github.com/... If the topic interests you, try analyzing this data. [image]
-
r/agi
r
on reddit
OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
-
r/mlscaling
r
on reddit
OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
-
r/MachineLearning
r
on reddit
[D] OpenAI o3 87.5% High Score on ARC Prize Challenge
-
r/singularity
r
on reddit
FULL O3 TESTING REPORT
-
@justicar.xyz
Glenn White
on bluesky
“Out of respect” for not getting their asses sued to hell and back. [embedded post]
-
@quinnypig.com
Corey Quinn
on bluesky
And if there's one thing OpenAI is renowned for, it's respecting the intellectual property of others. [embedded post]
-
@aquariusacquah
Ken
on x
as it was written in the ancient texts [image]
-
@fchollet
François Chollet
on x
Cost-efficiency will be the overarching measure guiding deployment decisions. How much are you willing to pay to solve X? The world is once again going to run out of GPUs.
-
@sama
Sam Altman
on x
if you are a safety researcher, please consider applying to help test o3-mini and o3. excited to get these out for general availability soon. extremely proud of all of openai for the work and ingenuity that went into creating these models; they are great.
-
r/slatestarcodex
r
on reddit
OpenAI Unveils More Advanced Reasoning Model in Race With Google