OpenAI unveils o3 and o3-mini, trained to “think” before responding via what OpenAI calls a “private chain of thought”, and plans to launch them in early 2025
12 Days of OpenAI: Day 12

Naomi Li Gan / Tech in Asia: OpenAI unveils AI model for advanced reasoning
Bojan Stojkovski / Interesting Engineering: OpenAI unveils o3 reasoning AI model to tackle complex challenges, compete with Google
Matthias Bastian / The Decoder: OpenAI's o3 model shows major gains through reinforcement learning scaling
OpenAI: Early access for safety testing
Nathan Lambert / Interconnects: OpenAI's o3: The grand finale of AI in 2024
Moneycontrol: OpenAI showcases o3, o3-mini models with better reasoning capabilities and more
Stephanie Palazzolo / The Information: OpenAI Announces New Reasoning Model, 'o3'
ARC Prize: o3, trained on the ARC-AGI-1 Public Training set, scored 87.5% on ARC Prize's Semi-Private Evaluation in a high-compute configuration; GPT-4o scored 5% in 2024
Yahoo Finance: ChatGPT: Everything you need to know about the AI-powered chatbot
Bloomberg: OpenAI says safety researchers can sign up for o3 preview today and that it decided not to name the new model o2 "out of respect" for the UK telecom company
Kylie Robison / The Verge: OpenAI teases new reasoning model—but don't expect to try it soon
Hayden Field / CNBC: OpenAI is done with Shipmas and staring down daunting challenges for 2025

Bluesky:
Joel Wertheimer / @wertwhile: OpenAI's new o3 model is apparently a big step forward (though still very expensive). It is a bit funny that one of the things AI will do is just replace a lot of software engineering. "Learn to code" maybe had it exactly backwards in terms of jobs of the future. techcrunch.com/2024/12/20/o... …

Mastodon:
Miguel Afonso Caetano / @remixtures@tldr.nettime.org: "OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%." …
Dare Obasanjo / @carnage4life@mas.to: OpenAI continues to redefine the industry when it comes to AI. Its new "o3" reasoning model, which it now claims is the future compared to LLMs, scored 87.5% on the ARC-AGI benchmark, which tests whether an AI has achieved artificial general intelligence. …

Threads:
Benedict Evans / @benedictevans: It is fascinating to watch AI labs redefine 'AGI' as 'a good model' in real time

X:
François Chollet / @fchollet: Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task
@openai: Day 12: Early evals for OpenAI o3 (yes, we skipped a number) https://openai.com/...
@drjimfan: Thoughts about o3: I'll skip the obvious part (extraordinary reasoning, FrontierMath is insanely hard, etc). I think the essence of o3 is about *relaxing a single-point RL super intelligence* to cover more points in the space of useful problems. The world of AI is no stranger to
Boaz Barak / @boazbaraktcs: 1/5 Excited that our paper on "deliberative alignment" came out as part of 12 days of @openai! By teaching reasoning models the text of our specifications, and how to reason about them in context, we obtain significantly better robustness while also reducing over-refusals. 🧵
Greg Brockman / @gdb: a great use of reasoning models is to improve alignment
Noam Brown / @polynoamial: We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue.
Deedy / @deedydas: OpenAI o3 is 2727 on Codeforces, which is equivalent to the #175 best human competitive coder on the planet. This is an absolutely superhuman result for AI and technology at large.
François Chollet / @fchollet: So, is this AGI? While the new model is very impressive and represents a big milestone on the way toward AGI, I don't believe this is AGI — there's still a fair number of very easy ARC-AGI-1 tasks that o3 can't solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3.
@alkalinesec: o1 appears to easily identify the issue in this code. others have also noted it can correctly determine the input to crash crackaddr. i have also made small modifications of this code and crackaddr to try to trip it up. it still gets it right.
Mike Knoop / @mikeknoop: o3 is really special and everyone will need to update their intuition about what AI can/cannot do. while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI semi-private v1 scores: * GPT-2 (2019): 0% * GPT-3 (2020): 0% * GPT-4
Amir Efrati / @amir: I think OpenAI's reason for naming it o3 means it can no longer be considered a startup 😀
Aaron Levie / @levie: OpenAI just announced o3, their new reasoning model that appears to perform insanely well across benchmarks. There are simply no signs of a slowdown in AI right now.
Amjad Masad / @amasad: Based on benchmarks, OpenAI's o3 seems like a genuine breakthrough in AI. Maybe the start of a new paradigm. But what's new is also old: under the hood it might be AlphaZero-style search-and-evaluate. The author of the ARC-AGI benchmark, @fchollet, speculates on how it works:
Sam Altman / @sama: seemingly somewhat lost in the noise of today: on many coding tasks, o3-mini will outperform o1 at a massive cost reduction! i expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be really strange.
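The compute figures quoted above (a reported ~$20 per task in low-compute mode, and a 172x high-compute configuration) imply a rough per-task cost for the 87.5% run. A minimal back-of-the-envelope sketch, assuming cost scales roughly linearly with compute (OpenAI did not publish actual pricing):

```python
# Rough cost arithmetic from the publicly quoted figures.
# Assumptions: ~$20/task in low-compute mode (per ARC Prize / Chollet),
# 172x relative compute for the high-compute configuration,
# and cost scaling linearly with compute.
LOW_COMPUTE_COST_PER_TASK = 20.0  # USD per ARC-AGI task, low-compute mode
HIGH_COMPUTE_MULTIPLIER = 172     # relative compute of the 87.5% configuration

high_compute_cost = LOW_COMPUTE_COST_PER_TASK * HIGH_COMPUTE_MULTIPLIER
print(f"Implied high-compute cost: ~${high_compute_cost:,.0f} per task")
```

Under those assumptions, the high-compute run works out to thousands of dollars per task, which is consistent with the commentary below about eval runs costing more than many training runs.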
Atlas / @creatine_cycle: the cycle is as follows: >new OpenAI Release >CS majors put on suicide watch and welding courses are saturated >wait this thing is woke >agi definition changes >deloitte headcount grows by 3% >new OpenAI Release
Kevin Roose / @kevinroose: this is (I think?) a joke, but man, it is hard to impress upon non-SF, non-tech people how much the vibe has shifted here and how short timelines are, even among people who used to be skeptical of AI progress. twice this month I've been asked at a party, "are you feeling the AGI?"
Greg Brockman / @gdb: o3, our latest reasoning model, is a breakthrough, with a step-function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.
Andrew Curran / @andrewcurran_: The training time between o1 and o3 was only three months, which means o4 is on track for March 2025. Look at the jump between these two generations - if that kind of progress continues every three months, it will be hard to keep calling this a slow takeoff.
Amjad Masad / @amasad: I bet Google will fast-follow o3, because Demis talked about "AlphaZero-mechanism on top of LLMs" back in February.
Balaji / @balajis: The FrontierMath benchmark is challenging even for Fields Medalists. The state of the art stood at 2% just 2 months ago. It has now been shattered by o3.
@swyx: I haven't seen enough people draw lines on the o3-mini chart. OpenAI has found a SECOND scaling law and it is roughly 3x steeper than the o-full models. what the eff, folks
Sherwin Wu / @sherwinwu: Lots of buzz around the o3 ARC-AGI result, but the AIME / Codeforces results are a lot more meaningful to me personally. As someone who spent basically all of middle/high school tryharding at competition math, seeing o3 blow past my best showings is... humbling, to say the least
Byrne Hobart / @byrnehobart: Pretty good odds in my opinion that come Monday morning, a few people are going to realize that they took a comically expensive early vacation. Maybe I'm wrong, but this seems like a straightforward victory for team trendlines-on-log-graphs.
Byrne Hobart / @byrnehobart: Take the Nvidia bear case where OpenAI does this entirely on their own silicon. You still have to assume that Jensen has already gotten phone calls from people collectively representing $100bn+ in annual capex.
Byrne Hobart / @byrnehobart: Of all the weird things this market cycle, Nvidia trading at roughly the same price it was when the OpenAI announcement hit has to be the weirdest.
Nora Belrose / @norabelrose: If OpenAI's new o3 model is "successfully aligned," then it could probably be trusted to supervise more powerful models, allowing us to bootstrap to benevolent superintelligence.
Aaron Levie / @levie: The step-function leap in capability for OpenAI's o3 model looks insane. And importantly, the cost per task will inevitably go down precipitously with hardware and model improvements over time.
Emad / @emostaque: Any work that can be done on the other side of a computer screen, AI will be able to do at a fraction of the price. It's not even about creativity and coming up with recipes like a chef or novel code like a distinguished engineer. AI will follow guides better, superior cook 🧑‍🍳
Vittorio / @iterintellectus: serious question: what should a CS student (or any knowledge worker, for that matter) do at this point? even if the model is $2000/month, it's still cheaper than a graduate employee. what's the plan now?
Brian Armstrong / @brian_armstrong: Mad respect for OpenAI's progress, but if this version naming is some sort of IQ test, I am stumped on what comes next in the sequence... 4o -> o1 -> o3 -> ?
Anton / @abacaj: open source models are kind of cooked; if it takes this much compute to get the right answer for complex questions, there's no shot you can run that "locally"
Garry Tan / @garrytan: Codegen of this quality in the hands of everyone with a computer is a revelation
Steven Heidel / @stevenheidel: it's beginning to look a lot like AGI 🎄
Emad / @emostaque: My take on o3: the global economy is cooked, we need a new economic and societal framework.
François Chollet / @fchollet: One very important thing to understand about the future: the economics of AI are about to change completely. We'll soon be in a world where you can turn test-time compute into competence — for the first time in the history of software, marginal cost will become critical.
William / @wgussml: o3 enables OpenAI to release a gpt4.5 trained on synthetic data from o1/o3 that outperforms o1 without undercutting itself, except o3-mini performs at cost/speed par with gpt-4o. I've said this the whole time: gpt-5 is the friends we've made along the way
Justine Moore / @venturetwins: "They're calling 12 Days of Shipmas a flop? Bring out the AGI."
Dylan Patel / @dylan522p: "At first you go reasoning slowly, then all at once." - Noam Browningway
Dylan Patel / @dylan522p: Motherfuckers were market-buying Nvidia stock cause OpenAI o3 is so fucking good
Avi / @avischiffmann: The only reason OAI doesn't call o3 AGI is because they must want to continue their Microsoft partnership
Drew Breunig / @dbreunig: What can we take away from Dec's LLM blitz and o3's arrival? 1. The best models will think longer. 2. We're gonna need more reasoning training data. 3. There will be an increased focus on inference in '25. 4. Builders need to stay flexible; a cheaper or better model arrives tomorrow
Kevin Weil / @kevinweil: Day 12: ✨ o3 ✨ o3 is bonkers good, a massive step up from o1 on every one of our hardest benchmarks. It's not ready yet (we're starting safety testing, including with external red teamers) but I'm excited for all of the new products it'll enable when it launches.
Rohit / @krishnanrohit: The continuing dominance of AI capabilities is going to further exacerbate the bimodal distribution of outcomes. Capital-havers will dominate as the price of labour is lowered. Self-starters will dominate as the prize from experimentation increases. The middle gets squeezed.
Aravind Srinivas / @aravsrinivas: The biggest winner of the o3 announcement
Drew Breunig / @dbreunig: A big question for me is: with o3-class teacher models, how good can we make lightweight models?
Benj Edwards / @benjedwards: I think we should call o1-style AI models "simulated reasoning" or "SR models," since they don't reason like humans, but they do simulate a type of artificial reasoning process that can produce useful results https://arstechnica.com/...
Robert Scoble / @scobleizer: OpenAI sets a high bar for 2025. Grok will exceed it. I studied @DickFosbury1. He told me he started worst on his high school track team at the sport of high jumping. His name is now on the sport because he came up with a new way to jump.
@sullyomarr: Btw I don't say this lightly, but SWE in the traditional sense is dead in < 2 years. You will still need smart, capable engineers, but anything that involves raw coding and no taste is done for. o6 will build you basically anything
Aidan Clark / @_aidan_clark_: @ren_hongyu killed it. To recap the demo (I'm still sweating): o3-mini wrote its own ChatGPT UI to talk to *itself* via the OpenAI API, we asked o3-mini to write and execute a script in this UI to evaluate *itself* on GPQA, and the resulting script correctly returned 61%.
Alexander Doria / @dorialexander: Good show overall. OpenAI does frontier, Google industrial/product scaling, and this will trickle down to the entire ecosystem. 2025 looks fine.
Ina Fried / @inafried: Here's our story on o3 - the new @OpenAI model that insiders and outsiders are citing as proof that generative AI has not hit a scaling wall. https://www.axios.com/...
Gary Marcus / @garymarcus: The fanboys who are declaring victory now clearly never went to graduate school, where you learn to pick apart a bunch of graphs and ask hard questions. Like, what does the top-left graph here tell us?
Rohit / @krishnanrohit: I want a job at OpenAI now purely to play with o3
Gary Marcus / @garymarcus: Excuse me, but today was not the mic-drop AGI moment. That moment *will* come. Not soon.
Gary Marcus / @garymarcus: Exactly. What we saw today was GPT-4.15, not GPT-5. People were promising me AGI this morning; what I saw ain't that.
Miles Brundage / @miles_brundage: Does this mean humans will be able to add 0 value in these areas? Not necessarily - knowing the problem to solve requires insight/data from other domains, and it may be like chess/Go, where there's a centaur period in which humans can *occasionally* help even if weaker head-to-head.
Gary Marcus / @garymarcus: Nice try, but this is goalpost shifting. When I proposed the "wall" I didn't talk about ARC. I talked about hallucinations, compositionality, boneheaded errors, etc. Today's demo didn't address any of those and was very light on applications outside of benchmarks in closed
David Dohan / @dmdohan: imo the improvements on FrontierMath are even more impressive than ARC-AGI. Jump from 2% to 25%. Terence Tao said the dataset should "resist AIs for several years at least" and "These are extremely challenging. I think that in the near term basically the only way to solve them,
Vik / @vikhyatk: openai spent more money to run an eval on arc-agi than most people spend on a full training run
Rohit / @krishnanrohit: Good work OpenAI on dropping a banger on Day 12. Still got the fight!
Rohit / @krishnanrohit: Calling it now: open source models will soon enough get to this benchmark too; they know the path now
Gary Marcus / @garymarcus: From what I can tell from the o3 announcement, all three of these predictions will prove to be correct. The ARC results were particularly impressive. But note that there were no demos around everyday reasoning and that the strongest results were around math problems and coding,
@qivshi1: o3 will be cheap if we just have the will to build the servers...
Justin Lin / @jtlin: Wow, o3 is quite the "holiday update" from @OpenAI. Like @Tesla FSD v13, can't wait to try. But it sounds like it might be a while before general availability for both. 😞
Sebastien Bubeck / @sebastienbubeck: o3 and o3-mini are my favorite models ever. o3 essentially solves AIME (>90%), GPQA (~90%), ARC-AGI (~90%), and it gets a quarter of FrontierMath. To understand how insane 25% on FrontierMath is, see this quote by Tim Gowers. The sparks are intensifying...
@aj_kourabi: o3 does 25.2% on FrontierMath. Previous models barely got 2%. Here are some sample questions. It is a hard eval (and unpublished). Progress is not slowing down.
Essam / @itsallentropy: seems like o3 can solve novel visual problems now, pretty cooooool.
@aj_kourabi: o3 results robustly showcase how the piece I helped write with SemiAnalysis is right on so many of the critical topics; take a look if you have not already https://semianalysis.com/...
Ryan Morrison / @ryanmorrisonjer: o3 will develop and train o4. This is the start of the takeover.
Vik / @vikhyatk: have to say i like this shift from "intelligence too cheap to meter"
@modestproposal1: can beat the machines for now 🦾
Karma / @0xkarmatic: Resurfacing this for you people after looking at the o3 eval results
@sporadicalia: December 5th: o1 released. December 20th: o1 made obsolete by o3. is this what acceleration feels like?
Rohit / @krishnanrohit: Google, please drop a Gemini Experimental 1206 Thinking model, make this a race!
Yuri Sagalov / @yuris: "OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs." @fchollet
Ben Bajarin / @benbajarin: Is the thinking, at the moment, that these long-reasoning models are primarily an enterprise thing? Because if we are going to be paying too much for agent work, consumers are not going to use them.
Silas Alberti / @silasalberti: o3 scores a 2727 Elo on Codeforces, which places it 175th in the global ranking. That's better than ~99.9% of humans on the website (who already tend to be far above average).
Saint / @sahir2k: yo @sama gimme o3 access, i'm a safety researcher (i'm interested in my job safety)
Alex Volkov / @altryne: o3-mini will support everything that o1 supports (function calling, structured outputs) while being significantly cheaper!
Rohit / @krishnanrohit: Holy shit, they did it! Never been more vindicated that AI has not hit a wall...
@apples_jimmy: Openai has the Mandate of Heaven.
@sparkycollier: paging @sundarpichai, please report to the lab
@_fabknowledge_: Jesus, the price
Mckay Wrigley / @mckaywrigley: There is absolutely no situation in which you will outcompete someone who is using o3 and you are not. This clearly seems like the model that will begin to actually spark a real AGI debate. Based on the numbers they're showing today? Not sure I'd argue against it.
Rowan Cheung / @rowancheung: NEW: OpenAI just announced 'o3', a breakthrough AI model that significantly surpasses all previous models in benchmarks. On ARC-AGI: o3 more than triples o1's score on low compute and surpasses a score of 87%. On EpochAI's FrontierMath: o3 set a new record, solving 25.2% of
Miles Brundage / @miles_brundage: I've been saying recently that completely superhuman AI math and coding by end of 2025 was plausible - 50/50 or so. Now I'd say it's much more likely than not (o3 is already better than almost all humans).
David Dohan / @dmdohan: o3 @ 87.5% on ARC-AGI. It was 16 hours at an increase rate of 3.5% an hour to "solved"
Mckay Wrigley / @mckaywrigley: @samirayubkhan @OpenAI Someone told me they'd be demoing a crazy ARC result (which they did). Got 87% on o3 high reasoning. Said they're going to continue to build improved versions (tldr).
Nim / @nim_chimpsky_: o3 can solve 25% of research-level mathematics questions designed by experts, out of the box. It's literally over lmao
Ben South / @bnj: o3: 87.5%. Humans: 85%. AGI confirmed
Alexander Doria / @dorialexander: Ah yes, OpenAI is really not messing around for the last day. Wild.
Justin Halford / @justin_halford_: o3 got 87% on the ARC eval and 71% on SWE-bench (!)
@seconds_0: Ok fine, everything is forgiven @OpenAI. o3 is an insane step change
Bob McGrew / @bobmcgrewai: Congrats to the research team on the o3 and o3-mini announcements! These are great models. And, yes... you've reached a new high for OpenAI's tradition of truly terrible naming. 😂
Deedy / @deedydas: Summary of all the CRAZY benchmark results from OpenAI's most advanced model, o3! SWE-bench Verified: 71.7%. Codeforces rating: 2727. Competition math: 96.7%. PhD-level science (GPQA): 87.7%. FrontierMath: 25.2% (previous best was 2%). ARC-AGI: 87.5%. TRULY SUPERHUMAN.
Fernando Rojo / @fernandotherojo: OpenAI o3 absolutely crushing on the ARC-AGI eval. wow, sorry @jerber888, there's a new #1
@sullyomarr: yeah, it's over for coding with o3. this is mind-boggling. looks like the first big jump since GPT-4, because these numbers make 0 sense
Alexander Doria / @dorialexander: Community reaction is predictable.
Ethan Mollick / @emollick: Independent evaluations of OpenAI's o3 suggest that it passed benchmarks that were previously considered far out of reach for AI, including achieving a score on ARC-AGI that was associated with actually achieving AGI (though the creators of the benchmark don't think o3 is AGI)
Tanishq Mathew Abraham, Ph.D. / @iscienceluvr: OpenAI also announces o3-mini! There are different thinking levels that can be chosen. It's also quite fast, as demonstrated by their demos
Arun Vijayvergiya / @arunv30: o3 is truly incredible. Can't wait for people to try it.
Adi / @adonis_singh: o3 is a LARGE model
Gary Marcus / @garymarcus: Three o3 predictions: - People will initially be amazed - Once they dive in, they will see that it is not reliable - It will work best in closed domains (like math problems) but less reliably in open-ended domains (like everyday reasoning about the real world).
@basedbeffjezos: o3 it is.
Karan Ganesan / @karanganesan: it's crazy we now have thinking time as a parameter for LLM APIs. openai o3, o3-mini soon. what a time to be alive
Matt Shumer / @mattshumer_: These o3 evals are... absolutely insane
Armand Domalewski / @armanddoma: uh, wow. o3
@smokeawayyy: o3 scored 75.7% on the ARC-AGI public benchmark within compute requirements. o3 scored 87.5% in high-compute mode. Human performance is 85%.
Kevin Weil / @kevinweil: o3 with a major breakthrough on ARC-AGI 🤯
Alexander Doria / @dorialexander: It gets even wilder.
Yana Welinder / @yanatweets: o3 has scored over 85% on ARC-AGI 🤯 Human performance is at 85%. In other words: AGI has been achieved in 2024. 2025 is going to be wild!
Deedy / @deedydas: OpenAI o3 does 71.7% on SWE-bench Verified!! And 2727 on Codeforces! Unbelievable.
Nathan Lambert / @natolambert: OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today, including ARC-AGI. RIP to people who say any of "progress is done," "scale is done," or "LLMs can't reason". 2024 was awesome. I love my job.
Gary Marcus / @garymarcus: Shocker: "we are not going to make these publicly available today"
@modestproposal1: "OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs."
@arcprize: New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4
Jacob Andreou / @jacobandreou: ok well, o3 is AGI. hope everyone had fun
@apples_jimmy: ARC-AGI BEATEN AT 87.5%!!!
Alex Volkov / @altryne: 💥 Evals for @openai's new frontier o3 are CRAZY 🤯 25.2 on FrontierMath, 96.7 on GPQA, 71.7 on SWE-bench Verified, SOTA on @arcprize - an insane 87.5% (on high compute)!! (Above the 85% human-performance threshold!)
Tanishq Mathew Abraham, Ph.D. / @iscienceluvr: Breaking news! OpenAI announces o3! 🔥 Achieves new SOTA on ARC-AGI of 87.5%! Significant jump on the hardest frontier math benchmark (EpochAI). Achieves SOTA on other technical benchmarks like AIME and GPQA-Diamond
Vik / @vikhyatk: looks like we're only getting some evals today? when is o3 rolling out?
Alex Volkov / @altryne: @OpenAI @arcprize o3-mini evals - with more thinking time, increasing Elo, outperforming o1 while being much cheaper.

LinkedIn:
Douglas Hirsh: We are rapidly approaching the future I've been discussing since ChatGPT was first made public—a future where traditional coding may become obsolete. …
Richard Searle: With the preview announcement by OpenAI of their o3 model class, we can reflect on 2024 as a year of extraordinary progress in AI. …
Mike Lanzetta: I won't join the breathless speculation about whether this is or is not AGI - I'm old enough to remember when the AI goalposts would move every …
Costa Kladianos: Big news in AI this week: OpenAI just announced their latest model, o3, and it's a game-changer. The o3 model is designed to solve problems …
Jan Beke: Oh, oh, oh indeed. We are excited to share o3, OpenAI's newest reasoning model. It's a massive leap forward, outperforming the toughest of benchmarks …
Stefan Bauschard: AGI? OpenAI has released a new model that they claim (with independent evaluation) has reached one of the most challenging benchmarks at its best. …

Forums:
r/artificial: One-Minute Daily AI News 12/20/2024
r/technews: OpenAI announces new o3 models | TechCrunch
lobste.rs: OpenAI o3 Breakthrough High Score on ARC-AGI-Pub