OpenAI unveils o3 and o3-mini, trained to “think” before responding via what OpenAI calls a “private chain of thought”, and plans to launch them in early 2025
OpenAI announced its new o3 models on Friday. — In a tweet ahead of its final livestream for its …
TechCrunch Maxwell Zeff
Related Coverage
- OpenAI o3 and o3-mini—12 Days of OpenAI: Day 12 OpenAI on YouTube
- The OpenAI Christmas special: o3 is totally an intelligence, guys! For extremely specific values of ‘intelligent’ Pivot to AI
- Early access for safety testing OpenAI
- OpenAI's o3 model shows major gains through reinforcement learning scaling The Decoder · Matthias Bastian
- OpenAI teases its ‘breakthrough’ next-generation o3 reasoning model Digital Trends · Andrew Tarantola
- OpenAI announces o3 and o3-mini, its next simulated reasoning models Ars Technica · Benj Edwards
- OpenAI Skips o2 and Debuts New o3 ‘Reasoning’ Model Gizmodo · Thomas Maxwell
- OpenAI unveils o3, its next ‘reasoning’ model Quartz · Britney Nguyen
- OpenAI teases new reasoning model—but don't expect to try it soon The Verge · Kylie Robison
- OpenAI is done with Shipmas and staring down daunting challenges for 2025 CNBC · Hayden Field
- OpenAI continues to redefine the industry when it comes to AI. Its new “o3” reasoning model, which it now claims is the future compared to LLMs, scored 87.5% on the ARC-AGI benchmark, which tests if an AI has achieved artificial general intelligence. … @carnage4life@mas.to · Dare Obasanjo
- With the preview announcement by OpenAI of their o3 model class, we can reflect on 2024 as a year of extraordinary progress in AI. … Richard Searle
- I won't join the breathless speculation about whether this is or is not AGI - I'm old enough to remember when the AI goalposts would move every … Mike Lanzetta
- Big news in AI this week: OpenAI just announced their latest model, o3, and it's a game-changer. — The o3 model is designed to solve problems … Costa Kladianos
- Oh, oh, oh indeed. We are excited to share o3, OpenAI's newest reasoning model. It's a massive leap forward, outperforming the toughest of benchmarks … Jan Beke
- AGI? OpenAI has released a new model that they claim (and independent evaluation was done) reached one of the most challenging benchmarks at its best. … Stefan Bauschard
- o3, AGI, the art of the demo, and what you can expect in 2025 Marcus on AI · Gary Marcus
- OpenAI soft-launches AGI with o3 models, Enters Next Phase of AI Analytics India Magazine · Aditi Suresh
- Knowledge nugget of the day: Jetson Orin Nano Super The Indian Express · Khushboo Kumari
- OpenAI o3 Model Is a Message From the Future: Update All You Think You Know About AI The Algorithmic Bridge · Alberto Romero
- OpenAI's o3: The grand finale of AI in 2024 Interconnects · Nathan Lambert
- OpenAI's o3 model aced a test of AI reasoning - but it's still not AGI New Scientist · Jeremy Hsu
- OpenAI o3 breakthrough high score on ARC-AGI-PUB. François Chollet is the co-founder … Simon Willison's Weblog · Simon Willison
- CES 2025: Now With More AI — This is “The AI Economy,” a weekly LinkedIn-first newsletter … Ken Yeung
- Sam Altman says OpenAI's new o3 ‘reasoning’ models begin the ‘next phase’ of AI. Is this AGI? Fortune · Sharon Goldman
- OpenAI introduces o3 and o3 Mini reasoning models Neowin · Pradeep Viswanathan
- The AI talent wars are just getting started The Verge · Alex Heath
- OpenAI announces o3 and o3 mini reasoning models Mashable · Cecily Mauran
- OpenAI unveils o3, its most advanced reasoning model yet The Decoder · Matthias Bastian
- “OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%. … @remixtures@tldr.nettime.org · Miguel Afonso Caetano
- Last night, OpenAI released its latest model, breaking the barrier of solving new problems not previously in the training data. 👇🏽 … Anette Novak
- 75% on ARC-AGI semi-private dataset is insanely good. — Some key takeaways from this article by Chollet: … John Chong Min Tan
- I have been skeptical of LLM reasoning for some time, and have never been on the hype train. But it is impossible now to not be stunned by the progress of OpenAI's soon to be released o3 model. … Rama Vasudevan
- OpenAI's new o3 model scored a breakthrough 87.5% on the ARC-AGI benchmark for general intelligence. … Sébastien Riopel-Murray
- OpenAI O3 breakthrough high score on ARC-AGI-PUB Hacker News
- OpenAI Preps ‘o3’ Reasoning Model The Information
- OpenAI unveils new o3 model: What is it and how is it different from o1? The Indian Express · Bijin Jose
- Recap of OpenAI Highlights Key Updates in 12-Day “Shipmas” InfoQ · Daniel Dominguez
- OpenAI showcases o3, o3-mini models with better reasoning capabilities and more Moneycontrol
- ChatGPT Maker OpenAI Drops o3 Reasoning Model As o1's Successor: Greg Brockman Calls It A ‘Breakthrough’ Benzinga · Ananya Gairola
- OpenAI Upgrades Its Smartest AI Model With Improved Reasoning Skills Wired · Will Knight
- Here's everything OpenAI announced in the past 12 days Digital Trends · Andrew Tarantola
- 12 Days of OpenAI ends with a new model for the new year TechRadar · Eric Hal Schwartz
- OpenAI teases its most powerful reasoning model named o3 Android Headlines · Arthur Brown
- OpenAI unveils o3 and o3 mini — here's why these ‘reasoning’ models are a giant leap Tom's Guide · Amanda Caswell
- OpenAI unveils ‘o3’ reasoning AI models in test phase Reuters · Jaspreet Singh
Discussion
-
@wertwhile
Joel Wertheimer
on bluesky
OpenAI's new o3 model is apparently a big step forward (though still very expensive). It is a bit funny that one of the things AI will do is just replace a lot of software engineering. “Learn to code” maybe had it exactly backwards in terms of jobs of the future. techcrunch.com…
-
@benedictevans
Benedict Evans
on threads
It is fascinating to watch AI labs redefine ‘AGI’ as ‘a good model’ in real time
-
@fchollet
François Chollet
on x
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task […
-
@openai
@openai
on x
Day 12: Early evals for OpenAI o3 (yes, we skipped a number) https://openai.com/...
-
@polynoamial
Noam Brown
on x
We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue. [image]
-
@mikeknoop
Mike Knoop
on x
o3 is really special and everyone will need to update their intuition about what AI can/cannot do. while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI semiprivate v1 scores: * GPT-2 (2019): 0% * GPT-3 (2020): 0% * GPT-4
-
@levie
Aaron Levie
on x
OpenAI just announced o3, their new reasoning model that appears to perform insanely well across benchmarks. There are simply no signs of a slow down in AI right now. [image]
-
@deedydas
Deedy
on x
OpenAI o3 is 2727 on Codeforces which is equivalent to the #175 best human competitive coder on the planet. This is an absolutely superhuman result for AI and technology at large. [image]
-
@fchollet
François Chollet
on x
So, is this AGI? While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI — there's still a fair number of very easy ARC-AGI-1 tasks that o3 can't solve, and we have early indications that ARC-AGI-2 will remain ext…
-
@gdb
Greg Brockman
on x
o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.
-
@balajis
Balaji
on x
The Frontier Math benchmark is challenging for Fields Medalists. The state of the art stood at 2% just 2 months ago. It has now been shattered by o3. [image]
-
@fchollet
François Chollet
on x
One very important thing to understand about the future: the economics of AI are about to change completely. We'll soon be in a world where you can turn test-time compute into competence — for the first time in the history of software, marginal cost will become critical.
-
@emostaque
Emad
on x
Any work that can be done on the other side of a computer screen, AI will be able to do at a fraction of the price It's not even about creativity and coming up with recipes like a chef or novel code like a distinguished engineer AI will follow guides better, superior cook 🧑🍳
-
@wgussml
William
on x
o3 enables OpenAI to release a gpt4.5 trained on synthetic data from o1/3 that outperforms o1 without undercutting itself except o3-mini performs at cost/speed par with gpt-4o I've said this the whole time: gpt-5 is the friends we've made along the way
-
@iterintellectus
Vittorio
on x
serious question what should a CS student (or any knowledge worker for that matter) do at this point? even if the model is $2000/month, it's still cheaper than a graduate employee what's the plan now? [image]
-
@emostaque
Emad
on x
My take on o3: the global economy is cooked, we need a new economic and societal framework.
-
@dylan522p
Dylan Patel
on x
“At first you go reasoning slowly, then all at once.” - Noam Browningway
-
@byrnehobart
Byrne Hobart
on x
Pretty good odds in my opinion that come Monday morning, a few people are going to realize that they took a comically expensive early vacation. Maybe I'm wrong, but this seems like a straightforward victory for team trendlines-on-log-graphs.
-
@levie
Aaron Levie
on x
The step function leap in capability for OpenAI's o3 model looks insane. And importantly, the cost per task will inevitably go down precipitously with hardware and model improvements over time. [image]
-
@sama
Sam Altman
on x
seemingly somewhat lost in the noise of today: on many coding tasks, o3-mini will outperform o1 at a massive cost reduction! i expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be really strange.
-
@dylan522p
Dylan Patel
on x
Motherfuckers were market buying Nvidia stock cause OpenAI O3 is so fucking good
-
@andrewcurran_
Andrew Curran
on x
The training time between o1 and o3 was only three months, which means o4 is on track for March 2025. Look at the jump between these two generations - if that kind of progress continues every three months, it will be hard to keep calling this a slow takeoff.
-
@abacaj
Anton
on x
open source models are kind of cooked, if it takes this much compute to get the right answer for complex questions there's no shot you can run that “locally”
-
@byrnehobart
Byrne Hobart
on x
Take the Nvidia bear case where OpenAI does this entirely on their own silicon. You still have to assume that Jensen has already gotten phone calls from people collectively representing $100bn+ in annual capex.
-
@byrnehobart
Byrne Hobart
on x
Of all the weird things this market cycle, Nvidia trading at roughly the same price it was when the OpenAI announcement hit has to be the weirdest.
-
@swyx
@swyx
on x
I haven't seen enough people draw lines on the o3-mini chart. OpenAI has found a SECOND scaling law and it is roughly 3x steeper than the o-full models what the eff folks [image]
-
@sherwinwu
Sherwin Wu
on x
Lots of buzz around the o3 ARC-AGI result, but the AIME / Codeforces results are a lot more meaningful to me personally. As someone who spent basically all of middle/high school tryharding at competition math - seeing o3 blow past my best showings is... humbling to say the least …
-
@amasad
Amjad Masad
on x
Based on benchmarks, OpenAI's o3 seems like a genuine breakthrough in AI. Maybe a start of a new paradigm. But what new is also old: under the hood it might be Alpha-zero-style search and evaluate. The author of ARC-AGI benchmark @fchollet speculates on how it works: [image]
-
@amasad
Amjad Masad
on x
I bet Google will fast follow o3 because Demis talked about “AlphaZero-mechanism on top of LLMs” back in February.
-
@alkalinesec
@alkalinesec
on x
o1 appears to easily identify the issue in this code. others have also noted it can correctly determine the input to crash crackaddr. i have also made small modifications of this code and crackaddr to try to trip it up. it still gets it right. [image]
-
@brian_armstrong
Brian Armstrong
on x
Mad respect for OpenAI's progress, but if this version naming is some sort of IQ test, I am stumped on what comes next in the sequence.... 4o -> o1 -> o3 -> ?
-
@kevinroose
Kevin Roose
on x
this is (I think?) a joke, but man it is hard to impress upon non-SF, non-tech people how much the vibe has shifted here and how short timelines are, even people who used to be skeptical of AI progress twice this month I've been asked at a party, “are you feeling the AGI?”
-
@venturetwins
Justine Moore
on x
“They're calling 12 Days of Shipmas a flop? Bring out the AGI.” [image]
-
@amir
Amir Efrati
on x
I think OpenAI's reason for naming it o3 means it can no longer be considered a startup 😀
-
@creatine_cycle
Atlas
on x
the cycle is as follows: >new OpenAI Release >CS majors put on suicide watch and welding courses are saturated >wait this thing is woke >agi definition changes >deloitte headcount grows by 3% >new OpenAI Release
-
@garrytan
Garry Tan
on x
Codegen of this quality in the hands of everyone with a computer is a revelation
-
@avischiffmann
Avi
on x
The only reason OAI doesn't call o3 AGI is because they must want to continue their Microsoft partnership
-
@norabelrose
Nora Belrose
on x
If OpenAI's new o3 model is “successfully aligned,” then it could probably be trusted to supervise more powerful models, allowing us to bootstrap to benevolent superintelligence.
-
@stevenheidel
Steven Heidel
on x
it's beginning to look a lot like AGI🎄
-
@dbreunig
Drew Breunig
on x
What can we take-away from Dec's LLM blitz and o3's arrival? 1. The best models will think longer 2. We're gonna need more reasoning training data 3. There will be an increased focus on inference in 25 4. Builders need to stay flexible; a cheaper or better model arrives tomorrow
-
@kevinweil
Kevin Weil
on x
Day 12: ✨ o3 ✨ o3 is bonkers good, a massive step up from o1 on every one of our hardest benchmarks. It's not ready yet (we're starting safety testing, including with external red teamers) but I'm excited for all of the new products it'll enable when it launches.
-
@krishnanrohit
Rohit
on x
The continuing dominance of AI capabilities are going to further exacerbate the bimodal distribution of outcomes. Capital-havers will dominate as price of labour is lowered. Self-starters will dominate as prize from experimentation increases. Middle gets squeezed.
-
@aravsrinivas
Aravind Srinivas
on x
The biggest winner of the o3 announcement [image]
-
@dbreunig
Drew Breunig
on x
A big question for me is: with o3-class teacher models, how good can we make lightweight models?
-
@benjedwards
Benj Edwards
on x
I think we should call o1-style AI models “simulated reasoning” or “SR models,” since they don't reason like humans, but they do simulate a type of artificial reasoning process that can produce useful results https://arstechnica.com/...
-
@scobleizer
Robert Scoble
on x
OpenAI sets a high bar for 2025. Grok will exceed it. I studied @DickFosbury1. He told me he started worst on his high school track team at the sport of high jumping. His name is now on the sport because he came up with a new way to jump.
-
@sullyomarr
@sullyomarr
on x
Btw I don't say this lightly but SWE in the traditional sense is dead in < 2 years You will still need smart, capable engineers But anything that involves raw coding and no taste is done for o6 will build you basically anything
-
@_aidan_clark_
Aidan Clark
on x
@ren_hongyu killed it To recap the demo (I'm still sweating), o3-mini wrote its own ChatGPT UI to talk to *itself* via the OpenAI API, we asked o3-mini to write and execute a script in this UI to evaluate *itself* on GPQA, and the resulting script correctly returned 61%. [image]
-
@dorialexander
Alexander Doria
on x
Good show overall. OpenAI does frontier, Google industrial/product scaling and this will trickle down on the entire ecosystem. 2025 looks fine.
-
@inafried
Ina Fried
on x
Here's our story on o3 — the new @OpenAI model that insiders and outsiders are citing as proof that generative AI has not hit a scaling wall. https://www.axios.com/...
-
@garymarcus
Gary Marcus
on x
The fan boys who are declaring victory now clearly never went to graduate school, where you learn to pick apart a bunch of graphs and ask hard questions. Like, what does the top left graph here tell us? [image]
-
@krishnanrohit
Rohit
on x
I want a job at openai now purely to play with o3
-
@garymarcus
Gary Marcus
on x
Excuse me but today was not the mic-drop AGI moment. That moment *will* come. Not soon.
-
@garymarcus
Gary Marcus
on x
Exactly. What we saw today was GPT-4.15, not GPT-5. People were promising me AGI this morning; what I saw ain't that.
-
@miles_brundage
Miles Brundage
on x
Does this mean humans will be able to add 0 value in these areas? Not necessarily - knowing the problem to solve requires insight/data from other domains, and it may be like chess/Go where there's a centaur period where humans can *occasionally* help even if weaker head-to-head.
-
@garymarcus
Gary Marcus
on x
Nice try, but this is goal post shifting. When I proposed the “wall” I didn't talk about ARC. I talked about hallucinations, compositionality, boneheaded errors, etc. Today's demo didn't address any of those and was very light on applications outside of benchmarks in closed
-
@dmdohan
David Dohan
on x
imo the improvements on FrontierMath are even more impressive than ARC-AGI. Jump from 2% to 25% Terence Tao said the dataset should “resist AIs for several years at least” and “These are extremely challenging. I think that in the near term basically the only way to solve them,
-
@krishnanrohit
Rohit
on x
[image]
-
@vikhyatk
Vik
on x
openai spent more money to run an eval on arc-agi than most people spend on a full training run
-
@krishnanrohit
Rohit
on x
Good work OpenAI on dropping a banger on Day 12. Still got the fight!
-
@krishnanrohit
Rohit
on x
Calling time that open source models will soon enough get to this benchmark too, they know the path now
-
@garymarcus
Gary Marcus
on x
From what I can tell from the o3 announcement, all three of these predictions will prove to be correct. The ARC results were particularly impressive. But note that there were no demos around everyday reasoning and that the strongest results were around math problems and coding,
-
@qivshi1
@qivshi1
on x
O3 will be cheap if we just have the will to build the servers...
-
@aj_kourabi
@aj_kourabi
on x
o3 does 25.2% on Frontier Math. Previous models barely got 2%. Here are some sample questions. It is a hard eval (and unpublished). Progress is not slowing down. [image]
-
@jtlin
Justin Lin
on x
Wow, o3 is quite the “holiday update” from @OpenAI. Like @Tesla FSD v13, can't wait to try. But sounds like it might be a while before general availability for both. 😞
-
@sebastienbubeck
Sebastien Bubeck
on x
o3 and o3-mini are my favorite models ever. o3 essentially solves AIME (>90%), GPQA (~90%), ARC-AGI (~90%), and it gets 1/4th of the Frontier Maths. To understand how insane 25% on Frontier Maths is, see this quote by Tim Gowers. The sparks are intensifying ... [image]
-
@itsallentropy
Essam
on x
seems like o3 can solve novel visual problems now, pretty cooooool.
-
@aj_kourabi
@aj_kourabi
on x
o3 results robustly showcase how the piece I helped write with semianalysis is right on so many of the critical topics, take a look if you have not already https://semianalysis.com/...
-
@ryanmorrisonjer
Ryan Morrison
on x
o3 will develop and train o4. This is the start of the takeover.
-
@vikhyatk
Vik
on x
have to say i like this shift from “intelligence too cheap to meter” [image]
-
@modestproposal1
@modestproposal1
on x
can beat the machines for now 🦾
-
@0xkarmatic
Karma
on x
Resurfacing this for you people after looking at the o3 evals results
-
@sporadicalia
@sporadicalia
on x
December 5th: o1 released December 20th: o1 made obsolete by o3 is this what acceleration feels like?
-
@krishnanrohit
Rohit
on x
Google, please drop Gemini Experimental 1206 Thinking model, make this a race!
-
@yuris
Yuri Sagalov
on x
“OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs.” @fchollet [imag…
-
@benbajarin
Ben Bajarin
on x
Is the thinking, at the moment, that these long reasoning models are primarily an enterprise thing? Because if we are going to be paying too much for agent work, consumers are not going to use them.
-
@sahir2k
Saint
on x
yo @sama gimme o3 access , i'm a safety researcher ( i'm interested in my job safety)
-
@silasalberti
Silas Alberti
on x
o3 scores a 2727 ELO on Codeforces which places it 175th in the global ranking. That's better than ~99.9% of humans on the website (who already tend to be far above average). [image]
-
@altryne
Alex Volkov
on x
o3-mini will support everything that o1 supports, function calling, structured outputs while being significantly cheaper! [image]
-
@krishnanrohit
Rohit
on x
Holy shit they did it! Never been more vindicated that AI has not hit a wall ... [image]
-
@apples_jimmy
@apples_jimmy
on x
Openai has the Mandate of Heaven.
-
@sparkycollier
@sparkycollier
on x
paging @sundarpichai please report to the lab
-
@_fabknowledge_
@_fabknowledge_
on x
Jesus the price
-
@mckaywrigley
Mckay Wrigley
on x
There is absolutely no situation in which you will outcompete someone who is using o3 and you are not. This clearly seems like the model that will begin to actually spark a real AGI debate. Based on the numbers they're showing today? Not sure I'd argue against it.
-
@rowancheung
Rowan Cheung
on x
NEW: OpenAI just announced ‘o3’, a breakthrough AI model that significantly surpasses all previous models in benchmarks. —On ARC-AGI: o3 more than triples o1's score on low compute and surpasses a score of 87% —On EpochAI's Frontier Math: o3 set a new record, solving 25.2% of [im…
-
@miles_brundage
Miles Brundage
on x
I've been saying recently that completely superhuman AI math and coding by end of 2025 was plausible - 50/50 or so. Now I'd say it's much more likely than not (o3 is already better than almost all humans).
-
@dmdohan
David Dohan
on x
o3 @ 87.5% on ARC-AGI It was 16 hours at an increase rate of 3.5% an hour to “solved” [image]
-
@mckaywrigley
Mckay Wrigley
on x
@samirayubkhan @OpenAI Someone told me they'd be demoing a crazy Arc result (which they did). Got 87% on o3 high reasoning. Said they're going to continue to build improved version (tldr).
-
@bnj
Ben South
on x
o3: 87.5% Humans: 85% AGI confirmed [image]
-
@dorialexander
Alexander Doria
on x
Ah yes, OpenAI is really not messing around for the last day. Wild. [image]
-
@nim_chimpsky_
Nim
on x
o3 can solve 25% of research level mathematics questions designed by experts out of the box It's literally over lmao [image]
-
@justin_halford_
Justin Halford
on x
o3 got 87% on ARC Eval and 71% on SWE Bench (!) [image]
-
@fernandotherojo
Fernando Rojo
on x
OpenAI o3 absolutely crushing on the arc-agi eval. wow sorry @jerber888 there's a new #1 [image]
-
@seconds_0
@seconds_0
on x
Ok fine everything is forgiven @OpenAI o3 is an insane stepchange
-
@bobmcgrewai
Bob McGrew
on x
Congrats to the research team on the o3 and o3-mini announcements! These are great models. And, yes... you've reached a new high for OpenAI's tradition of truly terrible naming. 😂
-
@deedydas
Deedy
on x
Summary of all the CRAZY benchmark results from OpenAI's most advanced model, o3! SWE-Bench Verified: 71.7% Codeforces rating: 2727 Competition Math: 96.7% PhD level science (GPQA): 87.7% Frontier Math: 25.2% (previous best was 2%) ARC-AGI: 87.5% TRULY SUPERHUMAN. [image]
-
@sullyomarr
@sullyomarr
on x
yeah its over for coding with o3 this is mindboggling looks like the first big jump since gpt4, because these numbers make 0 sense [image]
-
@dorialexander
Alexander Doria
on x
Community reaction is predictable. [image]
-
@iscienceluvr
Tanishq Mathew Abraham, Ph.D.
on x
OpenAI also announces o3-mini! There are different thinking levels that can be chosen It's also quite fast, as demonstrated by their demos [image]
-
@arunv30
Arun Vijayvergiya
on x
O3 is truly incredible. Can't wait for people to try it.
-
@garymarcus
Gary Marcus
on x
Three o3 predictions: - People will initially be amazed - Once they dive in, they will see that it is not reliable - It will work best in closed-domains (like math problems) but less reliably in open-ended domains (like everyday reasoning about the real world).
-
@emollick
Ethan Mollick
on x
Independent evaluations of OpenAI's o3 suggest that it passed benchmarks that were previously considered far out of reach for AI, including achieving a score on ARC-AGI that was associated with actually achieving AGI (though the creators of the benchmark don't think o3 is AGI)
-
@adonis_singh
Adi
on x
o3 is a LARGE model
-
@basedbeffjezos
@basedbeffjezos
on x
o3 it is. [image]
-
@karanganesan
Karan Ganesan
on x
it's crazy we now have thinking time as a parameter for llm apis open ai o3, o3 mini soon what a time to be alive
-
@mattshumer_
Matt Shumer
on x
These o3 evals are... absolutely insane
-
@armanddoma
Armand Domalewski
on x
uh, wow. o3 [image]
-
@smokeawayyy
@smokeawayyy
on x
o3 scored 75.7% on ARC-AGI Public Benchmark within compute requirements. o3 scored 87.5% in High compute mode. Human performance is 85%. [image]
-
@kevinweil
Kevin Weil
on x
o3 with a major breakthrough on ARC-AGI 🤯
-
@dorialexander
Alexander Doria
on x
It gets even wilder. [image]
-
@yanatweets
Yana Welinder
on x
o3 has scored over 85% on ARC-AGI 🤯 Human performance is at 85% In other words: AGI has been achieved in 2024 2025 is going to be wild! [image]
-
@natolambert
Nathan Lambert
on x
OpenAI skips o2, previews o3 scores, and they're truly crazy. Huge progress on the few benchmarks we think are truly hard today. Including ARC AGI. Rip to people who say any of “progress is done,” “scale is done,” or “llms cant reason” 2024 was awesome. I love my job. [image]
-
@deedydas
Deedy
on x
OpenAI o3 does 71.7% on SWE-Bench verified!! And 2727 on Codeforces! Unbelievable. [image]
-
@garymarcus
Gary Marcus
on x
Shocker: “we are not going to make these publicly available today”
-
@jacobandreou
Jacob Andreou
on x
ok well o3 is AGI hope everyone had fun [image]
-
@altryne
Alex Volkov
on x
💥 Evals for @openai new frontier o3 are CRAZY 🤯 25.2 on Frontier Math 96.7 on GPQA 71.7 on SWE-bench verified SOTA on @arcprize - an insane 87.5% (on high compute) !! (Above human performance at 85% threshold!) [image]
-
@modestproposal1
@modestproposal1
on x
“OpenAI's new o3 model represents a significant leap forward in AI's ability to adapt to novel tasks. This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs.”
-
@arcprize
@arcprize
on x
New verified ARC-AGI-Pub SoTA! @OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation. And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval. 1/4 [image]
-
@apples_jimmy
@apples_jimmy
on x
Arc-agi BEATEN AT 87.5% !!!
-
@vikhyatk
Vik
on x
looks like we're only getting some evals today? when is o3 rolling out?
-
@altryne
Alex Volkov
on x
@OpenAI @arcprize o3-mini evals - with more thinking time, increasing ELO, outperforming o1 while being much cheaper. [image]
-
@iscienceluvr
Tanishq Mathew Abraham, Ph.D.
on x
Breaking news! OpenAI announces o3! 🔥 Achieves new SOTA on ARC-AGI of 87.5%! Significant jump in the hardest frontier math benchmark (EpochAI) Achieves SOTA on other technical benchmarks like AIME and GPQA-Diamond [image]
-
r/artificial
r
on reddit
One-Minute Daily AI News 12/20/2024
-
@garymarcus
Gary Marcus
on bluesky
How many times do we have to see this same movie, where an AI beats some benchmark and influencers gleefully shout “It's So Over” without even trying out the AI and then on careful inspection the AI turns out to not be robust or reliable? — Thousands? — (It's already been hun…
-
@simonwillison.net
Simon Willison
on bluesky
By far the best coverage of o3 is this essay by François Chollet, it's crammed with interesting insights beyond just reporting on the benchmark score: arcprize.org/blog/oai-o3-... Published my own notes on that here: simonwillison.net/2024/Dec/20/ ...
-
@julianharris
Julian Harris
on bluesky
OpenAI announced o3 that is significantly better than previous systems, according to an independent benchmark org (The Arc Prize) that apparently got access. — Only thing is it's wildly wildly expensive to run. Like its top end system is around $10k per TASK. — arcprize.org/…
-
@zachweinersmith
Zach Weinersmith
on bluesky
This seems like a really big deal? arcprize.org/blog/oai-o3-... Chollet was pretty skeptical this would happen soon, even as of a few months ago. — Skeptics, whatcha got?
-
@stevenheidel.com
Steven Heidel
on bluesky
it's beginning to look a lot like AGI 🎄 arcprize.org/blog/oai-o3-...
-
@luokai
@luokai
on threads
Even though o3 has tackled many challenges that most average humans can't solve, there's still a long road ahead on the journey to AGI. It still messes up on tasks that are pretty simple for humans, revealing a huge gap between where it is and true AGI. …
-
@mergesort
Joe Fabisevich
on threads
Something OpenAI has realized in a way no other foundation model lab has is that your model can be amazing, superhuman even, but if it's not packaged into a product people can use it might as well not exist. ChatGPT supporting Apple Notes is worth far more to people than 100 mor…
-
@moskov
Dustin Moskovitz
on threads
Francois Chollet has long been an LLM skeptic - he seems to be coming around. I wonder if others will follow?
-
@crumbler
Casey Newton
on threads
“All intuition about AI capabilities will need to get updated for o3,” says the founder of the ARC prize — designed to be very difficult for LLMs to solve — after OpenAI took the benchmark from 5% with o1 to 85% today with o3 https://arcprize.org/...
-
@taulogicai
Tau
on x
@GaryMarcus @fchollet Well said. Passing ARC-AGI highlights progress in solving specific challenges, not achieving AGI. True AGI requires reasoning, adaptability, and guaranteed correctness across all tasks, not just isolated benchmarks. The distinction cannot be overstated.
-
@lennysan
Lenny Rachitsky
on x
The best benchmark for tracking progress towards AGI. 2025 is going to be wild.
-
@fchollet
François Chollet
on x
Deep learning did hit that wall, and the natural answer to get past it was deep learning plus search. AI research is about to enter its deep-learning guided program synthesis (or CoT synthesis) arc.
-
@garymarcus
Gary Marcus
on x
Important words from @fchollet: “it is important to note that ARC-AGI is not an acid test for AGI - as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over
-
@tszzl
Roon
on x
“easy for humans, hard for ai” is not a solid design principle for evals imo it leads you towards “judging a fish by how far it can climb a tree” absurdities but maybe it's one orthogonal eval style among many equally important ones
-
@denny_zhou
Denny Zhou
on x
any benchmark—including ARC-AGI—can be rapidly solved, as long as the task provides a clear evaluation metric that can be used as a reward signal during fine-tuning.
-
@eshear
Emmett Shear
on x
If you haven't figured out the joke yet, *any* fixed benchmark will fall rapidly the instant it becomes an optimization target for the top labs. Correspondingly, no fixed benchmark is AGI. AGI is the ability to generalize to an adversarially chosen new benchmark.
-
@garymarcus
Gary Marcus
on x
Starting to feel like most people using the term AGI have lost sight of what the G actually stands for.
-
@burkov
Andriy Burkov
on x
How to achieve AGI in 2024:
1. Define a benchmark with puzzles and call it “The AGI Testing Benchmark.”
2. Fine-tune a VLM to solve these puzzles.
3. Declare the AGI achievement.
-
@sauers_
Sauers
on x
The total compute cost was around $1,600,250, more than the entire prize [image]
-
@geofflewisorg
Geoff Lewis
on x
Think about what becomes *more* valuable in a post-AI world, for we are at its doorstep.
-
@tszzl
Roon
on x
specifically, arc agi visual tasks look like nonsense in JSON format and multi-modality isn't great, and the character manipulation tasks don't work for the same reason models mess up the how many “r”s in strawberry problem (tokenization/BPE)
-
@tszzl
Roon
on x
it's almost adversarially constructed wrt input modalities. for a model to solve these requires a far higher level of intelligence than an equivalent human score would suggest
-
@fchollet
François Chollet
on x
It will also be extremely important to analyze the strengths and limitations of the new system. Here are some examples of tasks that o3 couldn't solve on high-compute settings (even as it was generating millions of CoT search tokens and consuming thousands of dollars of compute) […]
-
@miaai_builder
Mia Bookworm
on x
@Techmeme The o3 model's ability to generate and execute its own programs is a massive leap forward in AI capabilities, I'm loving the potential for novel task adaptation!
-
@fchollet
François Chollet
on x
The limitations of specific techniques are predictable and correspondingly lead to plateaus for those techniques. But there is always the next technique, building on top of the pile that's already available. There is enough research investment that there will be no wall.
-
@miles_brundage
Miles Brundage
on x
I'm old enough to remember when getting double-digit scores on FrontierMath was considered super hard. I'm 6 weeks old
-
@simonw
Simon Willison
on x
Absolutely the most interesting thing I've read so far about o3 is this essay by @fchollet https://arcprize.org/... [image]
-
@dbreunig
Drew Breunig
on x
Creating reasoning data for training - by humans or synthetically - will continue to rise in importance.
-
@fchollet
François Chollet
on x
What does this mean for the future of AGI research? For me, the main open question is where the scaling bottlenecks for the techniques behind o3 are going to be. If human-annotated CoT data is a major bottleneck, for instance, capabilities would start to plateau quickly like […]
-
@fchollet
François Chollet
on x
Two other examples. You can find the full testing data here: https://github.com/... If the topic interests you, try analyzing this data. [image]
-
r/agi
r
on reddit
OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
-
r/mlscaling
r
on reddit
OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
-
r/MachineLearning
r
on reddit
[D] OpenAI o3 87.5% High Score on ARC Prize Challenge
-
r/singularity
r
on reddit
FULL O3 TESTING REPORT
-
@justicar.xyz
Glenn White
on bluesky
“Out of respect” for not getting their asses sued to hell and back. [embedded post]
-
@quinnypig.com
Corey Quinn
on bluesky
And if there's one thing OpenAI is renowned for, it's respecting the intellectual property of others. [embedded post]
-
@aquariusacquah
Ken
on x
as it was written in the ancient texts [image]
-
@fchollet
François Chollet
on x
Cost-efficiency will be the overarching measure guiding deployment decisions. How much are you willing to pay to solve X? The world is once again going to run out of GPUs.
-
@sama
Sam Altman
on x
if you are a safety researcher, please consider applying to help test o3-mini and o3. excited to get these out for general availability soon. extremely proud of all of openai for the work and ingenuity that went into creating these models; they are great.
-
r/slatestarcodex
r
on reddit
OpenAI Unveils More Advanced Reasoning Model in Race With Google