A comparison of OpenAI's o3, o4-mini, and GPT-4.1; Aaron Levie says o3 nailed a multi-step financial modeling task; Scale AI CEO says o3 is “a big breakthrough”
Our take on what's powerful, what's practical, and what's still TBD … If you've been following AI news this week …
Every
Related Coverage
- Harvard Will Survive — But will CA public universities survive ChatGPT o3...? Anecdotal Value · Hollis Robbins
- On Jagged AGI: o3, Gemini 2.5, and everything after One Useful Thing · Ethan Mollick
- OpenAI o3 and o4-Mini Are More Impressive Than I Expected The Algorithmic Bridge · Alberto Romero
- o3 Will Use Its Tools For You Don't Worry About the Vase · Zvi Mowshowitz
- OpenAI o3 and o4-mini System Card OpenAI
- Investigating truthfulness in a pre-release o3 model Transluce
- OpenAI's o3 and o4-mini Models Can Now Analyze Images Like a Human MSPoweruser · Abhijay Singh Rawat
- Weekly Tech Recap: OpenAI releases o3 and o4 mini AI models, Samsung's One UI 7 drama continues and more Livemint · Aman Gupta
- OpenAI's new reasoning models see rise in hallucination rates Tech in Asia · Minh Le
- Smarter, but less accurate? ChatGPT's hallucination conundrum The Economic Times
- All the AI news of the week: ChatGPT debuts o3 and o4-mini, Gemini talks to dolphins Mashable · Cecily Mauran
- OpenAI New o3/o4-mini Models Hallucinate More Than Previous Models WinBuzzer · Markus Kasanmascheff
- Why software developers need to watch out for package hallucinations IT Brew · Brianna Monsanto
- Breakthroughs, Concerns in OpenAI's Latest Lineup DeviceSecurity.io · Rashmi Ramesh
- OpenAI details ChatGPT-o3, o4-mini, o4-mini-high usage limits BleepingComputer · Mayank Parmar
- OpenAI's New AI Models o3 and o4-mini Can Now ‘Think With Images’ TechRepublic · Aminu Abdullahi
- “OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company's in-house benchmark for measuring the accuracy of a model's knowledge about people. That's roughly double the hallucination rate of OpenAI's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. …” @aulia@mementomori.social · Aulia Masna
- OpenAI's more advanced reasoning models hallucinate MORE than older models. o4-mini hallucinates 48% of the time per one internal benchmark test. … Hari Stephen Kumar
- I had a great chat with Maxwell Zeff from TechCrunch about how we've been using o3 and o4-mini. I wanted to share three quick thoughts from that conversation: … Kian Katanforoosh
- OpenAI's new reasoning AI models hallucinate more Hacker News
- OpenAI Puzzled as New Models Show Rising Hallucination Rates Slashdot · Msmash
Discussion
- Ethan Mollick (@emollick) on X: The geoguessing power of o3 is a really good sample of its agentic abilities. Between its smart guessing and its ability to zoom into images, do web searches, and read text, the results can be very freaky. I stripped location info from the photo & prompted “geoguess this” [image]
- Deedy (@deedydas) on X: o3 really blew my mind with this one. I gave it an image of a menu of my favorite Chinese place in SF with no title or EXIF data, and it was able to search the web, match menu items, and locate it. 🤯 [image]
- @smokeawayyy on X: o3 uses way too many tables. It's almost unreadable in the app and you can't copy/paste the output using either ‘Copy’ or ‘Select Text’. Seems like an oversight. [image]
- Nathan Lambert (@natolambert) on X: o3's weird hallucinations could indicate they used LLM-as-judge (or other softer verifiers) in high volume, in addition to math/code correctness. This addition lets OpenAI scale RL by making more data available to train on, but it has new downstream problems to solve.
- Aaron Levie (@levie) on X: Here's why the latest reasoning models like OpenAI's o3 are going to make a world of difference for AI agents in the enterprise. This new generation of models offers huge leaps in math, logic, and coding capabilities, which are incredibly important for advanced enterprise … [video]
- Bojan Tunguz (@tunguz) on X: “o3 is AGI” [image]
- Dean W. Ball (@deanwball) on X: I went from never using ChatGPT scheduled tasks to having ~20 with o3. News roundups on arbitrarily niche topics delivered at custom intervals. This alone is worth the money.
- Ben Tossell (@bentossell) on X: nvm, o3 did this without any MCPs needed
- Pablo Arredondo (@tweetatpablo) on X: GPT-4 passed the bar and is great at doc review. o3 is capable of engaging a full litigation record and creating an MSJ outline with granular citations mapped to specific elements of a cause of action. It is getting surreal.
- Matt Shumer (@mattshumer_) on X: Prompt: “o3, from first principles, develop a detailed, insanely-well-researched and well-reasoned prediction for the continuous progress of AI over the next 10 years.” [image]
- Kasey (@kaseyklimes) on X: o3 feels like the first model to cross a threshold where it's so smart that I sometimes don't understand wtf it's talking about
- David Shapiro (@daveshapi) on X: o3 full is legitimately the most exciting innovation in AI to me since... probably ChatGPT itself. I say this as someone who was fine-tuning GPT-2 and GPT-3 before the chatbot era. o3 is a step change of the same magnitude that ChatGPT was in terms of UX and instrumental …
- Ivan Fioravanti (@ivanfioravanti) on X: o3 is simply amazing. I can't imagine the power of o3 pro and anything beyond that.
- Ethan Mollick (@emollick) on X: “o3, make me a movie i can download that involves an otter and an airplane. figure out how to do it with the tools you have.” o3 has no movie capability, so it improvises: it decides to draw each frame and then stitch them together into a GIF to download. This was all first shot [image]
- Daniel Litt (@littmath) on X: Maybe worth stressing, since I think it may have been lost in the somewhat long thread below: I think the latest OpenAI and Google models (o3/o4-mini/Gemini 2.5 Pro) are genuinely useful for some math research tasks.
- Grant Slatton (@grantslatton) on X: a few days ago i was singing claude-code's praises. now, after using o3 for a few days, i find it intolerably dumb. much to consider
- Tim Soret (@timsoret) on X: Ok, o3 is the first LLM actually smarter & more knowledgeable than me on technical topics I deeply master. It's a paradigm change. Other LLMs required heavy guidance if I wanted to go that deep, and often couldn't follow if I went into uncharted territories.
- Kyle Harrison (@kwharrison13) on X: Talking to o3 is like talking to your friend who takes the time to pause before answering any question.
- Greg Mushen (@gregmushen) on X: Holy smokes, o3 is good. It perfectly answered two of my questions that aren't anywhere in the literature, just through pure mechanistic reasoning.
- Ethan Mollick (@emollick) on X: o3 is far more agentic than people realize. Worth playing with a lot more than a typical new model. You can get remarkably complex work out of a single prompt. It just does things. (Of course, that makes checking its work even harder, especially for non-experts.) [image]
- Daniel Tenreiro (@tenreirodaniel) on X: if i tell o3 to use its memory, it makes every response a finance analogy [image]
- Greg Kamradt (@gregkamradt) on X: o3 is the gpt-3.5 > 4 jump we've been waiting for. My guess is that the jump to o3-pro won't feel *smarter*, but it's going to handle way more long-context nuance and complexity. You're duplicating the same person multiple times and giving them longer to think. Same intelligence, …
- Amjad Masad (@amasad) on X: o3 spends way too much time browsing the web
- Alex Kantrowitz (@kantrowitz) on X: Okay, o3 is insanely good. Crushing past tests I've given earlier models.
- Charly Wargnier (@datachaz) on X: I must've tested every multimodal LLM out there. None could geolocate this cloudy photo I took from the Pyrenean valley I live in. @OpenAI o3 did. It read the topography, vegetation, architecture, then found the exact spot among hundreds of valleys in the Pyrenees! Just stunning. [video]
- Aidan McLaughlin (@aidan_mclau) on X: really good summary of o3's strengths [image]
- Dan Shipper (@danshipper) on X: one of my favorite o3 use cases: mini courses! it can design a course and then use its “reminders” tool to give you a new lesson every day. i had it make me a mini ML course, and i love using it every day. it's a great example of why tools are such a powerful way to help you [image]
- Nabeel S. Qureshi (@nabeelqu) on X: getting o3 to do the most menial and dumb things for you feels a bit like having Einstein as your butler
- Dan Shipper (@danshipper) on X: o3 is clearly smarter than me on any given question, but i'm still smarter at knowing when to ask a question and what question to ask
- Rohit (@krishnanrohit) on X: one annoying thing about o3 is the new personality, where it constantly tries to act like fratboy einstein. it's grating…
- Dan Mac (@daniel_mac8) on X: the era of giving an LLM a “prompt” and getting a “response” is over. today's LLMs are NOT chatbots!!! to get the most out of o3, give it: 1. a goal 2. success criteria. in return, what you get back is ***cognitive work***: cognitive work towards the stated goal, verified [image]
- Emil Kirkegaard (@kirkegaardemil) on X: AIs are making rapid progress on figure reasoning tests too. o3 scores about 116 IQ; about a year ago, all AIs were below 90. [image]
- Spencer Schiff (@spencerkschiff) on X: It looks like o3 can reason across long context better than any other model, including 2.5 Pro! I went over one of this benchmark's example questions a few weeks ago and it seems to be testing for the real deal, not just basic recall. [image]
- @eigenrobot on X: where LLMs will first make contributions to natural science, by o3 >tl;dr early wins where nature is lego-like, data are fat, and the feedback loop can be automated. clock starts... well, it already did. [image]
- Rohit (@krishnanrohit) on X: Strikes me that o3, if released last year or the year before, would've been seen as extraordinarily dangerous in almost every way
- @seconds_0 on X: I have invented a new insane visual eval for o3 [image]
- Aadit Sheth (@aaditsh) on X: This guy literally built a 3D game with o3 in 10 minutes (no coding needed) [video]
- Miles Brundage (@miles_brundage) on X: o3 is crazy and will take some time to adapt to at individual, organizational, and societal levels. Fortunately we have that time. 3 is the highest number
- Rohit (@krishnanrohit) on X: This is an excellent way to use o3. Hadn't realised it does tasks as well, which makes a lot more interesting stuff possible. [image]
- Packy McCormick (@packym) on X: o3 answers the question I ask all new models: “You have consumed more information than anyone in the history of the world and you've demonstrated an extraordinary ability to make connections among them. What are the most important non-consensus or even not-yet-hypothesized things …
- Daniel Litt (@littmath) on X: In this thread I'll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks.
- Scott McGrath (@smcgrath.phd) on Bluesky: OpenAI's new “reasoning” models (o3 and o4-mini) actually hallucinate MORE than their predecessors. OpenAI's internal tests show o3 hallucinated on 33% of person-related questions, double the rate of previous models. Even worse, o4-mini hit 48%.
- Natsuki (@twtzero_) on X: alright party's over [image]
- @transluceai on X: We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵 (1/) https://x.com/... [image]
- Ryan Lowe (@ryan_t_lowe) on X: o3 seems to hallucinate >2x more than o1, according to the system card. so hallucinations could scale *inversely* with increased reasoning (unlike with increased model size), bc outcome-based optimization incentivizes confident guessing (the Transluce example is kinda hilarious) [image]
- @modestproposal1 on X: man do some humans need this
- Daniel Litt (@littmath) on X: First impressions of o3/o4-mini for math: tool use is really great; *lots* of hallucinations; underlying reasoning is maybe slightly better than o1/o3-mini or Gemini 2.5 Pro, but I'm not confident about this.
- Peter Wildeford (@peterwildeford) on X: Great thread. o3 makes meaningful progress for mathematical applications, excelling at undergrad problems and basic tool use... but still struggles with research-level mathematics, proof construction, and avoiding hallucinations.
- Nathan Lambert (@natolambert) on X: reasoning models are kind of yolo and bringing the fun back to AI. caveat: lots of ways we don't know what happens when they're in the world
- @transluceai on X: These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)
- Aashay Sachdeva (@aashaysachdeva) on X: The o-series now mixes tool usage during training. This is definitely an issue stemming from the complexity of training with a multi-tool setup. The model is now hallucinating in the tools space. The AI safety research also got a lot more interesting: LLMs with tools have the ability to … [image]
- Dr. Angela Rasmussen (@angie_rasmussen) on X: The model has improved and is now capable of making shit up about why it made shit up
- Neil Chowdhury (@chowdhuryneil) on X: Another transcript: o3 confidently claims it executed code and defends its incorrect calculations. First, o3 tells me about its Python sandbox (which it does not have access to!) 🧵 (1/) [image]
- Alexander Doria (@dorialexander) on X: Very insightful early tests of o3 showing that a top frontier “PhD-level” model remains totally unreliable for a wide variety of mundane tasks.
- Ethan Mollick (@emollick) on X: A potential issue with o3 is that it thinks it is using tools even when it does not, leading to some hallucinations where it assumes work that was implied in the reasoning chain was actually done. You should double-check the reasoning trace for complex work to see what it did.
- r/technology on Reddit: OpenAI Puzzled as New Models Show Rising Hallucination Rates
- r/BetterOffline on Reddit: OpenAI's new reasoning AI models hallucinate more | TechCrunch
- r/ArtistHate on Reddit: If this is not a sign that LLMs have peaked, I don't know what it is
- r/Futurism on Reddit: OpenAI Puzzled as New Models Show Rising Hallucination Rates
- r/artificial on Reddit: OpenAI's new reasoning AI models hallucinate more
- r/singularity on Reddit: OpenAI's new reasoning AI models hallucinate more | TechCrunch
- r/OpenAI on Reddit: OpenAI's new reasoning AI models hallucinate more