A comparison of OpenAI's o3, o4-mini, and GPT-4.1; Aaron Levie says o3 nailed a multi-step financial modeling task; Scale AI CEO says o3 is “a big breakthrough”
Our take on what's powerful, what's practical, and what's still TBD … If you've been following AI news this week …
Every
Related Coverage
- Harvard Will Survive — But will CA public universities survive ChatGPT o3...? Anecdotal Value · Hollis Robbins
- On Jagged AGI: o3, Gemini 2.5, and everything after One Useful Thing · Ethan Mollick
- OpenAI o3 and o4-Mini Are More Impressive Than I Expected The Algorithmic Bridge · Alberto Romero
- o3 Will Use Its Tools For You Don't Worry About the Vase · Zvi Mowshowitz
- OpenAI o3 and o4-mini System Card OpenAI
- Investigating truthfulness in a pre-release o3 model Transluce
- OpenAI's o3 and o4-mini Models Can Now Analyze Images Like a Human MSPoweruser · Abhijay Singh Rawat
- Weekly Tech Recap: OpenAI releases o3 and o4 mini AI models, Samsung's One UI 7 drama continues and more Livemint · Aman Gupta
- OpenAI's new reasoning models see rise in hallucination rates Tech in Asia · Minh Le
- Smarter, but less accurate? ChatGPT's hallucination conundrum The Economic Times
- All the AI news of the week: ChatGPT debuts o3 and o4-mini, Gemini talks to dolphins Mashable · Cecily Mauran
- OpenAI New o3/o4-mini Models Hallucinate More Than Previous Models WinBuzzer · Markus Kasanmascheff
- Why software developers need to watch out for package hallucinations IT Brew · Brianna Monsanto
- Breakthroughs, Concerns in OpenAI's Latest Lineup DeviceSecurity.io · Rashmi Ramesh
- OpenAI details ChatGPT-o3, o4-mini, o4-mini-high usage limits BleepingComputer · Mayank Parmar
- OpenAI's New AI Models o3 and o4-mini Can Now ‘Think With Images’ TechRepublic · Aminu Abdullahi
- “OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company's in-house benchmark for measuring the accuracy of a model's knowledge about people. That's roughly double the hallucination rate of OpenAI's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. …” @aulia@mementomori.social · Aulia Masna
- OpenAI's more advanced reasoning models hallucinate MORE than older models. o4-mini hallucinates 48% of the time per one internal benchmark test. … Hari Stephen Kumar
- I had a great chat with Maxwell Zeff from TechCrunch about how we've been using o3 and o4-mini. I wanted to share three quick thoughts from that conversation: … Kian Katanforoosh
- OpenAI's new reasoning AI models hallucinate more Hacker News
- OpenAI Puzzled as New Models Show Rising Hallucination Rates Slashdot · Msmash
Discussion
- Ethan Mollick (@emollick) on X: The geoguessing power of o3 is a really good sample of its agentic abilities. Between its smart guessing and its ability to zoom into images, do web searches, and read text, the results can be very freaky. I stripped location info from the photo & prompted “geoguess this” [image]
- Deedy (@deedydas) on X: o3 really blew my mind with this one. I gave it an image of a menu of my favorite Chinese place in SF with no title or EXIF data, and it was able to search the web, match menu items, and locate it. 🤯 [image]
- @smokeawayyy on X: o3 uses way too many tables. It's almost unreadable in the app and you can't copy/paste the output using either ‘Copy’ or ‘Select Text’. Seems like an oversight. [image]
- Nathan Lambert (@natolambert) on X: o3's weird hallucinations could indicate they used LLM-as-judge (or other softer verifiers) in high volume, in addition to math/code correctness. This addition lets OpenAI scale RL by making more data available to train on, but it has new downstream problems to solve.
- Aaron Levie (@levie) on X: Here's why the latest reasoning models like OpenAI's o3 are going to make a world of difference for AI agents in the enterprise. This new generation of models offers huge leaps in math, logic, and coding capabilities, which are incredibly important for advanced enterprise … [video]
- Bojan Tunguz (@tunguz) on X: “o3 is AGI” [image]
- Dean W. Ball (@deanwball) on X: I went from never using ChatGPT scheduled tasks to having ~20 with o3. News roundups on arbitrarily niche topics delivered at custom intervals. This alone is worth the money.
- Ben Tossell (@bentossell) on X: nvm, o3 did this without any MCPs needed
- Pablo Arredondo (@tweetatpablo) on X: GPT-4 passed the bar and is great at doc review. o3 is capable of engaging a full litigation record and creating an MSJ outline with granular citations mapped to specific elements of a cause of action. It is getting surreal.
- Matt Shumer (@mattshumer_) on X: Prompt: “o3, from first principles, develop a detailed, insanely-well-researched and well-reasoned prediction for the continuous progress of AI over the next 10 years.” [image]
- Kasey (@kaseyklimes) on X: o3 feels like the first model to cross a threshold where it's so smart that I sometimes don't understand wtf it's talking about
- David Shapiro (@daveshapi) on X: o3 full is legitimately the most exciting innovation in AI to me since... probably ChatGPT itself. I say this as someone who was fine-tuning GPT-2 and GPT-3 before the chatbot era. o3 is a step change of the same magnitude that ChatGPT was in terms of UX and instrumental …
- Ivan Fioravanti (@ivanfioravanti) on X: o3 is simply amazing. I can't imagine the power of o3 pro and anything beyond that.
- Ethan Mollick (@emollick) on X: “o3, make me a movie i can download that involves an otter and an airplane. figure out how to do it with the tools you have.” o3 has no movie capability, so it improvises: it decides to draw each frame and then stitch them together into a GIF to download. This was all first shot [image]
- Daniel Litt (@littmath) on X: Maybe worth stressing, since I think it may have been lost in the somewhat long thread below: I think the latest OpenAI and Google models (o3/o4-mini/Gemini 2.5 Pro) are genuinely useful for some math research tasks.
- Grant Slatton (@grantslatton) on X: a few days ago i was singing claude-code's praises. now, after using o3 for a few days, i find it intolerably dumb. much to consider
- Tim Soret (@timsoret) on X: Ok, o3 is the first LLM actually smarter & more knowledgeable than me on technical topics I deeply master. It's a paradigm change. Other LLMs required heavy guidance if I wanted to go that deep, and often couldn't follow if I went into uncharted territories.
- Kyle Harrison (@kwharrison13) on X: Talking to o3 is like talking to your friend who takes the time to pause before answering any question.
- Greg Mushen (@gregmushen) on X: Holy smokes, o3 is good. It perfectly answered two of my questions that aren't anywhere in the literature, just through pure mechanistic reasoning.
- Ethan Mollick (@emollick) on X: o3 is far more agentic than people realize. Worth playing with a lot more than a typical new model. You can get remarkably complex work out of a single prompt. It just does things. (Of course, that makes checking its work even harder, especially for non-experts.) [image]
- Daniel Tenreiro (@tenreirodaniel) on X: if i tell o3 to use its memory, it makes every response a finance analogy [image]
- Greg Kamradt (@gregkamradt) on X: o3 is the gpt-3.5 > 4 jump we've been waiting for. My guess is that the jump to o3-pro won't feel *smarter*, but it's going to handle way more long-context nuance and complexity. You're duplicating the same person multiple times and giving them longer to think. Same intelligence, …
- Amjad Masad (@amasad) on X: o3 spends way too much time browsing the web
- Alex Kantrowitz (@kantrowitz) on X: Okay, o3 is insanely good. Crushing past tests I've given earlier models.
- Charly Wargnier (@datachaz) on X: I must've tested every multimodal LLM out there. None could geolocate this cloudy photo I took from the Pyrenean valley I live in. @OpenAI o3 did. It read the topography, vegetation, architecture, then found the exact spot among hundreds of valleys in the Pyrenees! Just stunning. [video]
- Aidan McLaughlin (@aidan_mclau) on X: really good summary of o3's strengths [image]
- Dan Shipper (@danshipper) on X: one of my favorite o3 use cases: mini courses! it can design a course and then use its “reminders” tool to give you a new lesson every day. i had it make me a mini ML course, and i love using it every day. it's a great example of why tools are such a powerful way to help you [image]
- Nabeel S. Qureshi (@nabeelqu) on X: getting o3 to do the most menial and dumb things for you feels a bit like having Einstein as your butler
- Dan Shipper (@danshipper) on X: o3 is clearly smarter than me on any given question, but i'm still smarter at knowing when to ask a question and what question to ask
- Rohit (@krishnanrohit) on X: one annoying thing about o3 is the new personality, where it constantly tries to act like fratboy einstein. it's grating…
- Dan Mac (@daniel_mac8) on X: the era of giving an LLM a “prompt” and getting a “response” is over. today's LLMs are NOT chatbots!!! to get the most out of o3, give it: 1. a goal 2. success criteria. in return, what you get back is ***cognitive work***: cognitive work towards the stated goal, verified [image]
- Emil Kirkegaard (@kirkegaardemil) on X: AIs are making rapid progress on figure reasoning tests too. o3 scores about 116 IQ; about a year ago, all AIs were below 90. [image]
- Spencer Schiff (@spencerkschiff) on X: It looks like o3 can reason across long context better than any other model, including 2.5 Pro! I went over one of this benchmark's example questions a few weeks ago and it seems to be testing for the real deal, not just basic recall. [image]
- @eigenrobot on X: where LLMs will first make contributions to natural science, by o3 >tl;dr early wins where nature is lego-like, data are fat, and the feedback loop can be automated. clock starts... well, it already did. [image]
- Rohit (@krishnanrohit) on X: Strikes me that o3, if released last year or the year before, would've been seen as extraordinarily dangerous in almost every way
- @seconds_0 on X: I have invented a new insane visual eval for o3 [image]
- Aadit Sheth (@aaditsh) on X: This guy literally built a 3D game with o3 in 10 minutes (no coding needed) [video]
- Miles Brundage (@miles_brundage) on X: o3 is crazy and will take some time to adapt to at individual, organizational, and societal levels. Fortunately we have that time. 3 is the highest number
- Rohit (@krishnanrohit) on X: This is an excellent way to use o3. Hadn't realised it does tasks as well, which makes a lot more interesting stuff possible. [image]
- Packy McCormick (@packym) on X: o3 answers the question I ask all new models: “You have consumed more information than anyone in the history of the world and you've demonstrated an extraordinary ability to make connections among them. What are the most important non-consensus or even not-yet-hypothesized things …
- Daniel Litt (@littmath) on X: In this thread I'll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks.
- Scott McGrath (@smcgrath.phd) on Bluesky: OpenAI's new “reasoning” models (o3 and o4-mini) actually hallucinate MORE than their predecessors. OpenAI's internal tests show o3 hallucinated on 33% of person-related questions, double the rate of previous models. Even worse, o4-mini hit 48%.
- Natsuki (@twtzero_) on X: alright party's over [image]
- @transluceai on X: We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵 (1/) https://x.com/... [image]
- Ryan Lowe (@ryan_t_lowe) on X: o3 seems to hallucinate >2x more than o1, according to the system card. so hallucinations could scale *inversely* with increased reasoning (unlike with increased model size), bc outcome-based optimization incentivizes confident guessing (the Transluce example is kinda hilarious) [image]
- @modestproposal1 on X: man do some humans need this
- Daniel Litt (@littmath) on X: First impressions of o3/o4-mini for math: tool use is really great; *lots* of hallucinations; underlying reasoning is maybe slightly better than o1/o3-mini or Gemini 2.5 Pro, but I'm not confident about this.
- Peter Wildeford (@peterwildeford) on X: Great thread. o3 makes meaningful progress for mathematical applications, excelling at undergrad problems and basic tool use... but still struggles with research-level mathematics, proof construction, and avoiding hallucinations.
- Nathan Lambert (@natolambert) on X: reasoning models are kind of yolo and bringing the fun back to AI. caveat: lots of ways we don't know what happens when they're in the world
- @transluceai on X: These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. (12/)
- Aashay Sachdeva (@aashaysachdeva) on X: The o-series now mixes tool usage during training. This is definitely an issue stemming from the complexity of training with a multi-tool setup. The model is now hallucinating in the tools space. The AI safety research also got a lot more interesting: LLMs with tools have the ability to … [image]
- Dr. Angela Rasmussen (@angie_rasmussen) on X: The model has improved and is now capable of making shit up about why it made shit up
- Neil Chowdhury (@chowdhuryneil) on X: Another transcript: o3 confidently claims it executed code and defends its incorrect calculations. First, o3 tells me about its Python sandbox (which it does not have access to!) 🧵 (1/) [image]
- Alexander Doria (@dorialexander) on X: Very insightful early tests of o3 showing that a top frontier “PhD-level” model remains totally unreliable for a wide variety of mundane tasks.
- Ethan Mollick (@emollick) on X: A potential issue with o3 is that it thinks it is using tools even when it does not, leading to some hallucinations where it assumes work that was implied in the reasoning chain was actually done. You should double-check the reasoning trace for complex work to see what it did.
- r/technology on Reddit: OpenAI Puzzled as New Models Show Rising Hallucination Rates
- r/BetterOffline on Reddit: OpenAI's new reasoning AI models hallucinate more | TechCrunch
- r/ArtistHate on Reddit: If this is not a sign that LLMs have peaked, I don't know what it is
- r/Futurism on Reddit: OpenAI Puzzled as New Models Show Rising Hallucination Rates
- r/artificial on Reddit: OpenAI's new reasoning AI models hallucinate more
- r/singularity on Reddit: OpenAI's new reasoning AI models hallucinate more | TechCrunch
- r/OpenAI on Reddit: OpenAI's new reasoning AI models hallucinate more