Apple AI researchers say they found no evidence of formal reasoning in language models and that their behavior is better explained by sophisticated pattern matching
Important new study from Apple — A superb new article on LLMs from six AI researchers at Apple who were brave enough …
Marcus on AI · Gary Marcus
Related Coverage
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models arXiv
- Apple's study proves that LLM-based AI models are flawed because they cannot reason AppleInsider · Charles Martin
- Apple researchers find Large Language Models lack robust mathematical reasoning abilities; here's why Business Today · Pranav Dixit
- AI's reasoning ability in mathematics questionable, reveal Apple researchers International Business Times
- IBM Researchers Introduce ACPBench: An AI Benchmark for Evaluating the Reasoning Tasks in the Field of Planning MarkTechPost · Adeeba Alam Ansari
- Apple AI researchers question OpenAI's claims about o1's reasoning capabilities The Decoder · Matthias Bastian
- Researchers question AI's ‘reasoning’ ability as models stumble on math problems with trivial changes TechCrunch · Devin Coldewey
- Researchers from Apple have published a paper showing that what LLMs do is sophisticated pattern matching, not reasoning. — This is a problem for anyone who believes they can build autonomous AI agents on this foundation since it means anytime the “agent” sees a pattern it doesn't recognize, it will fail hilariously or even catastrophically. … @carnage4life@mas.to · Dare Obasanjo
- Apple did the research; LLMs cannot do formal reasoning. Results change by as much as 10% if something as basic as the names change. — https://garymarcus.substack.com/ ... @ShadowJonathan@tech.lgbt
- https://garymarcus.substack.com/ ... this is pretty interesting @max@wetdry.world
- If your system isn't built for logical reasoning then why would you expect it to be able to reason? — God I hate this damn mirror experiment we're living through. — https://garymarcus.substack.com/ ... @afeinman@wandering.shop · Alex Feinman
- “we found no evidence of formal reasoning in language models .... Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” — #technology #MachineLearning — https://garymarcus.substack.com/ ... @yogthos@social.marxist.network
- Wouldn't be a problem if ‘we’ acknowledged and accepted that and—in turn—took that into consideration when deciding where to use it... https://garymarcus.substack.com/ ... @griotspeak@soc.mod-12.com · TJ Usiyan
- As an outsider I was under the impression that this statement was standard knowledge. But, apparently, it was not. — “we found no evidence of formal reasoning in language models .... Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” … @ecosdelfuturo@mstdn.social · Pedro J. Hdez
- Interesting paper done by Mehrdad Farajtabar and his colleagues! — “Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. … Mehrzad Samadi
- “We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead … Peter Cotton
- A team of Apple researchers argue that even today's best-in-class AI systems are far away from being the reasoning engines they appear to be. … Alex Trouteaud
- A recent study has explored whether Large Language Models (LLMs), such as GPT-4 and others, truly reason or merely match patterns when solving mathematical problems. … Hamdi Amroun, PhD
- LLMs don't do formal reasoning Hacker News
- LLMs don't do formal reasoning - and that is a HUGE problem lobste.rs
Discussion
-
@thebrianpenny
Brian Penny
on threads
Great quick read from Gary Marcus pointing to several studies from ML researchers at Apple and Stanford that point out the complete and utter failure of large language models to truly reason. It's all a parlor trick, and people falling for the marketing buzz should rethink their p…
-
@jasongorman@mastodon.cloud
Jason Gorman
on mastodon
What many of us kind of sort of knew is now being borne out by research. LLMs don't reason (and probably never will), and that's a *big* problem. — We've just been dazzled by the statistical complexity of our own natural languages, and fooled by our well-known tendency to proje…
-
@ChateauErin@mastodon.social
Erin
on mastodon
Apparently Apple has published a paper on how LLMs don't do reasoning. This isn't a surprise if you know what LLMs are, but might be helpful to defusing some of the mainstream idiocy going on with them. https://arxiv.org/... Heard about via https://garymarcus.substack.com/ ...
-
@tyrantcbass.bsky.social
@tyrantcbass.bsky.social
on bluesky
it's kind of funny that the only company which can be honest about machine learning is the one that already earns 300 billion dollars every year selling actual physical objects [embedded post]
-
@jamesmunns.com
James Munns
on bluesky
> build stochastic pattern matching machine — > people claim it has actual intelligence — > look inside the machine — > all stochastic pattern matching, no intelligence [embedded post]
-
@guntoucher.bsky.social
@guntoucher.bsky.social
on bluesky
NORM MACDONALD: in a follow up study, researchers also found no evidence of formal reasoning among the top names in technology leadership [embedded post]
-
@mfarajtabar
Mehrdad Farajtabar
on x
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral, and leading closed models, including the [im…
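The construction behind this thread is easy to sketch. Below is a minimal, hypothetical illustration of the GSM-Symbolic idea (the template, names, and numbers are invented here, not taken from the paper): one word problem becomes a template, many surface variants are instantiated, and the gold answer is recomputed per variant, so any accuracy swing across variants points to pattern matching rather than reasoning.

```python
import random

# Invented template in the spirit of GSM-Symbolic: names and numbers are
# placeholders, so only the surface form varies between instances.
NAMES = ["Sophie", "Liam", "Mia", "Noah"]
TEMPLATE = ("{name} picks {x} apples on Monday and {y} more on Tuesday. "
            "How many apples does {name} have in total?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate one surface variant and recompute its gold answer."""
    name = rng.choice(NAMES)
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

rng = random.Random(0)
for question, answer in (make_variant(rng) for _ in range(3)):
    print(question, "->", answer)
```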
-
@davidad
@davidad
on x
When @GaryMarcus and others (including myself) say that LLMs do not “reason,” we mean something quite specific, but it's hard to put one's finger on it, until now. Specifically, Transformers do not generalize algebraic structures out of distribution.
-
@garymarcus
Gary Marcus
on x
Proven correct once again by the new Apple paper
-
@emollick
Ethan Mollick
on x
This is made worse by the fact that people who dislike LLMs pick a definition of reasoning where LLMs obviously fail, and boosters pick a definition where they succeed, and all along we don't actually have good definitions for what reasoning means in humans. Lots of crosstalk.
-
@simonw
Simon Willison
on x
Confession: despite all of the debates about whether or not an LLM can “reason”, I still don't really understand exactly what the term “reasoning” means. So just like with “agents” and “AI” itself, I'm not sure the people engaged in those debates are talking about the same thing.
-
@fchollet
François Chollet
on x
One more piece of evidence to add to the pile. This was an extremely heretical viewpoint in early 2023, and now it is increasingly becoming self-evident conventional wisdom. [image]
-
@camrobjones
Cameron Jones
on x
This looks like great work and the name change stuff definitely suggests some kind of overfitting or contamination. But I wish people would stop speculating about human performance without just running human baselines!
-
@chris_j_paxton
Chris Paxton
on x
Very interesting thread, although I actually am not convinced changing names or adding random clauses wouldn't have a similar effect on human reasoning. Cool benchmark for testing avoidance of contamination and symbolic reasoning.
-
@kanair
Ryota Kanai
on x
@fchollet We are trying to get LLMs to perform basic instruction-following as an essential component for solving reasoning tasks. Even for such simple tasks, they fail when multiple steps are involved. It seems we need to solve simple rule-based operation first. https://x.com/...
-
@garymarcus
Gary Marcus
on x
longer, more in depth discussion of why LLMs are cooked, here, relating this new study to many others: https://open.substack.com/...
-
@ahatamiz1
Ali Hatamizadeh
on x
I applaud the authors for proposing a new benchmark that goes beyond GSM8K. But after reading this work, it seems that most of the identified issues are due to poor associative-recall performance and not reasoning per se. For example, this work introduces GSM-NoOp in which a
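For context, GSM-NoOp augments a problem with a clause that sounds relevant but cannot affect the answer. Here is a minimal sketch of that construction; the problem text is a paraphrase of the widely quoted kiwi example, not the paper's exact wording.

```python
# Toy GSM-NoOp-style construction: the appended clause is inconsequential,
# so the gold answer (44 + 58 = 102) does not change. A model that
# subtracts the "smaller" kiwis is cueing on surface patterns, not
# reasoning about relevance.
BASE = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
NOOP = (" Five of the kiwis picked on Saturday were a bit smaller "
        "than average.")

def with_noop(problem: str, clause: str) -> str:
    """Append an answer-irrelevant distractor clause to a problem."""
    return problem + clause

print(with_noop(BASE, NOOP))
```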
-
@bilawalsidhu
Bilawal Sidhu
on x
AI narrative violation detected. Severity: high. Apple's research finds no evidence of logical reasoning in SOTA large language models like GPT-4o and o1. Needless to say, @GaryMarcus feeling good today! Popcorn worthy comments section 🍿 [image]
-
@guyvdb
Guy Van den Broeck
on x
@fchollet https://x.com/... I think we were trying to make a very similar point in 2022. Any distribution over reasoning problems that you train on has statistical features that allow you to solve that familiar reasoning problem without solving it robustly.
-
@garymarcus
Gary Marcus
on x
Yes, this is a big result, quite concerning for LLMs and quite vindicating for the core concerns I raised in 1998 and 2001. Check out my Substack today [link below] for a longer discussion of how this new paper by @MFarajtabar's team fits in with a range of recent results from
-
@garymarcus
Gary Marcus
on x
LLMs are cooked. So many investors are going to lose sooo much money. AI will survive, and even thrive, but a new paradigm is needed.
-
@jonmasters
@jonmasters
on x
Great thread coming out of @Apple and I look forward to reading the paper. My personal opinion remains that it is obvious LLMs don't “reason” in the way that humans do. They multiply matrices. They're very good at it, but emergent general intelligence this is not
-
@rao2z
@rao2z
on x
Well, it was beyond heretical, bordering on crackpot, in Summer 2022 when we released “LLMs Still Can't Plan”… @karthikv792 still remembers standing next to the *only* skeptical poster amidst an ocean of FMDM'22 “CAN DO” posters. Too bad Science is not decided in the long run [i…
-
@nouhadziri
Nouha Dziri
on x
Very cool work! This is a reminder again that scoring high on a math benchmark does not necessarily mean that LMs can truly “reason” or “think”. It validates the “faith and fate” observations for skeptics 🙂 You may think of pattern-matching as one type of reasoning which can indeed 👇
-
@natolambert
Nathan Lambert
on x
Great example of current shortcomings in llms. Short term, tool use will be used for any math sensitive application. Long term, I bet we can bake this into the models, 100% accuracy.
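A sketch of the short-term mitigation Lambert describes: have the model emit a bare arithmetic expression and delegate the computation to a deterministic evaluator. The snippet below is an illustration, not any particular library's API; the LLM call that would extract the expression is deliberately omitted.

```python
import ast
import operator as op

# Whitelisted binary operators; anything else is rejected.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str) -> float:
    """Deterministically evaluate a plain arithmetic expression."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

# In a real pipeline the model would be prompted to extract the expression;
# here we hand it one directly.
print(safe_eval("44 + 58"))  # 102
```

The design point is that the tool, not the sampler, owns the arithmetic, so name changes and distractor clauses can no longer corrupt the final number once the right expression is extracted.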
-
@mfarajtabar
Mehrdad Farajtabar
on x
13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral, and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated
-
@ayazdanb
Amir Yazdanbakhsh
on x
@MFarajtabar Great work Mehrdad! You may find our work relevant. We studied a range of symbolic alteration of prompts and their impacts on model performance: https://openreview.net/...
-
@garymarcus
Gary Marcus
on x
👇Superb new article from @apple AI: “we found no evidence of formal reasoning in language models. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” 𝗧𝗵𝗲𝗿𝗲 𝗶𝘀 𝗷𝘂𝘀𝘁 𝗻𝗼 𝘄𝗮𝘆