Apple AI researchers say they found no evidence of formal reasoning in language models and that their behavior is better explained by sophisticated pattern matching
Important new study from Apple — A superb new article on LLMs from six AI researchers at Apple who were brave enough …
Marcus on AI · Gary Marcus
Related Coverage
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models arXiv
- Apple's study proves that LLM-based AI models are flawed because they cannot reason AppleInsider · Charles Martin
- Apple researchers find Large Language Models lack robust mathematical reasoning abilities; here's why Business Today · Pranav Dixit
- AI's reasoning ability in mathematics questionable, reveal Apple researchers International Business Times
- IBM Researchers Introduce ACPBench: An AI Benchmark for Evaluating the Reasoning Tasks in the Field of Planning MarkTechPost · Adeeba Alam Ansari
- Apple AI researchers question OpenAI's claims about o1's reasoning capabilities The Decoder · Matthias Bastian
- Researchers question AI's ‘reasoning’ ability as models stumble on math problems with trivial changes TechCrunch · Devin Coldewey
- Researchers from Apple have published a paper showing that what LLMs do is sophisticated pattern matching, not reasoning. — This is a problem for anyone who believes they can build autonomous AI agents on this foundation since it means anytime the “agent” sees a pattern it doesn't recognize, it will fail hilariously or even catastrophically. … @carnage4life@mas.to · Dare Obasanjo
- Apple did the research; LLMs cannot do formal reasoning. Results change by as much as 10% if something as basic as the names change. — https://garymarcus.substack.com/ ... @ShadowJonathan@tech.lgbt
- https://garymarcus.substack.com/ ... this is pretty interesting @max@wetdry.world
- If your system isn't built for logical reasoning then why would you expect it to be able to reason? — God I hate this damn mirror experiment we're living through. — https://garymarcus.substack.com/ ... @afeinman@wandering.shop · Alex Feinman
- “we found no evidence of formal reasoning in language models .... Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” — #technology #MachineLearning — https://garymarcus.substack.com/ ... @yogthos@social.marxist.network
- Wouldn't be a problem if ‘we’ acknowledged and accepted that and—in turn—took that into consideration when deciding where to use it... https://garymarcus.substack.com/ ... @griotspeak@soc.mod-12.com · TJ Usiyan
- As an outsider I was under the impression that this statement was standard knowledge. But, apparently, it was not. — “we found no evidence of formal reasoning in language models .... Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” … @ecosdelfuturo@mstdn.social · Pedro J. Hdez
- Interesting paper done by Mehrdad Farajtabar and his colleagues! — “Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. … Mehrzad Samadi
- “We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead … Peter Cotton
- A team of Apple researchers argue that even today's best-in-class AI systems are far away from being the reasoning engines they appear to be. … Alex Trouteaud
- A recent study has explored whether Large Language Models (LLMs), such as GPT-4 and others, truly reason or merely match patterns when solving mathematical problems. … Hamdi Amroun, PhD
- LLMs don't do formal reasoning Hacker News
- LLMs don't do formal reasoning - and that is a HUGE problem lobste.rs
Discussion
-
@thebrianpenny
Brian Penny
on threads
Great quick read from Gary Marcus pointing to several studies from ML researchers at Apple and Stanford that point out the complete and utter failure of large language models to truly reason. It's all a parlor trick, and people falling for the marketing buzz should rethink their p…
-
@jasongorman@mastodon.cloud
Jason Gorman
on mastodon
What many of us kind of sort of knew is now being borne out by research. LLMs don't reason (and probably never will), and that's a *big* problem. — We've just been dazzled by the statistical complexity of our own natural languages, and fooled by our well-known tendency to proje…
-
@ChateauErin@mastodon.social
Erin
on mastodon
Apparently Apple has published a paper on how LLMs don't do reasoning. This isn't a surprise if you know what LLMs are, but might be helpful to defusing some of the mainstream idiocy going on with them. https://arxiv.org/... Heard about via https://garymarcus.substack.com/ ...
-
@tyrantcbass.bsky.social
@tyrantcbass.bsky.social
on bluesky
it's kind of funny that the only company which can be honest about machine learning is the one that already earns 300 billion dollars every year selling actual physical objects [embedded post]
-
@jamesmunns.com
James Munns
on bluesky
> build stochastic pattern matching machine — > people claim it has actual intelligence — > look inside the machine — > all stochastic pattern matching, no intelligence [embedded post]
-
@guntoucher.bsky.social
@guntoucher.bsky.social
on bluesky
NORM MACDONALD: in a follow up study, researchers also found no evidence of formal reasoning among the top names in technology leadership [embedded post]
-
@mfarajtabar
Mehrdad Farajtabar
on x
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral, and leading closed models, including the [im…
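The construction behind this thread is easy to sketch. Below is a minimal, hypothetical illustration of the GSM-Symbolic idea (the template, names, and numbers are invented here, not taken from the paper): one word problem becomes a template, many surface variants are instantiated, and the gold answer is recomputed per variant, so any accuracy swing across variants points to pattern matching rather than reasoning.

```python
import random

# Invented template in the spirit of GSM-Symbolic: names and numbers are
# placeholders, so only the surface form varies between instances.
NAMES = ["Sophie", "Liam", "Mia", "Noah"]
TEMPLATE = ("{name} picks {x} apples on Monday and {y} more on Tuesday. "
            "How many apples does {name} have in total?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate one surface variant and recompute its gold answer."""
    name = rng.choice(NAMES)
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

rng = random.Random(0)
for question, answer in (make_variant(rng) for _ in range(3)):
    print(question, "->", answer)
```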
-
@davidad
@davidad
on x
When @GaryMarcus and others (including myself) say that LLMs do not “reason,” we mean something quite specific, but it's hard to put one's finger on it, until now. Specifically, Transformers do not generalize algebraic structures out of distribution.
-
@garymarcus
Gary Marcus
on x
Proven correct once again by the new Apple paper
-
@emollick
Ethan Mollick
on x
This is made worse by the fact that people who dislike LLMs pick a definition of reasoning where LLMs obviously fail, and boosters pick a definition where they succeed, and all along we don't actually have good definitions for what reasoning means in humans. Lots of crosstalk.
-
@simonw
Simon Willison
on x
Confession: despite all of the debates about whether or not an LLM can “reason”, I still don't really understand exactly what the term “reasoning” means. So just like with “agents” and “AI” itself, I'm not sure the people engaged in those debates are talking about the same thing.
-
@fchollet
François Chollet
on x
One more piece of evidence to add to the pile. This was an extremely heretical viewpoint in early 2023, and now it is increasingly becoming self-evident conventional wisdom. [image]
-
@camrobjones
Cameron Jones
on x
This looks like great work and the name change stuff definitely suggests some kind of overfitting or contamination. But I wish people would stop speculating about human performance without just running human baselines!
-
@chris_j_paxton
Chris Paxton
on x
Very interesting thread, although I actually am not convinced changing names or adding random clauses wouldn't have a similar effect on human reasoning. Cool benchmark for testing avoidance of contamination and symbolic reasoning.
-
@kanair
Ryota Kanai
on x
@fchollet We are trying to get LLMs to perform basic instruction-following as an essential component for solving reasoning tasks. Even for such simple tasks, they fail when multiple steps are involved. It seems we need to solve simple rule-based operation first. https://x.com/...
-
@garymarcus
Gary Marcus
on x
longer, more in depth discussion of why LLMs are cooked, here, relating this new study to many others: https://open.substack.com/...
-
@ahatamiz1
Ali Hatamizadeh
on x
I applaud the authors for proposing a new benchmark that goes beyond GSM8K. But after reading this work, it seems that most of the identified issues are due to poor associative-recall performance and not reasoning per se. For example, this work introduces GSM-NoOp in which a
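For context, GSM-NoOp augments a problem with a clause that sounds relevant but cannot affect the answer. Here is a minimal sketch of that construction; the problem text is a paraphrase of the widely quoted kiwi example, not the paper's exact wording.

```python
# Toy GSM-NoOp-style construction: the appended clause is inconsequential,
# so the gold answer (44 + 58 = 102) does not change. A model that
# subtracts the "smaller" kiwis is cueing on surface patterns, not
# reasoning about relevance.
BASE = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
NOOP = (" Five of the kiwis picked on Saturday were a bit smaller "
        "than average.")

def with_noop(problem: str, clause: str) -> str:
    """Append an answer-irrelevant distractor clause to a problem."""
    return problem + clause

print(with_noop(BASE, NOOP))
```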
-
@bilawalsidhu
Bilawal Sidhu
on x
AI narrative violation detected. Severity: high. Apple's research finds no evidence of logical reasoning in SOTA large language models like GPT-4o and o1. Needless to say, @GaryMarcus feeling good today! Popcorn worthy comments section 🍿 [image]
-
@guyvdb
Guy Van den Broeck
on x
@fchollet https://x.com/... I think we were trying to make a very similar point in 2022. Any distribution over reasoning problems that you train on has statistical features that allow you to solve that familiar reasoning problem without solving it robustly.
-
@garymarcus
Gary Marcus
on x
Yes, this is a big result, quite concerning for LLMs and quite vindicating for the core concerns I raised in 1998 and 2001. Check out my Substack today [link below] for a longer discussion of how this new paper by @MFarajtabar's team fits in with a range of recent results from
-
@garymarcus
Gary Marcus
on x
LLMs are cooked. So many investors are going to lose sooo much money. AI will survive, and even thrive, but a new paradigm is needed.
-
@jonmasters
@jonmasters
on x
Great thread coming out of @Apple and I look forward to reading the paper. My personal opinion remains that it is obvious LLMs don't “reason” in the way that humans do. They multiply matrices. They're very good at it, but emergent general intelligence this is not
-
@rao2z
@rao2z
on x
Well, it was beyond heretical, bordering on crackpot, in Summer 2022 when we released “LLMs Still Can't Plan”… @karthikv792 still remembers standing next to the *only* skeptical poster amidst an ocean of FMDM'22 “CAN DO” posters. Too bad Science is not decided in the long run [i…
-
@nouhadziri
Nouha Dziri
on x
Very cool work! This is a reminder again that scoring high on a math benchmark does not necessarily mean that LMs can truly “reason” or “think”. It validates the “faith and fate” observations for skeptics 🙂 You may think of pattern-matching as one type of reasoning which can indeed 👇
-
@natolambert
Nathan Lambert
on x
Great example of current shortcomings in llms. Short term, tool use will be used for any math sensitive application. Long term, I bet we can bake this into the models, 100% accuracy.
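A sketch of the short-term mitigation Lambert describes: have the model emit a bare arithmetic expression and delegate the computation to a deterministic evaluator. The snippet below is an illustration, not any particular library's API; the LLM call that would extract the expression is deliberately omitted.

```python
import ast
import operator as op

# Whitelisted binary operators; anything else is rejected.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str) -> float:
    """Deterministically evaluate a plain arithmetic expression."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

# In a real pipeline the model would be prompted to extract the expression;
# here we hand it one directly.
print(safe_eval("44 + 58"))  # 102
```

The design point is that the tool, not the sampler, owns the arithmetic, so name changes and distractor clauses can no longer corrupt the final number once the right expression is extracted.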
-
@mfarajtabar
Mehrdad Farajtabar
on x
13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral, and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated
-
@ayazdanb
Amir Yazdanbakhsh
on x
@MFarajtabar Great work Mehrdad! You may find our work relevant. We studied a range of symbolic alteration of prompts and their impacts on model performance: https://openreview.net/...
-
@garymarcus
Gary Marcus
on x
👇Superb new article from @apple AI: “we found no evidence of formal reasoning in language models. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” 𝗧𝗵𝗲𝗿𝗲 𝗶𝘀 𝗷𝘂𝘀𝘁 𝗻𝗼 𝘄𝗮𝘆