Apple AI researchers say they found no evidence of formal reasoning in language models and their behavior is better explained by sophisticated pattern matching
RE: https://www.threads.net/...

Threads:
Brian Penny / @thebrianpenny: Great quick read from Gary Marcus pointing to several studies from ML researchers at Apple and Stanford pointing out the complete and utter failure of large language models to truly reason. It's all a parlor trick, and people falling for the marketing buzz should rethink their position.

Mastodon:
Chris Espinosa / @Cdespinosa@mastodon.social: Language models are not knowledge models. Which is why language models still confidently explain how to construct a square with the area of a unit circle using a compass and straightedge, a construction proven impossible centuries ago https://arxiv.org/...

Dare Obasanjo / @carnage4life@mas.to: Researchers from Apple have published a paper showing that what LLMs do is sophisticated pattern matching, not reasoning. — This is a problem for anyone who believes they can build autonomous AI agents on this foundation, since it means any time the “agent” sees a pattern it doesn't recognize, it will fail hilariously or even catastrophically.

Jason Gorman / @jasongorman@mastodon.cloud: What many of us kind of sort of knew is now being borne out by research. LLMs don't reason (and probably never will), and that's a *big* problem. — We've just been dazzled by the statistical complexity of our own natural languages, and fooled by our well-known tendency to project agency onto things that *appear* even vaguely human.

Alex Feinman / @afeinman@wandering.shop: If your system isn't built for logical reasoning, then why would you expect it to be able to reason? — God, I hate this damn mirror experiment we're living through. — https://garymarcus.substack.com/ ...

Pedro J. Hdez / @ecosdelfuturo@mstdn.social: As an outsider I was under the impression that this statement was standard knowledge. But, apparently, it was not. — “we found no evidence of formal reasoning in language models .... Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!”

TJ Usiyan / @griotspeak@soc.mod-12.com: Wouldn't be a problem if ‘we’ acknowledged and accepted that and—in turn—took that into consideration when deciding where to use it... https://garymarcus.substack.com/ ...

@max@wetdry.world: https://garymarcus.substack.com/ ... this is pretty interesting

Erin / @ChateauErin@mastodon.social: Apparently Apple has published a paper on how LLMs don't do reasoning. This isn't a surprise if you know what LLMs are, but might be helpful for defusing some of the mainstream idiocy going on with them. https://arxiv.org/... Heard about via https://garymarcus.substack.com/ ...

@yogthos@social.marxist.network: “we found no evidence of formal reasoning in language models .... Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” — #technology #MachineLearning — https://garymarcus.substack.com/ ...

@ShadowJonathan@tech.lgbt: Apple did the research; LLMs cannot do formal reasoning. Results change by as much as 10% if something as basic as the names change. — https://garymarcus.substack.com/ ...

Bluesky:
@tyrantcbass.bsky.social: it's kind of funny that the only company which can be honest about machine learning is the one that already earns 300 billion dollars every year selling actual physical objects [embedded post]

James Munns / @jamesmunns.com: > build stochastic pattern matching machine — > people claim it has actual intelligence — > look inside the machine — > all stochastic pattern matching, no intelligence [embedded post]

@guntoucher.bsky.social: NORM MACDONALD: in a follow-up study, researchers also found no evidence of formal reasoning among the top names in technology leadership [embedded post]

X:
Mehrdad Farajtabar / @mfarajtabar: 1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral and leading closed models, including the [image]

Bindu Reddy / @bindureddy: While some people are debating if LLMs can reason... Some others are busy applying LLMs to reasoning and hard coding problems and solving them. AI's actions speak louder than human words

@davidad: When @GaryMarcus and others (including myself) say that LLMs do not “reason,” we mean something quite specific, but it's hard to put one's finger on it, until now. Specifically, Transformers do not generalize algebraic structures out of distribution.

Gary Marcus / @garymarcus: Proven correct once again by the new Apple paper

Ethan Mollick / @emollick: This is made worse by the fact that people who dislike LLMs pick a definition of reasoning where LLMs obviously fail, and boosters pick a definition where they succeed, and all along we don't actually have good definitions for what reasoning means in humans. Lots of crosstalk

Simon Willison / @simonw: Confession: despite all of the debates about whether or not an LLM can “reason”, I still don't really understand exactly what the term “reasoning” means. So just like with “agents” and “AI” itself, I'm not sure the people engaged in those debates are talking about the same thing

François Chollet / @fchollet: One more piece of evidence to add to the pile. This was an extremely heretical viewpoint in early 2023, and now it is increasingly becoming self-evident conventional wisdom. [image]

Cameron Jones / @camrobjones: This looks like great work, and the name-change stuff definitely suggests some kind of overfitting or contamination. But I wish people would stop speculating about human performance without just running human baselines!
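The name-change fragility the posts above keep citing comes from a template-style evaluation: regenerate the same GSM8K-style question with different names and numbers, then re-score the model. A minimal sketch of that idea follows; the template, name pool, and toy solver are illustrative assumptions, not the paper's actual artifacts.

```python
import random
import re

# Illustrative GSM-Symbolic-style template: the names and numeric ranges
# below are made up for this sketch, not taken from the paper.
TEMPLATE = (
    "{name} has {a} apples. She buys {b} more apples. "
    "How many apples does {name} have now?"
)
NAMES = ["Lisa", "Sophie", "Maria", "Elena"]  # assumed name pool

def make_variant(rng):
    """Instantiate one symbolic variant and its ground-truth answer."""
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
    return question, a + b

def accuracy(answer_fn, n=100, seed=0):
    """Score an answering function over n resampled variants."""
    rng = random.Random(seed)
    correct = sum(
        answer_fn(q) == gold
        for q, gold in (make_variant(rng) for _ in range(n))
    )
    return correct / n

# A toy "model" that genuinely parses the numbers, so it is robust to
# renaming. A pattern-matcher overfit to one instantiation (e.g. from
# training-data contamination) would lose accuracy as surface details vary.
def toy_model(question):
    a, b = map(int, re.findall(r"\d+", question))
    return a + b

print(accuracy(toy_model))  # 1.0 for this exact solver
```

The point of the probe is that a model which truly solves the problem should be invariant under these substitutions; the paper's reported ~10% swings from name changes alone are evidence against that invariance.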
Chris Paxton / @chris_j_paxton: Very interesting thread, although I actually am not convinced changing names or adding random clauses wouldn't have a similar effect on human reasoning. Cool benchmark for testing avoidance of contamination and symbolic reasoning

Nouha Dziri / @nouhadziri: Very cool work! This is a reminder again that scoring high on a math benchmark does not necessarily mean that LMs can truly “reason” or “think”. It validates the “faith and fate” observations for skeptics 🙂 You may think of pattern-matching as one type of reasoning, which can indeed 👇

Ali Hatamizadeh / @ahatamiz1: I applaud the authors for proposing a new benchmark that goes beyond GSM8K. But after reading this work, it seems that most of the identified issues are due to poor associative-recall performance and not reasoning per se. For example, this work introduces GSM-NoOp, in which a

@jonmasters: Great thread coming out of @Apple, and I look forward to reading the paper. My personal opinion remains that it is obvious LLMs don't “reason” in the way that humans do. They multiply matrices. They're very good at it, but emergent general intelligence this is not

Gary Marcus / @garymarcus: LLMs are cooked. So many investors are going to lose sooo much money. AI will survive, and even thrive, but a new paradigm is needed.

Gary Marcus / @garymarcus: Yes, this is a big result, quite concerning for LLMs and quite vindicating for the core concerns I raised in 1998 and 2001. Check out my Substack today [link below] for a longer discussion of how this new paper by @MFarajtabar's team fits in with a range of recent results from

@rao2z: Well, it was beyond heretical, bordering on crackpot, in Summer 2022 when we released the “LLMs Still Can't Plan”.. @karthikv792 still remembers standing next to the *only* skeptical poster amidst an ocean of FMDM'22 “CAN DO” posters. Too bad Science is not decided in the long run [image]

Ryota Kanai / @kanair: @fchollet We are trying to get LLMs to perform basic instruction-following as an essential component for solving reasoning tasks. Even for such simple tasks, they fail when multiple steps are involved. It seems we need to solve simple rule-based operation first. https://x.com/...

Gary Marcus / @garymarcus: longer, more in-depth discussion of why LLMs are cooked, here, relating this new study to many others: https://open.substack.com/...

Guy Van den Broeck / @guyvdb: @fchollet https://x.com/... I think we were trying to make a very similar point in 2022. Any distribution over reasoning problems that you train on has statistical features that allow you to solve that familiar reasoning problem without solving it robustly.

Bilawal Sidhu / @bilawalsidhu: AI narrative violation detected. Severity: high. Apple's research finds no evidence of logical reasoning in SOTA large language models like GPT-4o and o1. Needless to say, @GaryMarcus is feeling good today! Popcorn-worthy comments section 🍿 [image]

Mehrdad Farajtabar / @mfarajtabar: 13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated

Nathan Lambert / @natolambert: Great example of current shortcomings in LLMs. Short term, tool use will be used for any math-sensitive application. Long term, I bet we can bake this into the models, 100% accuracy.

Amir Yazdanbakhsh / @ayazdanb: @MFarajtabar Great work Mehrdad! You may find our work relevant. We studied a range of symbolic alterations of prompts and their impacts on model performance: https://openreview.net/...
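The GSM-NoOp benchmark mentioned above appends a clause that introduces a number but is irrelevant to the answer. A solver that pattern-matches on "combine all the numbers you see" gets derailed by exactly such clauses. A minimal sketch, with the distractor text made up for illustration:

```python
import re

# GSM-NoOp-style probe (sketch): add a numerically loaded but irrelevant
# clause. The base question echoes the GSM8K-style example quoted in the
# roundup; the distractor clause is an illustrative assumption, not taken
# from the paper's dataset.
BASE = ("Lisa has 5 apples. She buys 7 more apples. "
        "How many apples does Lisa have now?")
GOLD = 12  # the no-op clause must not change this
NOOP = " 5 of the apples are slightly smaller than average."

def naive_model(question):
    """A brittle 'solver' that just combines every number it sees."""
    return sum(int(n) for n in re.findall(r"\d+", question))

print(naive_model(BASE))         # 12: correct on the clean question
print(naive_model(BASE + NOOP))  # 17: the irrelevant number derails it
```

A genuine reasoner would recognize that the added clause carries no arithmetic relevance and keep answering 12; the paper reports large accuracy drops on such NoOp variants, which is the behavior this toy pattern-matcher reproduces.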
Gary Marcus / @garymarcus: 👇 Superb new article from @apple AI: “we found no evidence of formal reasoning in language models. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!” 𝗧𝗵𝗲𝗿𝗲 𝗶𝘀 𝗷𝘂𝘀𝘁 𝗻𝗼 𝘄𝗮𝘆

LinkedIn:
Alexis B.: Just because it feels like it does doesn't mean that it actually does. Always know the limitations of your tools and have a process to mitigate them. …

Kyle Gao: LLMs can't reliably reason with math questions. The GSM8K dataset has questions like this: “Lisa has 5 apples. She buys 7 more apples. …

Peter Cotton: “We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead …

Mehrzad Samadi: Interesting paper done by Mehrdad Farajtabar and his colleagues! — “Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. …

Alex Trouteaud: A team of Apple researchers argues that even today's best-in-class AI systems are far from being the reasoning engines they appear to be. …

Hamdi Amroun, PhD: A recent study has explored whether Large Language Models (LLMs), such as GPT-4 and others, truly reason or merely match patterns when solving mathematical problems. …

Forums:
Hacker News: Apple study proves LLM-based AI models are flawed because they cannot reason
Hacker News: LLMs don't do formal reasoning
r/technology: Apple study proves LLM-based AI models are flawed because they cannot reason
r/technology: Apple's study proves that LLM-based AI models are flawed because they cannot reason
r/ChatGPT: Apple Research Paper: LLMs cannot reason. They rely on complex pattern matching
r/OpenAI: Apple Research Paper: LLMs cannot reason. They rely on complex pattern matching.
r/LocalLLaMA: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models - From Apple
lobste.rs: LLMs don't do formal reasoning - and that is a HUGE problem