
Study Done By Apple AI Scientists Proves LLMs Have No Ability to Reason (appleinsider.com) 43

Slashdot reader Rick Schumann shared this report from the blog AppleInsider: A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.

The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models. The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen...

The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded... "We found no evidence of formal reasoning in language models," the researchers wrote. The behavior of LLMs "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
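The GSM-Symbolic benchmark itself isn't included in the summary, but the kind of perturbation it describes can be sketched. The snippet below is a minimal illustration under stated assumptions, not the paper's actual templates or test items: a math word problem is instantiated from a symbolic template, the names and numbers are varied, and an optional irrelevant clause is appended that leaves the ground-truth answer unchanged, so any shift in a model's answer is attributable to surface wording alone.

```python
import random

# Illustrative template only; the names, numbers, and "distractor" sentence are
# assumptions made for this sketch, not items from the Apple paper.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{distractor}"
    "How many apples does {name} have in total?"
)

NAMES = ["Liam", "Sofia", "Priya", "Mateo"]
DISTRACTOR = "Five of Tuesday's apples are a bit smaller than average. "  # irrelevant to the sum


def make_variant(seed: int, with_distractor: bool) -> tuple[str, int]:
    """Return (question, ground_truth). Swapping names/numbers or adding the
    distractor never changes the math, so the truth is computed symbolically."""
    rng = random.Random(seed)
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        x=x,
        y=y,
        distractor=DISTRACTOR if with_distractor else "",
    )
    return question, x + y


if __name__ == "__main__":
    plain, answer = make_variant(seed=7, with_distractor=False)
    noisy, same_answer = make_variant(seed=7, with_distractor=True)
    assert answer == same_answer  # the distractor is a no-op for the ground truth
    print(plain, "->", answer)
    print(noisy, "->", same_answer)
```

Scoring a model on matched pairs like (plain, noisy) is roughly how a "fragility" gap can be measured: the correct answer is identical by construction, so any accuracy drop on the noisy variant comes from the irrelevant wording alone.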


Comments Filter:
  • Duh (Score:4, Insightful)

    by locater16 ( 2326718 ) on Sunday October 13, 2024 @05:53PM (#64861565)
    "Reasoning" was never part of the fundamental LLM model. But if you brute force it enough it'll do something kinda cool, which is enough to get money, which is enough to get thousands upon thousands of people brute forcing it, fundamentals be damned.
    • by gweihir ( 88907 )

      You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, it can regurgitate reasoning steps it has seen in its training data with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.

      • You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, it can regurgitate reasoning steps it has seen in its training data with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.

        Fundamentally I agree, but devil's advocate here. You haven't said what this "reasoning" thing is. How do you know it isn't just a feature of a larger model?

        What they have done here is to assert the existence of a class of problems which they say require reasoning to solve efficiently. They can create a more or less infinite set of such examples and manipulate them so that systems with "reasoning" (say graduate students as in comments below) can solve them easily and quickly and LLMs cannot.

        That's a reasonab

      • It's even worse than that.

        As more people use LLMs, more content will be LLM generated - and LLMs can't tell the difference between generated content (which is inappropriate to train on) and "real" content that may actually include some percentage of actual reasoning steps that could be used to fake some percentage of reasoning.

        AI will ultimately destroy itself by devouring its own tail.

        • by dfghjk ( 711126 )

          How do you know that generated content is "inappropriate to train on"? AI is in its infancy and every aspect is under rapid development. Some say that generated content may be problematic, but so-called experts demonstrate a lot of basic ignorance constantly. For all we know, that could soon not be a problem at all; it's certainly not a problem in this context.

    • by Kisai ( 213879 )

      Water is wet, ducks report.

      I know the mundane, non-tech-literate people need to be told this, but every single LLM "AI" thing out there is not intelligent. It's a trained parrot. The parrot does not understand English; it does not "speak", it "mimics".

    • Intelligence? I don’t think we even have a convincing definition of intelligence in general – let alone one sufficiently refined to allow one to distinguish between artificial and natural intelligence. That said, I bet LLMs will probably help intelligent people do great things.

    • "Reasoning" was never part of the fundamental LLM model. But if you brute force it enough it'll do something kinda cool, which is enough to get money, which is enough to get thousands upon thousands of people brute forcing it, fundamentals be damned.

      I think the only reason LLMs are so popular over other ML approaches is that LLMs self-learn patterns, while many other (stronger) ML approaches require thousands or millions of labeled training samples. This also means that LLMs try to find patterns but have nothing to tell them when they are right.

    • by dfghjk ( 711126 )

      but "reasoning" is a part of intelligence, and therefore AI would exhibit reasoning if it weren't a fraud. And there is no such thing as "the fundamental LLM". Also, you don't say "LLM model", that is redundant.

  • Reasoning (Score:5, Interesting)

    by systemd-anonymousd ( 6652324 ) on Sunday October 13, 2024 @05:58PM (#64861573)

    Future models can be improved for formalized reasoning, but what's beyond obvious at this point is that next-token text prediction is far more powerful than anyone ever imagined. Our current models out-perform graduate students and can be massive helps for professionals. It's still up to you to figure out how to integrate them into your workflow. For professional software engineering I've found them to be hugely useful, like a rubber duck that also has instant PhD-level knowledge of specific tasks that I'm often still learning or only passingly familiar with. It's a productivity booster and a much better search engine, most of the time.

    • by bjoast ( 1310293 )
      No, this is Slashdot. You are supposed to think that recent advances in LLMs are completely pointless and that GPT-4o is only slightly better than Google Search was in 2001.
      • That understanding actually comes from an understanding of data structures and algorithms, and from knowing that this is in fact a very modest advancement on the same level as search. It also comes from having lived through a number of technological revolutions, each of which featured all manner of con-artists and hucksters promising things that could not be delivered.

        That's part of what makes computer science a science. We make predictions then we test those predictions. Were those who said "That's not intelligence

    • by jythie ( 914043 )
      One does not flow from the other. Yes, they will likely improve and have greater utility, but one of the classic weaknesses of ML-based AI is its lack of symbolic reasoning, and that probably isn't going to change until we get a new generation of researchers who reopen the old pre-ML models... which is probably going to take a while since those still have the taint of uncool (and unprofitable). By the time people return to them, the people who had worked on them will probably be out of industry, so welcome
      • Why would we need a new generation of researchers for that? That's baseless

    • I do agree there. A few years ago, ChatGPT would be considered an "intern" level. Good enough to fetch stuff, but makes mistakes. Now with the newer LLMs, it is able to do more things, such as generating SCAD models from text. However, it still has a way to go, as the models it does generate may not make sense, or need cleanup.

      We have come far with the newer models, but we have come far with a lot of advances, and eventually diminishing returns hit until we go find another technology, perhaps some other

    • Hugely useful for software development?
      In specific tasks of boilerplate code generation, maybe.
      The training data for these code models is open-source code repositories. At best you can hope for "average" code.
      The model doesn't know which bits of code do what, or whether they do it without bugs.
      It doesn't know how "good" the code is.

      I'm sure it's really good at reciting common examples to common questions.

      • If you're not able to articulate specific tasks, the exact inputs and outputs that are provided and what you desire, and give it enough context to understand the interoperability with your existing system, then you're not a good engineer anyway. Without that all you can expect is boilerplate or leetcode copy-paste, which is obviously not useful to a competent engineer.

      • I find it useful for the kind of thing you'd be comfortable pushing to an intern with Stack Overflow. And that's not bad.

        These LLMs have in no way revolutionized the way I write code. But for many simple tasks they work reasonably well. And once you only try to leverage them in those situations, you can gain decent productivity!

        For me, I find it particularly helpful for solving simple tasks in tech I don't know well. I am an HPC scientist. So often I set up benchmarks on weird code bases, written in a variety o

        • The code assistant I use routinely suggests completions that are accurate and that are compatible with surrounding code. Hit tab and you saved a lot of typing. I can compose a paragraph of comments describing what I want to do next and it does a fair job of just writing out the whole thing.

    • Our current models out-perform graduate students.

      That's more a statement that your tests are broken than a statement that the models are working.

  • ELIZA? (Score:4, Insightful)

    by ctilsie242 ( 4841247 ) on Sunday October 13, 2024 @06:01PM (#64861579)

    Sometimes I wonder if, even with all the nodes and capacity that modern LLMs have, we are all that far away from good old ELIZA back in the '70s. We have gone far with CPU, disk, RAM and such, but we may need to go a completely different route for AGI/ASI.

    • by jythie ( 914043 )
      As the saying goes... no matter how much money you spend, you can't turn a pig into a racehorse. You can, however, get an awfully fast pig. ML in general has done well because it can go REALLY fast, due to how it can be mapped onto cheap GPUs... but its limitations have not changed.
  • Any study that sets a high standard for "reasoning" finds the null result with all AI systems. Gary Marcus is but the most prominent expert who flummoxes AI without fail. Apple's study is clearly done correctly in that vein.

    What's meant by "standard"? If AI's "reasoning" capabilities were judged against the standard set by Marjorie Taylor-Greene, Matt Gaetz or even Donald Trump, then it would assuredly be deemed intelligent, because it spews far fewer stupid things than these characters.

    If only intellige
  • by xtal ( 49134 )

    O1-mini easily identified that the information was irrelevant and produced the correct answers... it is also the first model with basic advanced reasoning.

    Insane resources are being deployed. It is a race to when they stop getting smarter... not there yet.

  • I haven't done the test recently, so the results may have changed. If you ask the LLM to produce code to solve the knapsack problem, it will produce the standard dynamic programming solution.
    If you describe a problem as a set of objects with weights and values, and you try to select a subset of objects whose total weight fits within a capacity while maximizing the total value, it produces dynamic programming.
    If you present it as a problem where you have a set of zorglubs with foo and bar properties and you want to select a subset
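For context, the "standard dynamic programming" solution the comment above refers to is the classic 0/1 knapsack recurrence. A minimal sketch (the item weights/values below are made up for illustration):

```python
def knapsack(weights: list[int], values: list[int], capacity: int) -> int:
    """Classic 0/1 knapsack via dynamic programming.

    dp[c] is the best total value achievable with capacity c using the items
    seen so far; iterating capacities downward ensures each item is used once.
    """
    dp = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]


# Made-up example: items of weight 2/3/4 and value 3/4/5 with capacity 5 -> 7.
print(knapsack(weights=[2, 3, 4], values=[3, 4, 5], capacity=5))
```

The commenter's point is that calling the objects "zorglubs" with "foo" and "bar" properties changes nothing about this solution, yet that kind of surface rewording is exactly what the study reports as throwing the models off.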

  • How could it pass a Turing test without reasoning? It seems that an important component of a test for human thinking would focus on reasoning and logic.
  • It takes a freaking nuclear power plant to brute-force trillions of floating-point matrix operations. I'd say that's a pretty hard limit. However AI is actually going to be done, it can't be done like that. I experimented with a very simple AI rules engine decades ago, and ran right into the computational complexity wall after just a few dozen levels of reasoning. When I showed the results to the customer, he told me he wasn't looking for a machine to replace human reasoning, he wanted a machine to just run
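A rough back-of-the-envelope illustration of that complexity wall (the branching factor is an assumption for the sketch, not a description of the poster's actual rules engine): if each derived fact can trigger even a few follow-on rules, the number of candidate inference chains grows exponentially with reasoning depth.

```python
# Toy estimate of rule-engine blow-up: with a modest branching factor,
# candidate inference chains grow as branching ** depth.
BRANCHING = 3  # assumed follow-on inferences per step (illustrative)

for depth in (10, 20, 30, 40):
    print(f"depth {depth:2d}: ~{float(BRANCHING) ** depth:.2e} chains")
```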
