Study Done By Apple AI Scientists Proves LLMs Have No Ability to Reason

Slashdot reader Rick Schumann shared this report from the blog AppleInsider: A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.

The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models. The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen...

The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded... "We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
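The perturbation idea described in the summary can be sketched in a few lines. This is only an illustration under stated assumptions: the template, the name list, the distractor sentence, and `naive_solver` are all hypothetical and are not taken from the paper or from any GSM-Symbolic code. The stand-in solver is not an LLM; it just adds up every number it sees, which is enough to show how an irrelevant clause can flip an answer while the underlying arithmetic stays the same.

```python
import random
import re

# Hypothetical template: names and numbers are placeholders, and the
# ground-truth answer is computed from the numbers, not stored as text.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{distractor}How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Priya", "Mateo"]
# An irrelevant clause that should not change the arithmetic.
DISTRACTOR = "5 of the apples are a bit smaller than average. "


def make_variant(rng, with_distractor=False):
    """Return (question, answer) for one randomized instance of the template."""
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    question = TEMPLATE.format(
        name=name, a=a, b=b, distractor=DISTRACTOR if with_distractor else ""
    )
    return question, a + b


def naive_solver(question):
    """Stand-in 'pattern matcher' (not an LLM): sums every number in the text."""
    return sum(int(n) for n in re.findall(r"\d+", question))


def accuracy(solver, variants):
    """Fraction of variants for which solver(question) equals the answer."""
    correct = sum(1 for question, answer in variants if solver(question) == answer)
    return correct / len(variants)


rng = random.Random(0)
plain = [make_variant(rng) for _ in range(20)]
noisy = [make_variant(rng, with_distractor=True) for _ in range(20)]
print(accuracy(naive_solver, plain), accuracy(naive_solver, noisy))
```

Swapping `naive_solver` for a function that calls a real model would turn this into a small robustness check: a solver that actually reasons about the question should score the same on `plain` and `noisy`.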

  • Duh (Score:5, Insightful)

    by locater16 ( 2326718 ) on Sunday October 13, 2024 @05:53PM (#64861565)
    "Reasoning" was never part of the fundamental LLM model. But if you brute force it enough it'll do something kinda cool, which is enough to get money, which is enough to get thousands upon thousands of people brute forcing it, fundamentals be damned.
    • Re:Duh (Score:5, Insightful)

      by gweihir ( 88907 ) on Sunday October 13, 2024 @05:57PM (#64861571)

      You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, it can regurgitate reasoning steps it has seen in its training data with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.

      • Re: (Score:2, Troll)

        by AleRunner ( 4556245 )

        You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, it can regurgitate reasoning steps it has seen in its training data with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.

        Fundamentally I agree, but devil's advocate here. You haven't said what this "reasoning" thing is. How do you know it isn't just a feature of a larger model?

        What they have done here is to assert the existence of a class of problems which they say require reasoning to solve efficiently. They can create a more or less infinite set of such examples and manipulate them so that systems with "reasoning" (say graduate students as in comments below) can solve them easily and quickly and LLMs cannot.

        That's a reasonab

      • Re:Duh (Score:5, Interesting)

        by hsthompson69 ( 1674722 ) on Sunday October 13, 2024 @08:09PM (#64861815)

        It's even worse than that.

        As more people use LLMs, more content will be LLM generated - and LLMs can't tell the difference between generated content (which is inappropriate to train on), and "real" content that may actually include some percentage of actual reasoning steps that could be used to fake some % of reason.

        AI will ultimately destroy itself by devouring its own tail.

        • by dfghjk ( 711126 )

          how do you know that generated content is "inappropriate to train on"? AI is in its infancy and every aspect is under rapid development. Some say that generated content may be problematic, but so called experts demonstrate a lot of basic ignorance constantly. For all we know, that could soon not be a problem at all; it's certainly not a problem in this context.

          • by narcc ( 412956 )

            how do you know that generated content is "inappropriate to train on"?

            This is a well-understood phenomenon. It's also intuitively obvious. I even described it in an off-hand way here months before the paper that coined the term "model collapse".

            but so called experts demonstrate a lot of basic ignorance constantly

            So... we should listen to people without any relevant knowledge or experience? If that's what you're after, LessWrong has no shortage of uneducated crackpots. You'll fit right in.
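One mechanism often associated with "model collapse" can be shown with a toy simulation. This is a sketch under loud assumptions, not the setup of the paper mentioned above: the "model" here is just a token frequency table, and each generation is "trained" only on a finite sample drawn from the previous generation. The function names and the Zipf-like starting distribution are made up for illustration. The point it shows is narrow: once a rare token fails to appear in a sample, it can never come back, so the number of surviving tokens can only shrink.

```python
import random
from collections import Counter


def resample_generation(weights, n_samples, rng):
    """Draw n_samples tokens from `weights`, then refit weights from the counts."""
    tokens = list(weights)
    draws = rng.choices(tokens, weights=[weights[t] for t in tokens], k=n_samples)
    counts = Counter(draws)
    # Tokens that were never drawn get no entry, so they can never reappear.
    return {token: count / n_samples for token, count in counts.items()}


def simulate(weights, generations, n_samples, seed=0):
    """Return the number of distinct surviving tokens after each generation."""
    rng = random.Random(seed)
    support_sizes = [len(weights)]
    for _ in range(generations):
        weights = resample_generation(weights, n_samples, rng)
        support_sizes.append(len(weights))
    return support_sizes


# Hypothetical long-tailed starting distribution over 50 tokens.
start = {f"tok{i}": 1.0 / (i + 1) for i in range(50)}
print(simulate(start, generations=10, n_samples=100))
```

Real training pipelines mix generated and human-written data, so this says nothing about how severe the effect is in practice; it only illustrates why the tail of a distribution is the first thing at risk.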

    • by Kisai ( 213879 )

      Water is wet, ducks report.

      I know the mundane, non-tech-literate people need to be told this, but every single LLM "AI" thing out there is not intelligent. It's a trained parrot. The parrot does not understand English; it does not "speak", it "mimics"

      • Parrots very well can understand and speak English in the range of 800 to 1000 words vocabulary.
        There is plenty of research about that.

      • There is nothing intelligent about today's so-called AI. It's nothing more than massive brute-forced machine learning. Sufficiently advanced tech will appear to be intelligent or even magical, when in fact it's just doing what it was programmed to do. Nothing more, nothing less. Non-living beings will never be able to reason or have emotions. Machines, regardless of the amount of data and programming we toss at them or program them to consume, will never become sentient. They may appear that way in some sup

      • by mbkennel ( 97636 )

        Parrots and corvids can reason and solve novel problems.

    • Intelligence? I don’t think we even have a convincing definition of intelligence in general – let alone one sufficiently refined to allow one to distinguish between artificial and natural intelligence. That said, I bet LLMs will probably help intelligent people do great things.

    • "Reasoning" was never part of the fundamental LLM model. But if you brute force it enough it'll do something kinda cool, which is enough to get money, which is enough to get thousands upon thousands of people brute forcing it, fundamentals be damned.

      I think the only reason LLMs are so popular over other ML approaches is that LLMs self-learn patterns, while many other (stronger) ML approaches require thousands or millions of labeled training samples. This also means that LLMs try to find patterns but have nothing to tell them when they are right.

    • by dfghjk ( 711126 )

      but "reasoning" is a part of intelligence, and therefore AI would exhibit reasoning if it weren't a fraud. And there is no such thing as "the fundamental LLM". Also, you don't say "LLM model", that is redundant.

    • Yeah but "reasoning" has definitely been part of the LLM hype.
  • Reasoning (Score:5, Interesting)

    by systemd-anonymousd ( 6652324 ) on Sunday October 13, 2024 @05:58PM (#64861573)

    Future models can be improved for formalized reasoning, but what's beyond obvious at this point is that next-token text prediction is far more powerful than anyone ever imagined. Our current models out-perform graduate students and can be a massive help for professionals. It's still up to you to figure out how to integrate them into your workflow. For professional software engineering I've found them to be hugely useful, like a rubber duck that also has instant PhD-level knowledge on specific tasks that I'm often still learning or only passingly familiar with. It's a productivity booster and a much better search engine, most of the time.

    • by bjoast ( 1310293 )
      No, this is Slashdot. You are supposed to think that recent advances in LLMs are completely pointless and that GPT-4o is only slightly better than Google Search was in 2001.
      • That understanding actually comes from an understanding of data structures and algorithms and knowing that it is in fact a very modest advancement on the same level as search. It also comes from having lived through a number of technological revolutions, each of which featured all manner of con-artists and hucksters promising things that could not be delivered.

        That's part of what makes computer science a science. We make predictions then we test those predictions. Were those who said "That's not intelligence

      • by dfghjk ( 711126 )

        google search in 2001 gave you real results, no LLM is as good as that, much less "slightly better".

        If you're going to slag /. posters, don't prove yourself among the dumbest ones in the same sentence.

        • >google search in 2001 gave you real results

          The Web of 2024 is nothing like the web of 2001. Now we need knowledge engines, and Google fails there because it can't deal with SEO. No one's making a bunch of webring-indexed personal homepages.

    • by jythie ( 914043 )
      One does not flow from the other. Yes, they will likely improve and have greater utility, but one of the classic weaknesses of ML-based AI is its lack of symbolic reasoning, and that probably isn't going to change until we get a new generation of researchers that reopen the old pre-ML models, which is probably going to take a while since those still have the taint of uncool (and unprofitable). By the time people return to them, the people who had worked on them will probably be out of industry, so welcome
      • Why would we need a new generation of researchers for that? That's baseless

    • I do agree there. A few years ago, ChatGPT would be considered an "intern" level. Good enough to fetch stuff, but makes mistakes. Now with the newer LLMs, it is able to do more things, such as generating SCAD models from text. However, it still has a way to go, as the models it does generate may not make sense, or need cleanup.

      We have come far with the newer models, but we have come far with a lot of advances, and eventually diminishing returns hit until we go find another technology, perhaps some other

    • Hugely useful for software development?
      In specific tasks of boilerplate code generation, maybe.
      The training data for these code models is open source code repositories. At best you can hope for "average" code.
      The model doesn't know which bits of code do what, or if they do it without bugs.
      It doesn't know how "good" the code is.

      I'm sure it's really good at reciting common examples to common questions.

      • If you're not able to articulate specific tasks, the exact inputs and outputs that are provided and what you desire, and give it enough context to understand the interoperability with your existing system, then you're not a good engineer anyway. Without that all you can expect is boilerplate or leetcode copy-paste, which is obviously not useful to a competent engineer.

        • by dfghjk ( 711126 )

          If you are a good engineer, you do the work yourself. You don't rely on software, designed to discard information and recall what's left imprecisely, to do your engineering work for you. Now, we all know that you use these "hallucinating" tools because you brag about it. On the other hand, no one says you're a good engineer but you.

          • >If you are a good engineer, you do the work yourself.

            Wrong these days. If you're a good engineer you use the tools available to you to get the task done well, and fast.

            > You don't rely on software, designed to discard information and recall what's left imprecisely, to do your engineering work for you.

            Do what you want and get left behind, then be left confused as to why it happened.

            >Now, we all know that you use these "hallucinating" tools because you brag about it.

            A moment ago you said it can only

        • So now you're just writing your code in LLM-inputs?
          May as well write in the target language.

          You're still going to have to write all the test cases
          And go through the generated code to make sure all the code paths are tested, and make sure the result is what you expect.

      • I find it useful for the kind of thing you'd be comfortable pushing to an intern with Stack Overflow. And that's not bad.

        In no way have these LLMs revolutionized the way I write code. But for many simple tasks, they work reasonably well. And once you only try to leverage them in these situations, you can gain decent productivity!

        For me, I find it particularly helpful to solve simple tasks in tech I don't know well. I am an HPC scientist. So often I set up benchmarks on weird code bases, written in a variety o

        • The code assistant I use routinely suggests completions that are accurate and that are compatible with surrounding code. Hit tab and you saved a lot of typing. I can compose a paragraph of comments describing what I want to do next and it does a fair job of just writing out the whole thing.

          • by dgatwood ( 11270 )

            The code assistant I use routinely suggests completions that are accurate and that are compatible with surrounding code. Hit tab and you saved a lot of typing. I can compose a paragraph of comments describing what I want to do next and it does a fair job of just writing out the whole thing.

            Ah, but which is faster: Writing out the paragraph of comments with enough precision for the LLM to do the right thing (or something close enough that you can massage into the right thing) or using simpler word completion/function name completion and writing the code yourself? In my experience, the latter is faster. Your mileage may vary.

            • I find that if I write out what I'm wanting to do in comments beforehand, it clarifies my thinking. Then if the assistant creates a reasonable facsimile I can work from it.

      • It helps me get a start with some boilerplate that is 70% functional, contains 20% made up methods, and is 100% inappropriate for the given problem set, but I've still been more productive than without it. I use it to brainstorm and it never lets me down 33% of the time if you know what I mean.

        It's frustrating, infuriating even, but I can't deny I'm better off with it. It's great to have another source of unreliable nonsense mixed in with genius other than just Google and my friend Dave

    • Our current models out-perform graduate students.

      That's more a statement that your tests are broken than a statement that the models are working.

    • by dfghjk ( 711126 )

      Spotted the guy too stupid to tell the difference and too arrogant to understand his own limitations. Here's the real threat of "AI", grift by stupid, ethically-challenged people.

      • With anger like that you're probably right to fear being replaced by younger, better programmers who are more able to use the tools available to them.

    • by mbkennel ( 97636 )

      They aren't good at deeper reasoning but they're good at memorizing and doing simple applications. The problem there, as a smart search engine, is that they memorize insufficiently well and don't know when they're making mistakes.

  • ELIZA? (Score:5, Insightful)

    by ctilsie242 ( 4841247 ) on Sunday October 13, 2024 @06:01PM (#64861579)

    Sometimes I wonder whether, even with all the nodes and capacity that modern LLMs have, we are not that far away from good old ELIZA back in the 60s. We have gone far with CPU, disk, RAM and such, but we may need to go a different route completely for AGI/ASI.

    • by jythie ( 914043 )
      As the saying goes... no matter how much money you spend, you can't turn a pig into a racehorse. You can however get an awful fast pig. ML in general has done well because it can go REALLY fast due to how it can be mapped onto cheap GPUs.. but its limitations have not changed.
    • ELIZA simply returned to you what you told it, converting it into a question.
      Alice: Bob does not like me.
      Eliza: Why do you think that (Bob does not like you)?
      Alice: He never says a nice word.
      Eliza: Tell me more about it (Eliza lost track)

      An LLM tries to answer your question, that is a difference.
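The exchange above can be reproduced with a handful of pattern rules. The sketch below is a minimal illustration of that style of program, assuming a made-up rule set; it is not Weizenbaum's original script, and the names `RULES`, `REFLECTIONS`, and `respond` are hypothetical. It matches the input against templates, swaps pronouns in whatever it captured, and falls back to a stock phrase when nothing matches, which is the "lost track" case.

```python
import re

# Pronoun swaps used to turn the user's statement back on them.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

# (pattern, response template) pairs; an illustrative rule set only.
RULES = [
    (re.compile(r"(.*) does not like me", re.IGNORECASE),
     "Why do you think that {0} does not like you?"),
    (re.compile(r"i feel (.*)", re.IGNORECASE),
     "Why do you feel {0}?"),
]
FALLBACK = "Tell me more about it."


def reflect(fragment):
    """Swap first- and second-person words in a captured fragment."""
    return " ".join(REFLECTIONS.get(word.lower(), word) for word in fragment.split())


def respond(statement):
    """Return the first matching rule's response, or the fallback."""
    text = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.fullmatch(text)
        if match:
            return template.format(*(reflect(group) for group in match.groups()))
    return FALLBACK


print(respond("Bob does not like me."))
print(respond("He never says a nice word."))
```

Nothing here models what the user means; the second input gets the fallback simply because no pattern fits it.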

  • by xtal ( 49134 )

    o1-mini easily identified that the information was irrelevant and produced the correct answers. It is also the first model with basic advanced reasoning.

    Insane resources are deployed. It is a race to when they stop getting smarter.. not there yet.

  • I haven't done the test recently, so the results may have changed. If you ask the LLM to produce code to solve knapsack, it will produce the standard dynamic programming solution.
    If you describe a problem as a set of objects with weights and values, where you try to select a subset of objects whose total weight fits within a capacity and that maximizes the total value, it produces a dynamic programming solution.
    If you present it as a problem where you have a set of zorglub with foo and bar properties and you want to select a subset
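For reference, the standard dynamic programming solution mentioned above looks roughly like the sketch below. It assumes the 0/1 variant of knapsack with non-negative integer weights, and it assumes the "zorglub" framing is meant to be the same problem with `foo` playing the role of weight and `bar` the role of value; the comment is cut off, so that mapping is a guess. The sample data is invented for illustration.

```python
def knapsack(items, capacity):
    """0/1 knapsack via dynamic programming.

    items: list of (weight, value) pairs with non-negative integer weights.
    Returns the best total value achievable within `capacity`.
    """
    # best[c] = best value achievable with total weight at most c.
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


# The same instance under two vocabularies (made-up sample data).
objects = [(12, 4), (2, 2), (1, 1), (1, 2), (4, 10)]    # (weight, value)
zorglubs = [{"foo": w, "bar": v} for w, v in objects]   # assumed: foo=weight, bar=value

print(knapsack(objects, 15))
print(knapsack([(z["foo"], z["bar"]) for z in zorglubs], 15))
```

Both calls run the identical computation, which is the point of the renaming test: only the surface vocabulary differs, so a solver that recognizes the structure should treat the two descriptions the same way.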

  • How could it pass a Turing test without reasoning? It seems that an important component of a test for human thinking would focus on reasoning and logic.
    • It's harder to make strong AI than trick humans. Eliza passed the Turing test in the 1960s.

      It turns out, humans aren't hard to trick.
    • How could it pass a Turing test without reasoning? It seems that an important component of a test for human thinking would focus on reasoning and logic.

      The Turing test doesn't really test that. Not directly anyway. The test requires a human's judgement of whether an interlocutor is machine or human. All that needs to happen is that the human judge cannot tell which it is. It's not a formal test of reasoning.

  • It takes a freaking nuclear power plant to brute-force trillions of floating-point matrix operations. I'd say that's a pretty hard limit. However AI is actually going to be done, it can't be done like that. I experimented with a very simple AI rules engine decades ago, and ran right into the computational complexity wall after just a few dozen levels of reasoning. When I showed the results to the customer, he told me he wasn't looking for a machine to replace human reasoning, he wanted a machine to just run

  • "Can I get an LLM to lay out a reasoning strategy?" and "can I trip it up with words that would not trip up most postdocs?" are separate questions.

    Some of the LLMs can do the first part now beyond simple next-token generation.

    A buddy of mine asked one how many calories are in a typical male hippo and it described its plan to calculate it, did the math right, and advised against adding hippo to the diet.

  • Would it be fair to say that in all the LLMs the AI is just tallying up all the connections and patterns it sees without really "understanding" the data?
    And that the answers it gives are from a best match pattern search?
    imho the AI won't ever achieve the ability to reason. (Not saying AI doesn't have good uses)
  • So, a study takes a look at two current implementations of LLMs and proclaims that all other existing LLMs don't work and furthermore no other LLMs that will ever be created in the future will work. Why? Because all LLMs are the same, always have been, and always will be in the future. It's not like the hundreds of billions of dollars in hardware are being used to perform research into different LLM architectures, models, and processes.

    Yes, the word "prove" is incompetently used by the article writer and

    • by narcc ( 412956 )

      I'm sorry that reality ruined your silly AI fantasy. I'm amazed that it took this long.

