Study Done By Apple AI Scientists Proves LLMs Have No Ability to Reason (appleinsider.com)
Slashdot reader Rick Schumann shared this report from the blog AppleInsider:
A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.
The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models. The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen...
The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded... "We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching" which the study found to be "so fragile, in fact, that [simply] changing names can alter results."
Duh (Score:5, Insightful)
Re:Duh (Score:5, Insightful)
You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, they can regurgitate reasoning steps they have seen in their training data, with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.
Re: (Score:2, Troll)
You beat me to it. LLMs cannot reason and will never be able to reason. The very approach does not allow it. Obviously, they can regurgitate reasoning steps they have seen in their training data, with not very high reliability, but that is not reasoning. That is faking it. Since most people only have very limited or no reasoning ability themselves, they are then deeply impressed by the fake.
Fundamentally I agree, but devil's advocate here. You haven't said what this "reasoning" thing is. How do you know it isn't just a feature of a larger model?
What they have done here is to assert the existence of a class of problems which they say require reasoning to solve efficiently. They can create a more or less infinite set of such examples and manipulate them so that systems with "reasoning" (say graduate students as in comments below) can solve them easily and quickly and LLMs cannot.
That's a reasonab
Re: (Score:2)
Re: (Score:2)
Yeah this is where my philosophy grad brain kicks in to protest a little bit.
There's a whole series of words used in AI, and by the public in general, that are "you know what I mean" fuzzy words. But these words are *terrible* for the practice of science.
Take everyone's favorite, "consciousness". You know intuitively what it is, you're doing it right now, but try and define it? Not so easy without ending up in tautological loops. The best we can really come up with is something like "paying attention to something
Re:Duh (Score:5, Interesting)
It's even worse than that.
As more people use LLMs, more content will be LLM generated - and LLMs can't tell the difference between generated content (which is inappropriate to train on), and "real" content that may actually include some percentage of actual reasoning steps that could be used to fake some % of reason.
AI will ultimately destroy itself by devouring its own tail.
Re: (Score:2)
How do you know that generated content is "inappropriate to train on"? AI is in its infancy and every aspect is under rapid development. Some say that generated content may be problematic, but so-called experts demonstrate a lot of basic ignorance constantly. For all we know, that could soon not be a problem at all; it's certainly not a problem in this context.
Re: (Score:3)
How do you know that generated content is "inappropriate to train on"?
This is a well-understood phenomenon. It's also intuitively obvious. I even described it in an off-hand way here months before the paper that coined the term "model collapse".
but so called experts demonstrate a lot of basic ignorance constantly
So... we should listen to people without any relevant knowledge or experience? If that's what you're after, LessWrong has no shortage of uneducated crackpots. You'll fit right in.
Re: (Score:2)
Water is wet, ducks report.
I know the mundane, non-tech-literate people need to be told this, but every single LLM "AI" thing out there is not intelligent. It's a trained parrot. The parrot does not understand English; it does not "speak", it "mimics".
Re: (Score:2)
Parrots can very well understand and speak English, with a vocabulary in the range of 800 to 1,000 words.
There is plenty of research about that.
Re: Duh (Score:2)
There is nothing intelligent about today's so-called AI. It's nothing more than massive brute-forced machine learning. Sufficiently advanced tech will appear to be intelligent or even magical, when in fact it's just doing what it was programmed to do. Nothing more, nothing less. Non-living beings will never be able to reason or have emotions. Machines, regardless of the amount of data and programming we toss at them or program them to consume, will never become sentient. They may appear that way in some sup
Re: (Score:2)
Parrots and corvids can reason and solve novel problems.
Maybe they just named it wrong (Score:1)
Intelligence? I don’t think we even have a convincing definition of intelligence in general – let alone one sufficiently refined to allow one to distinguish between artificial and natural intelligence. That said, I bet LLMs will probably help intelligent people do great things.
Re: (Score:3)
"Reasoning" was never part of the fundamental LLM model. But if you brute force it enough it'll do something kinda cool, which is enough to get money, which is enough to get thousands upon thousands of people brute forcing it, fundamentals be damned.
I think the only reason LLMs are so popular over other ML approaches is that LLMs self-learn patterns, while many other (stronger) ML approaches require thousands or millions of labeled training samples. This also means that LLMs try to find patterns but have nothing to tell them when they are right.
Re: (Score:2)
but "reasoning" is a part of intelligence, and therefore AI would exhibit reasoning if it weren't a fraud. And there is no such thing as "the fundamental LLM". Also, you don't say "LLM model", that is redundant.
Re: (Score:2)
Reasoning (Score:5, Interesting)
Future models can be improved for formalized reasoning, but what's beyond obvious at this point is that next-token text prediction is far more powerful than anyone ever imagined. Our current models outperform graduate students and can be a massive help for professionals. It's still up to you to figure out how to integrate them into your workflow. For professional software engineering I've found them hugely useful, like a rubber duck that also has instant PhD-level knowledge of specific tasks that I'm often still learning or only loosely familiar with. It's a productivity booster and a much better search engine, most of the time.
Re: (Score:3)
Re: (Score:2)
Yep, too stupid to tell the difference, too arrogant to understand, ethically-challenged.
Re: (Score:2)
That understanding actually comes from an understanding of data structures and algorithms, and from knowing that this is in fact a very modest advancement, on the same level as search. It also comes from having lived through a number of technological revolutions, each of which featured all manner of con artists and hucksters promising things that could not be delivered.
That's part of what makes computer science a science. We make predictions then we test those predictions. Were those who said "That's not intelligence
Re: (Score:2)
>data structures and algorithms
Tell me you don't do professional engineering without telling me you don't do professional engineering.
Re: (Score:2)
google search in 2001 gave you real results, no LLM is as good as that, much less "slightly better".
If you're going to slag /. posters, don't prove yourself among the dumbest ones in the same sentence.
Re: (Score:2)
>google search in 2001 gave you real results
The Web of 2024 is nothing like the web of 2001. Now we need knowledge engines, and Google fails there because it can't deal with SEO. No one's making a bunch of webring-indexed personal homepages.
Re: (Score:2)
Re: (Score:1)
Water used for cooling is 100% recyclable and energy comes for free from the sky and atoms.
Re: (Score:2)
So a ban on evaporative cooling, as well as on using aquifer water, should not upset anyone then, I should hope. Outside of those things I largely agree: a closed-loop, water-to-air cooling system is quite efficient these days.
Re: Reasoning (Score:2)
Yes and no.
Water used in water cooling is often released back into the environment, usually into the stream the water was pumped from. Though it seems you do need to be careful HOW you release it. The water comes out quite hot, and if you dump it into the environment hot, it actually damages the ecosystem of the river. So often the water sits in cooling coils before being released at a more reasonable delta.
Few installations seem to operate purely on a closed water loop. I am guessing it is because it is more
Re: (Score:2)
Re: (Score:2)
Those resources could be used to serve PEOPLE instead of this shit.
In our current epoch, LLMs are feasible and therefore worth investigating. Who knows what benefits they may bring, to serve the people we both care about?
There's a story (perhaps apocryphal) of then-British-PM Benjamin Disraeli visiting the laboratory of Michael Faraday, and asking Faraday "what use is electricity?" Faraday replied, "Prime Minister, what use is a newborn baby?"
Re: (Score:2)
Why would we need a new generation of researchers for that? That's baseless
Re: (Score:2)
I do agree there. A few years ago, ChatGPT would be considered an "intern" level. Good enough to fetch stuff, but makes mistakes. Now with the newer LLMs, it is able to do more things, such as generating SCAD models from text. However, it still has a way to go, as the models it does generate may not make sense, or need cleanup.
We have come far with the newer models, but we have come far with a lot of advances, and eventually diminishing returns hit until we go find another technology, perhaps some other
Re: (Score:2)
Hugely useful for software development?
In specific tasks of boilerplate code generation, maybe.
The training data for these code models is open source code repositories. At best you can hope for "average" code.
The model doesn't know which bits of code do what, or whether they do it without bugs.
It doesn't know how "good" the code is.
I'm sure it's really good at reciting common examples to common questions.
Re: (Score:2)
If you're not able to articulate specific tasks, the exact inputs and outputs that are provided and what you desire, and give it enough context to understand the interoperability with your existing system, then you're not a good engineer anyway. Without that all you can expect is boilerplate or leetcode copy-paste, which is obviously not useful to a competent engineer.
Re: (Score:2)
If you are a good engineer, you do the work yourself. You don't rely on software, designed to discard information and recall what's left imprecisely, to do your engineering work for you. Now, we all know that you use these "hallucinating" tools because you brag about it. On the other hand, no one says you're a good engineer but you.
Re: (Score:2)
>If you are a good engineer, you do the work yourself.
Wrong these days. If you're a good engineer you use the tools available to you to get the task done well, and fast.
> You don't rely on software, designed to discard information and recall what's left imprecisely, to do your engineering work for you.
Do what you want and get left behind, then be left confused as to why it happened.
>Now, we all know that you use these "hallucinating" tools because you brag about it.
A moment ago you said it can only
Re: (Score:2)
So now you're just writing your code in LLM-inputs?
May as well write in the target language.
You're still going to have to write all the test cases
And go through the generated code to make sure all the code paths are tested, and make sure the result is what you expect.
Re: Reasoning (Score:2)
I find it useful for the kind of thing you'd be comfortable pushing to an intern with Stack Overflow. And that's not bad.
In no way have these LLMs revolutionized the way I write code. But for many simple tasks, they work reasonably well. And once you only try to leverage them in these situations, you can gain decent productivity!
For me, I find it particularly helpful for solving simple tasks in tech I don't know well. I am an HPC scientist, so often I set up benchmarks on weird code bases, written in a variety o
Re: (Score:2)
The code assistant I use routinely suggests completions that are accurate and that are compatible with surrounding code. Hit tab and you saved a lot of typing. I can compose a paragraph of comments describing what I want to do next and it does a fair job of just writing out the whole thing.
Re: (Score:2)
The code assistant I use routinely suggests completions that are accurate and that are compatible with surrounding code. Hit tab and you saved a lot of typing. I can compose a paragraph of comments describing what I want to do next and it does a fair job of just writing out the whole thing.
Ah, but which is faster: Writing out the paragraph of comments with enough precision for the LLM to do the right thing (or something close enough that you can massage into the right thing) or using simpler word completion/function name completion and writing the code yourself? In my experience, the latter is faster. Your mileage may vary.
Re: (Score:2)
I find that if I write out what I'm wanting to do in comments beforehand, it clarifies my thinking. Then if the assistant creates a reasonable facsimile I can work from it.
Re: (Score:2)
It helps me get a start with some boilerplate that is 70% functional, contains 20% made-up methods, and is 100% inappropriate for the given problem set, but I've still been more productive than without it. I use it to brainstorm and it never lets me down 33% of the time, if you know what I mean.
It's frustrating, infuriating even, but I can't deny I'm better off with it. It's great to have another source of unreliable nonsense mixed in with genius, other than just Google and my friend Dave
Re: (Score:2)
Our current models out-perform graduate students.
That's more a statement that your tests are broken than a statement that the models are working.
Re: (Score:2)
>T-t-the tests are broken!
When you don't like the results just claim the tests are wrong. Like any social "scientist" does.
https://arxiv.org/abs/2311.120... [arxiv.org]
Re: (Score:1)
Spotted the guy too stupid to tell the difference and too arrogant to understand his own limitations. Here's the real threat of "AI", grift by stupid, ethically-challenged people.
Re: (Score:2)
With anger like that you're probably right to fear being replaced by younger, better programmers who are more able to use the tools available to them.
Re: (Score:2)
They aren't good at deeper reasoning but they're good at memorizing and doing simple applications. The problem there, as a smart search engine, is that they memorize insufficiently well and don't know when they're making mistakes.
Re: Thanks science (Score:2)
Yeah, right now AI developers are getting paid stupid amounts. But my guess is that most will pivot or be fired in the next two years.
ELIZA? (Score:5, Insightful)
Sometimes I wonder if, even with all the nodes and capacity that modern LLMs have, we are really that far away from good old ELIZA back in the 70s. We have gone far with CPU, disk, RAM and such, but we may need to go a different route completely for AGI/ASI.
Re: (Score:2)
Since you asked...
ELIZA is/was a program that rather simple-mindedly tried to pass the Turing test by asking general questions and mimicking back certain phrases input by the human interlocutor. It was called ELIZA because the human was instructed to treat the program as though it was a therapist, thus leading to the exchange of phrases with no expected deep-sharing of personal info from the program. It was modestly impressive, in a low-bar 1970s kind of way.
No offense meant to therapists here.
Tell me more...
Oh wait, are
Re: ELIZA? (Score:2)
How do you feel about whoosh?
Re: (Score:1)
That is all wrong.
People talking with an Eliza usually did not know it was a program.
The idea was to draw the person into a conversation and let them reflect on everything themselves.
An Eliza for the English language is perhaps 500 lines of C code.
I once put one into an IRC channel, which spawned a copy of itself and kept a history for everyone talking to her.
People accused me of typing her answers myself.
She was basically an IRC bot that held game-relevant info. And greeted people like: "Yes
Re: (Score:2)
Re: (Score:1)
An Eliza simply returned to you what you told it, converted into a question.
Alice: Bob does not like me.
Eliza: Why do you think that (Bob does not like you)?
Alice: he never says a nice word.
Eliza: tell me more about it (Eliza lost track)
An LLM tries to answer your question; that is the difference.
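The keyword-and-template mechanism that exchange describes can be sketched in a few lines of Python (a toy illustration with a made-up rule table; the original ELIZA used a much larger script of patterns and pronoun reflections):

```python
import re

# toy ELIZA: match a keyword pattern and echo it back as a question;
# anything unmatched gets a generic prompt (Eliza has "lost track")
RULES = [
    (re.compile(r"(.+) does not like me", re.IGNORECASE),
     "Why do you think {0} does not like you?"),
]

def respond(line):
    for pattern, template in RULES:
        m = pattern.match(line)
        if m:
            return template.format(*m.groups())
    return "Tell me more about that."  # fallback when no rule matches

print(respond("Bob does not like me"))      # Why do you think Bob does not like you?
print(respond("he never says a nice word")) # Tell me more about that.
```

There is no model of the conversation at all, just surface pattern matching, which is exactly the contrast the poster is drawing with an LLM that attempts an actual answer.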
Re: wrong study (Score:1)
Cope (Score:2)
O1-mini easily identified that the information was irrelevant and produced the correct answers... it is also the first model with basic advanced reasoning.
Insane resources are being deployed. It is a race to when they stop getting smarter... not there yet.
I have also found them quite sensitive to wording (Score:3)
I haven't done the test recently, so the results may have changed. If you ask the LLM to produce code to solve knapsack, it will produce the standard dynamic-programming solution.
If you describe a problem as a set of objects with weights and values, and you try to select a subset of objects whose summed weight fits within a capacity while maximizing the sum of values, it produces a dynamic-programming solution.
If you present it as a problem where you have a set of zorglubs with foo and bar properties, and you want to select a subset of the zorglubs so that their summed foo is smaller than grandduck while maximizing the sum of bar, it gives you a brute-force algorithm.
So clearly it takes its cues not from the structure of the problem but from the names that get used. Which, fair enough, doesn't mean it is not useful. But virtually all the students in my algorithms class would go: "isn't that just knapsack?"
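For reference, the standard dynamic-programming solution the poster expects is a short routine (a minimal 0/1-knapsack sketch; the item weights, values, and capacity below are just an example, not from the poster's test):

```python
def knapsack(weights, values, capacity):
    # dp[c] = best total value achievable with total weight <= c
    dp = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # scan capacities downward so each item is taken at most once
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

# "zorglubs" with foo (weight) and bar (value); "grandduck" (capacity) = 8
print(knapsack([3, 4, 5], [4, 5, 6], 8))  # prints 10: take the items of weight 3 and 5
```

The point of the renaming experiment is that this structure is identical regardless of whether the inputs are called weights or foos, which is exactly what the brute-force answer misses.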
Re: I have also found them quite sensitive to word (Score:2)
Playing devil's advocate, isn't your class just matching patterns with a much bigger training set than your LLM was?
Re: I have also found them quite sensitive to wor (Score:2)
Actually, I don't think that's what it is. I think they'd recognize the mathematical structure of a knapsack problem and ignore the wording.
Because they know that when looking at these kinds of problems, you have to extract input, constraints, decisions, and objective. And then solve from there.
And once you are in that space, it should be obvious it is knapsack
Turing (Score:2)
Re: (Score:2)
It turns out, humans aren't hard to trick.
Re: Turing (Score:1)
Turns out there's a not insignificant percentage of humans that can't reason either.
Re: (Score:2)
How could it pass a Turing test without reasoning? It seems that an important component of a test for human thinking would focus on reasoning and logic.
The Turing test doesn't really test that. Not directly anyway. The test requires a human's judgement of whether an interlocutor is machine or human. All that needs to happen is that the human judge cannot tell which it is. It's not a formal test of reasoning.
LLM brute-force methods have hard limits (Score:2)
It takes a freaking nuclear power plant to brute-force trillions of floating-point matrix operations. I'd say that's a pretty hard limit. However AI is actually going to be done, it can't be done like that. I experimented with a very simple AI rules engine decades ago, and ran right into the computational complexity wall after just a few dozen levels of reasoning. When I showed the results to the customer, he told me he wasn't looking for a machine to replace human reasoning, he wanted a machine to just run
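The wall the poster describes is easy to put numbers on: if each derived fact can trigger b rules at every level, a naive chaining search examines on the order of b^d candidates after d levels (a back-of-the-envelope sketch; the branching factor here is illustrative, not from the original system):

```python
def search_space(branching, depth):
    # naive rule chaining: candidate inferences grow geometrically with depth
    return branching ** depth

# even a modest branching factor of 5 is astronomical at "a few dozen levels"
for depth in (12, 24, 36):
    print(depth, search_space(5, depth))
```

That geometric growth, not hardware, is why a few dozen levels of reasoning was already a wall decades ago.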
Separate Issues (Score:2)
"Can I get an LLM to lay out a reasoning strategy?" and "can I trip it up with words that would not trip up most postdocs?" are separate questions.
Some of the LLM's can do the first part now beyond simple next-token generation.
A buddy of mine asked one how many calories are in a typical male hippo and it described its plan to calculate it, did the math right, and advised against adding hippo to the diet.
pattern recognition engine (Score:2)
And that the answers it gives are from a best-match pattern search?
IMHO the AI won't ever achieve the ability to reason. (Not saying AI doesn't have good uses.)
Incompetent use of the word "prove" (Score:2)
So, a study takes a look at two current implementations of LLMs and proclaims that all other existing LLMs don't work and furthermore no other LLMs that will ever be created in the future will work. Why? Because all LLMs are the same, always have been, and always will be in the future. It's not like the hundreds of billions of dollars in hardware are being used to perform research into different LLM architectures, models, and processes.
Yes, the word "prove" is incompetently used by the article writer and
Re: (Score:2)
I'm sorry that reality ruined your silly AI fantasy. I'm amazed that it took this long.