Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities

Apple's AI research team has uncovered significant weaknesses in the reasoning abilities of large language models, according to a newly published study. MacRumors: The study, published on arXiv [PDF], outlines Apple's evaluation of a range of leading language models, including those from OpenAI, Meta, and other prominent developers, to determine how well these models could handle mathematical reasoning tasks. The findings reveal that even slight changes in the phrasing of questions can cause major discrepancies in model performance that can undermine their reliability in scenarios requiring logical consistency.

Apple draws attention to a persistent problem in language models: their reliance on pattern matching rather than genuine logical reasoning. In several tests, the researchers demonstrated that adding irrelevant information to a question -- details that should not affect the mathematical outcome -- can lead to vastly different answers from the models.
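The kind of perturbation described can be sketched in a few lines. This is a hypothetical illustration in the spirit of the study (the numbers and wording here are made up for the sketch): the appended clause is irrelevant, so the correct answer must not change.

```python
# Hypothetical perturbation in the spirit of the study: the appended
# clause is irrelevant to the arithmetic, so the correct answer must
# not change -- yet the study reports that models often let such
# distractors alter their output.
base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have?")
distractor = "Five of them were a bit smaller than average. "
perturbed = base.replace("How many", distractor + "How many")

correct = 44 + 58  # the irrelevant detail leaves the answer at 102
```

A logically consistent solver would return `correct` for both `base` and `perturbed`; the study's finding is that model accuracy drops on the perturbed form.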
This discussion has been archived. No new comments can be posted.

  • Uh - duh? (Score:3, Informative)

    by peterww ( 6558522 ) on Tuesday October 15, 2024 @02:25PM (#64866731)

    AI does not reason. It predicts word ordering. Reasoning requires knowledge bases with semantic knowledge and analysis. Word ordering just puts jumbles of symbols in order.

    • An LLM is: what if I gave my smartphone keyboard autocomplete unlimited resources and trained it on everything ever written by anyone? Just like, money is no issue, give it all the processing power and memory and data... what could it do?

      Turns out, a lot. But, it is still fundamentally limited by the whole starting point of building the best auto-complete.
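A toy illustration of the "best auto-complete" idea, assuming nothing but a tiny made-up corpus: predict the next word as the most frequent follower seen in training. No semantics, just counts.

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a corpus,
# then "autocomplete" with the most frequent follower seen in training.
corpus = "the cat sat on the mat the cat ate the fish".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def autocomplete(word):
    # most common continuation seen in training; no reasoning involved
    return followers[word].most_common(1)[0][0]

print(autocomplete("the"))  # → "cat" (seen twice, vs. "mat"/"fish" once)
```

An LLM replaces the frequency table with a neural network and the single preceding word with a long context window, but the output is still "the most plausible continuation", not a derived conclusion.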
      • by Moryath ( 553296 )

        "Trained" is itself the wrong terminology. "Training" implies learning, which implies intelligence. LLMs are a giant statistical-probability database with an impressive depth of connection between each individual tokenized node, but nowhere in there does any actual intelligence or reasoning ability exist.

        The whole term "artificial intelligence" is the problem. It, and the use of terms like "training," lead people to anthropomorphize what they shouldn't.

        • Bit of an old man yelling at clouds here. Programming relies on a lot of metaphors to help us understand the purpose of things.

          I do not think semaphores are using little colored flags to control my threads,
          which I do not believe to be strings bound on spools to divide my jobs,
          which I do not believe to be gainful employment on the part of my code.

          And:
          Objects are not things I can hold.
          Models are not toy planes
          Servers don't bring you your food
          Links are not part of a chain
          Calling functions does not require a p

        • The whole term "artificial intelligence" is the problem

It's a term that never really had any practical meaning other than a program that responds to inputs. In the 80s and 90s, AI was your chess opponent, which basically did fancy heuristics with a static ruleset. It never was intelligent, and still isn't. When most companies describe their product as AI, it's not even an LLM, it's just a variation of the ol' chess opponent.

          Though I'd have to slightly disagree about your training comment. For LLM, yes, it's not training so much as just adding data points for the d

        • by Ksevio ( 865461 )

          "Training" is an accurate and correct term to use here. Not only is it the common terminology in the machine learning field for decades, but it describes what is happening.

LLMs aren't just databases; they're weighted neural networks that produce a given result based on a given input. Training adjusts the weights so the model properly produces that result. Without the training, the model produces gibberish.
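The "training adjusts the weights" point can be shown with a deliberately tiny sketch (a single-weight linear model, not an LLM): before training the weight is arbitrary and the output is garbage; gradient descent nudges it until it produces the right result.

```python
# Minimal sketch of "training as weight adjustment": fit y = w * x
# with gradient descent on squared error. Before training the model
# outputs 0 for everything; afterwards w converges toward the true 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # samples of y = 3x

w = 0.0    # untrained weight
lr = 0.02  # learning rate
for _ in range(200):
    for x, y in data:
        grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
        w -= lr * grad

# w is now very close to 3.0
```

Real training does exactly this over billions of weights at once; nothing is "stored" verbatim, which is why "database" is the wrong mental model.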

        • by jythie ( 914043 )
          One of the issues is that what has become known as AI, well, isn't. The stats people adopted it when they found the cool factor brought more attention and VC funding.
        • That's why I call it artificial ignorance.

          You would have to program it to be this stupid. [reddit.com] /s

  • by JBMcB ( 73720 )

    There is no reasoning. It's pattern matching based on keywords and weights feeding into Markov chains. Most LLMs also have some inferencing ability hardwired in there by humans, but they don't make those inferences on their own.

    • Re:Reason (Score:5, Insightful)

      by Baron_Yam ( 643147 ) on Tuesday October 15, 2024 @02:32PM (#64866767)

      The funny thing is... Somehow our ability to reason is an emergent property of weighted connections in a network. Because we don't understand how that happens, we don't know why it isn't happening with the AI we have created, or if it's even possible with the setups we're using. We also don't know if it's impossible for a sufficiently complex version of an existing AI system to do it.

Probably impossible; I suspect there's more to it than just 'embiggen it and it will happen'.

A traditional neural network isn't the best approximation of actual neurons, so we don't get something that works in quite the same way. The hardware we run these programs on isn't like an actual brain either. However, when we do create software that actually models a physical brain, it does behave like one. There have been studies attempting to recreate a worm's brain in software, since it has a small number of neurons and can be fully mapped out, and dissecting worms isn't going to raise many eyebrows.
        • I want to up vote this as informative and insightful at the same time.
          • I first want to see the papers the grandparent comment is based on.
            • I am not sure there have been studies like that, but not everything really needs a study to be correct. It is obvious, but not many see this.

If you have a mind that was trained over many years to take care of, protect, and feed its own physical shell, how much would that mind be worth without the shell? How similar would another mind be if it were trained without any need to take care of a physical shell? Without that shell? Without eyes, without ears, without hands and legs? I am sure, it will be quite d

              • If such studies existed it would make the argument more persuasive. The accurate simulation of even the full synaptic structure of a worm seems beyond our capabilities at present. Unlike a (synchronous) neural network in a computer, a biological brain operates asynchronously via chemical charges and discharges. We know almost nothing about how to do reliable computations in this way, it is like having a regular computer without a clock.
      • If I gave you a 5 gallon bucket and a 2 gallon bucket, how many buckets did I give you?

      • by jythie ( 914043 )
"Implemented on" and "emergent property of" are not necessarily the same thing.
      • by giampy ( 592646 )

        >> Somehow our ability to reason is an emergent property of weighted connections in a network.

Our weighted network has been honed by hundreds of millions of years of evolution; it's not just random.

        But even more importantly, the ability to reason is also an emergent property of years and years of the network _actually interacting with the environment_ through seeing, touching, hearing and so on. That is VERY different than just seeing and predicting words.

      • Your error here is that you are begging the question.

        By stating "Somehow our ability to reason is an emergent property of weighted connections in a network" you are making a very strong assumption about what produces the ability to reason in humans. Begging the question is another way to say you are assuming the conclusion.

        Once you assume the strong conclusion that reason is an emergent property of a simple kind of network, then you are trapped in the paradoxical observation that reason is not in fact em

      • we don't know why it isn't happening with the AI we have created, or if it's even possible with the setups we're using.

        It is explicitly impossible to create reasoning out of the "AI" techniques that we have today. The techniques that we have will likely be a part of a 'reasoning' AI, but there is nowhere near enough infrastructure to support reasoning currently.

    • by narcc ( 412956 )

      There is no reasoning.

      Correct.

      It's pattern matching based on keywords and weights feeding into Markov chains.

      Incorrect. LLMs are non-Markovian.

      • Sorry, you are wrong. The LLMs are in fact Markovian, since they have a finite state space and a bounded input memory.

        Even if you were to consider RAGs instead of pure chatbots, you would still have a Markov chain in an external environment, which makes them Markov Decision Processes (MDPs or even POMDPs).

        The important bit is the Markovian structure, which arises because the input window is limited and the system has well defined state transition probabilities.
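The "bounded input memory" argument can be sketched as a toy: if you define the state to be the entire context window, the next-token distribution is a fixed function of that state alone, which is exactly the Markov property. (This is an illustration of the argument, not a real LLM; the window here is just two tokens.)

```python
# Toy model of the argument: take the state to be the whole context
# window (here, the last K tokens). The next-token distribution is a
# fixed function of that state alone, so any older history is
# irrelevant -- the Markov property, with the window as the state.
K = 2

def next_token_probs(history):
    state = tuple(history[-K:])  # bounded input memory
    # any fixed map from state to a distribution will do for the sketch
    if state == ("a", "b"):
        return {"a": 0.7, "b": 0.3}
    return {"a": 0.2, "b": 0.8}

# two histories that differ only OUTSIDE the window give identical
# next-token distributions
p1 = next_token_probs(["b", "b", "a", "b"])
p2 = next_token_probs(["a", "a", "a", "b"])
```

Whether this makes an LLM "Markovian" comes down to what you call the state: over single tokens it is not, over full windows it is, which is what the two posters are talking past each other about.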

        • by narcc ( 412956 )

          You should get a refund for your CS degree. It's clearly defective.

LLMs very obviously violate the Markov property: P(X_{t+1} = s | X_t = s_t, ..., X_0 = s_0) = P(X_{t+1} = s | X_t = s_t)

          More on this after the break...

          Even if you were to consider RAGs instead of pure chatbots, you would still have a Markov chain in an external environment

Not according to you. By introducing RAG, you no longer have "well-defined state transition probabilities" and your "input window" is arguably extended indefinitely. Oops!

          You're making the same mistake that every undergrad makes when they confuse the practical with the theoretical and completely miss the point.

          • Are you for real? If you don't even know the basic definition of a Markov process you shouldn't comment. Maybe take a first year course in probability and come back next year.

            The formula you give is not the correct definition, you're missing a whole class of other cases. But given your attitude, I'll let you figure it out for yourself.

  • by gweihir ( 88907 ) on Tuesday October 15, 2024 @02:26PM (#64866735)

    And please stop claiming "faults" in "LLM reasoning abilities". LLMs have no reasoning abilities and pattern matching is not a valid substitute.

    • I think it's fair for Apple to point this out and use "LLM reasoning abilities". You're right in what you're saying, but when you have people who are claiming they're on the path to making "General Artificial Intelligence", or we're "4 years away from AI that will eliminate 50% of jobs", the suggestion is that the AI is actually able to reason; that it's truly intelligent. So it's good that someone with the right resources and the ability to know what's going on can use the language of those hyping AI and
      • by gweihir ( 88907 )

        Hmm. I do admit I sometimes forget the low "reasoning ability" level many people operate on.

    • by war4peace ( 1628283 ) on Tuesday October 15, 2024 @02:53PM (#64866827)

      Question is, do Slashdot editors have enough reasoning abilities, considering the dupefest here?

      • by sconeu ( 64226 )

        Maybe Apple could do a study revealing the critical flaws in Slashdot editors' "reasoning" abilities?

      • by phfpht ( 654492 )
NO. But being just as bad as humans is not a validation of generative AI.
    • The headline/description is garbage.
      but
Apple needs to temper people's expectations when Sam Altman is writing things like:
      " ... it’s very possible that creativity and what we think of us as human intelligence are just an emergent property of a small number of algorithms operating with a lot of compute power"
      and
      "We decry current machine intelligence as cheap tricks, but perhaps our own intelligence is just the emergent combination of a bunch of cheap tricks."
      Even Mira Murati's papers point in th

  • I expect we'll see a response from Sam Altman and his ilk within days talking about how reasoning ability is overrated anyway, and the artificial intelligence is superior to supposed "real" intelligence on such a level that we simply aren't equipped to understand the reasoning ability of such a superior creation.

My god, this is stupid. Reasoning ability in LLMs? Just as well say every database in existence has reasoning ability just because you can type a somewhat English-looking phrase in (SELECT * FROM $F

The non-deterministic nature of AI language models makes it impossible to offer guarantees, and their results cannot be insured financially or legally. For example, if the AI sends a 1-in-a-million mass email that is highly offensive, the AI producer/maintainer probably has language stating they're not liable.
    • by dfghjk ( 711126 )

      What "non deterministic nature"? And why are "guarantees" of "results" important?

      "For example, If the AI sends a 1 in a million mass email that is highly offensive, the AI producer/maintainer probably has language stating they're not liable."

      They'll have that anyway. It's a problem of legal accountability, not a characteristic of LLMs that you cannot accurately describe.

    • I believe that they are fully deterministic, but generation runs are seeded with random numbers intentionally.

      • 2+2=4. Fully deterministic. Always yields same results given 2+2=? as input.

        Vs.

        LLM given same user input multiple times yielding different results each time? Non deterministic.

By definition, if a random number generator is a key part of your algorithm, it is not deterministic. This should be self-evident.

        • I'll use image generators as an example because even though it's a different algorithm, they work in a lot of the same ways.

          You put in a text prompt and get a different image every time, right? No. You can re-run the same prompt with the same seed and get exactly the same picture out of it. You just have to have control over the model to enable that. So maybe not Bing Image Generator but definitely Stable Diffusion.

          It's pseudorandom numbers, so yes - it's deterministic.
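The seeded-reproducibility point can be sketched directly. This is a toy token sampler, not any real model's API: selection driven by a pseudorandom generator is a pure function of (distribution, seed), so fixing the seed fixes the output.

```python
import random

# Sketch of "seeded runs are reproducible": token selection driven by
# a pseudorandom generator is a pure function of (distribution, seed).
def sample_token(probs, seed):
    rng = random.Random(seed)  # explicit seed, no hidden global state
    r = rng.random()
    acc = 0.0
    for token, p in probs.items():
        acc += p
        if r < acc:
            return token
    return token  # guard against floating-point rounding at the edge

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.2}

# same seed -> same token, every run; vary the seed to vary the output
assert sample_token(probs, seed=42) == sample_token(probs, seed=42)
```

Hosted chatbots hide the seed (and other state) from you, which is why they *look* non-deterministic from the outside even though the underlying computation is not.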

        • by narcc ( 412956 )

          LLM given same user input multiple times yielding different results each time?

          If you're talking about giving the same prompt to a chat bot twice in a row, I should point out that you're not presenting the same input multiple times.

          By definition if a random number generator is a key part of your algorithm, it is not deterministic.

Slow down there, cowboy. We need to have a little talk about determinism, random number generators, and what constitutes an input. For clarity, when discussing LLMs specifically, I'm going to separate the model, the bit that spits out a list of probabilities, from the bit that selects the output token and the loop.

          A function is deterministic if it always pro

          • Ok, fair point when taken at the unit level. I'll buy that. Yes, if I control the entire system, including any random seeds, the model doesn't change from previous inputs, etc, then yes, I should get the same output each time. Agreed.

            But, if we look at the things from the outside the way most users will interact, the system as a whole is not under their control and the output they get for the same input is not guaranteed to be the same on subsequent runs.

            So, yes, to a researcher they are (or can be made

          • I was going to post roughly the same thing, but you covered it so well, no need. I hope you get modded up.

            But, it made me think of a few additional points:

            1
            I have a post below, "Similarity to fractals and NLD" in which I explain that I have no experience with language AI, but I do with imagery AI. So, going off of that, when a stable diffusion image is generated, it starts with a field of randomly generated noise, then the prompts and the language model seek patterns emerging. That gives you roughly 1024

    • All computer programs are deterministic unless they use external phenomena to control their execution. Just because they're so big and complex that we can't easily work out their state doesn't mean they've stopped being deterministic.
  • They did the same a couple of days ago:

    https://apple.slashdot.org/sto... [slashdot.org]

    • Apple is thorough.
      Slashdot editors, not so much.

    • "Hey, LLM, has this article been posted already?"

      See, AI could improve /. Maybe it's only as smart as a cat but if that cat can spot dupes that's something editors miss.

      Humans use cats to hunt mice too. Not because cats are good at anything else but being mean, but they excel at that. Same with LLM pattern matching.

      Apple always shits on tech they're way behind on - until they "revolutionize" it and it's the next best thing. Remember when fanbois were worshiping the Lightning Cable?

      They'll snap-to on AI

  • "Generative AI" is simply not capable of what we would universally consider reasoning. LLMs and other "reflexive" pattern-matching systems may be a stepping stone on the way to AGI, or, they may be a cul-de-sac, and won't have anything at all to do with AGI, if such a thing ever comes to be.

  • I mean, take any formal math proof. You have a set of transformations you can make to existing statements, a set of existing statements, and you apply them to get the form you want. All of this is realizable within a neural network, so any output can only be the product of an input plus a transformation.

  • ...they have NO reasoning ability
    It's all statistics and clever math

  • I ask how much is 3+5?

    If I change just one character, the '3' to '4', I get a completely different answer.

  • When they learn merely by the words that other people have posted, their 'reasoning' can only be a logical calculation within the domain of what other people have said. But 'reasoning' in the way the term is meant means novel thought, and therein lies the rub.
  • by RossCWilliams ( 5513152 ) on Tuesday October 15, 2024 @04:02PM (#64867019)
    Have AI take an IQ test. That's the way we determine "intelligence" in humans. If you want to define it differently you need to come up with a different measure. Or admit you are arguing about an ill-defined term that mostly is used to describe how well someone's thinking conforms to a particular social class.
  • I have not studied AI in great technical detail. I am not completely stupid on the subject, but just pretty basic. My comment here comes not from detailed knowledge or insight about AI, but from something that I do know intimately and deeply well - fractals and non-linear dynamics.

    Also, I have not used ChatGPT or any other text writing AI agent. But, I have messed around a bit with the image generators, to see what it is all about. On the image-generation subject, I have studied the AI process a little

  • In my experience you don't even need to change prompts to get wildly different results. Simply dump context and try again.

  • AI Study Reveals Critical Flaws in Apple’s Logical Reasoning Abilities

"Life is a garment we continuously alter, but which never seems to fit." -- David McCord

Working...