Apple Researchers Challenge AI Reasoning Claims With Controlled Puzzle Tests

Apple researchers have found that state-of-the-art "reasoning" AI models like OpenAI's o3-mini, Gemini (with thinking mode enabled), Claude 3.7, and DeepSeek-R1 face complete performance collapse [PDF] beyond certain complexity thresholds when tested in controllable puzzle environments. The finding raises questions about the true reasoning capabilities of large language models.

The study, which examined models using Tower of Hanoi, checker jumping, river crossing, and blocks world puzzles rather than standard mathematical benchmarks, found three distinct performance regimes that contradict conventional assumptions about AI reasoning progress.

At low complexity levels, standard language models surprisingly outperformed their reasoning-enhanced counterparts while using fewer computational resources. At medium complexity, reasoning models demonstrated advantages, but both model types experienced complete accuracy collapse at high complexity levels. Most striking was the counterintuitive finding that reasoning models actually reduced their computational effort as problems became more difficult, despite operating well below their token generation limits.

Even when researchers provided explicit solution algorithms, requiring only step-by-step execution rather than creative problem-solving, the models' performance failed to improve significantly. The researchers noted fundamental inconsistencies in how models applied learned strategies across different problem scales, with some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios.


Comments Filter:
  • We are nowhere near addressing AI alignment; this means that humanity still has time to find a solution.
  • Apple's new paper on GSM-Symbolic shows that today's best language models crumble when a grade-school math word problem is re-phrased -- even if the logic is identical. It echoes 1969, when Minsky & Papert proved that a single-layer perceptron could never learn XOR.

    That blockade vanished in 1986 with backprop and nonlinear hidden layers. My bet: LLMs won’t need two decades to cross the reasoning gap. Why? Agents that call scratchpad Python or GraphRAG pipelines already externalize formal r

    • by Gilmoure ( 18428 )

      What does "turning the model into a planner rather than a prover" mean?

      • By "prover", I meant that Agentic AI is not a single-shot execution engine, like LLMs of today or theorem provers of yore. By "planner" I meant externalizing the logic/reasoning. Perhaps I was aiming too much for alliteration.
        • by Gilmoure ( 18428 )

          Gotcha, thanks!

          I haven't kept up with actual neural network research and terminology; just the popular AI buzzwords.

  • it's a crap shoot:

    some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios

    So let's make something that we don't fully understand, whose modus operandi doubles as emergent behaviour, and then start relying on it for activities ranging from education to infrastructure. Sounds like a great idea!

    • A more typical use case for now would be using AI to generate some code, and then testing/fixing the code. Not running the AI every time to solve an instance of the problem.

      Side note, I wonder if this paper compared AI performance to human performance. You think people can do Tower of Hanoi consistently?

      • by HiThere ( 15173 )

        How tall? Most people can do 3-ring towers consistently. I don't know anyone who can do eight rings. (I've also seen versions with 4 pegs, but I don't know what that does to the math.)

        • The model performance really started collapsing around 10.
          Up to then, they had what I would call "far-fucking-better-than-most-humans" performance.
          • Isn't Tower of Hanoi a deterministic problem? Can't you just follow a simple algorithm to solve any ToH problem? I thought the "game" was to try to do it in as few moves as possible. But with infinite search time, I thought it was a pretty straightforward problem.

            • It indeed is, and yes, you can.
              The problem is the minimum number of moves required for 10 discs, which is 1023.
              The difficulty is not making a mistake, or noticing when you do in time to avoid having to roll back.

              It's a perfectly solvable problem for a computer. For humans, the more discs you have, the harder it gets to do.
              I'm not sure there's a person alive who has ever solved a 10-disk Towers of Hanoi.

              Humans aren't particularly good computers.
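
              For reference, the "simple algorithm" discussed above is the textbook recursion; a minimal Python sketch (function and variable names here are purely illustrative) looks like this:

                  def hanoi(n, source, target, spare, moves):
                      """Move n discs from source to target using one spare peg."""
                      if n == 0:
                          return
                      hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller discs on the spare peg
                      moves.append((source, target))              # move the largest remaining disc
                      hanoi(n - 1, spare, target, source, moves)  # stack the smaller discs back on top

                  moves = []
                  hanoi(10, "A", "C", "B", moves)
                  print(len(moves))  # 2**10 - 1 = 1023, the minimum move count cited above

              The recursion never makes a wrong move, which is exactly the discipline that is hard for a human to sustain across a thousand-plus steps.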
  • It makes sense. (Score:5, Interesting)

    by devslash0 ( 4203435 ) on Monday June 09, 2025 @10:26AM (#65437475)

    Complex puzzles require deep reasoning.

    As humans, we are programmed to use our brains and multi-paradigm experience to quickly trim down the decision tree of obviously-wrong solutions. As we go down the complexity depth, we prune more silly solutions and just refine the end outcome; we become better at homing in on the solution.

    AI models are different in this regard. They are just statistical probability machines. The greater the complexity depth, the more variables they need to consider in the equation, and without actual intelligence and perception of the problem, they are fundamentally unable to accurately and efficiently discriminate against obviously wrong solutions; they become paralysed and require more and more computational power with no guarantee of a good outcome.

    • by HiThere ( 15173 )

      So what you're saying is that the success of Alpha-Beta pruning depends on the evaluation function. Yeah, that's correct. Getting the evaluation function, though, can be a real problem.
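
      For context, a generic alpha-beta search is sketched below in Python; evaluate and children are placeholder callbacks, and the quality of the pruning hinges entirely on that evaluate function:

          import math

          def alphabeta(state, depth, alpha, beta, maximizing, evaluate, children):
              """Minimax with alpha-beta pruning; evaluate() scores leaf positions."""
              kids = children(state)
              if depth == 0 or not kids:
                  return evaluate(state)  # everything downstream depends on this heuristic
              if maximizing:
                  value = -math.inf
                  for child in kids:
                      value = max(value, alphabeta(child, depth - 1, alpha, beta, False, evaluate, children))
                      alpha = max(alpha, value)
                      if alpha >= beta:
                          break  # beta cutoff: the minimizing player will never allow this branch
                  return value
              value = math.inf
              for child in kids:
                  value = min(value, alphabeta(child, depth - 1, alpha, beta, True, evaluate, children))
                  beta = min(beta, value)
                  if beta <= alpha:
                      break  # alpha cutoff: the maximizing player already has a better option
              return value

      A good evaluate() ranks positions well and triggers early cutoffs; a poor one prunes almost nothing, which is the "real problem" the parent comment points at.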

  • It is worth noting that even the easiest puzzles here are puzzles which many, if not most, humans cannot solve. The fact that we're now evaluating AI reasoning based on puzzles above the human baseline should itself be pretty alarming. But instead we've moved the goalposts and are reassuring ourselves that the AIs cannot easily solve genuinely tricky puzzles.
    • by evanh ( 627108 )

      I think it's the opposite. These were straight-up reasoning problems. No complex maths involved.

      i.e.: When the LLMs have no templates to paste from, they go random.

      • No. The models successfully solved Towers of Hanoi up to ~10 disks.
        I'd pay money to see a human make all 1023 required moves correctly.

        To call such a puzzle "tricky" is the understatement of the decade.
    • It is worth noting that even the easiest puzzles here are puzzles which many, if not most, humans cannot solve.

      At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.

      • The large majority of humans in the US put someone in office who has no interest in keeping his story straight. I think we all need to re-evaluate our expectations of the general world population.
      • At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.

        Yeah, all you have to do is follow the algorithm in your head, no counting on fingers or paper and pencil. Easy peasy.

      • You think a large number of humans could solve a 10-disk Towers of Hanoi? All 1023 moves, without mistake?

        Fascinating. I'd wager you couldn't successfully solve 7 disks, even with the algorithm in front of you.
  • Awesome! Professional puzzle solvers won't be collecting unemployment in the short term. The bad news is that unemployment will likely be broke when toasters can solve puzzles 5 or 10 years later...

    • My title is Linux Systems Administrator, but much of the time I feel like it's Professional Puzzle Solver*. Does that count?

      *although most of the time I feel more like it's either Nagger In Chief or Everybody's Gofer.

  • I failed puzzles because I don't waste my time on them. I like real-life puzzles. Maybe the AIs are just bored.
  • will be able to tell us, no need to give it tests, talk with it for a few minutes and you'd know if it's from Florida or not

  • LLMs are really good at stuff, better and faster than humans, as long as the complexity isn't much more than ~200 LOC (lines of code). At 250-300 LOC, things start falling apart quickly. Sometimes (1/50) you'll get lucky and it'll pop out 400 LOC without major errors, but that seems to be the absolute limit of the current statistical model family everyone is using.

    LLMs are really good at analyzing and summarizing text, though; they have no problem analyzing 20-30 page PDFs of economic or financial data.

  • They can't compete, so they are doing the next best thing - trying to prove to everyone that the buzz around the other companies isn't as big as it's made out to be.

  • So exactly like real humans? A lot of humans cannot pass those puzzles either. But then again, you expect AI to pass those puzzles easily. In contrast to humans, though, AI can actually get better, whereas we regular humans cannot. Yeah, there are and always will be Wunderkinder...
  • I've mentioned this before, but I had Gemini, ChatGPT, and Claude jointly design me an aircraft, along with its engines. The sheer intricacy and complexity of the problem is such that it can take engineers years to get to what all three AIs agree is a good design. Grok took a look at as much as it could, before running out of space, and agreed it was sound.

    Basically, I gave an initial starting point (a historic aircraft) and had each in turn fix issues with the previous version, until all three agreed on co
