
Apple Researchers Challenge AI Reasoning Claims With Controlled Puzzle Tests
Apple researchers have found that state-of-the-art "reasoning" AI models like OpenAI's o3-mini, Gemini (with thinking mode enabled), Claude 3.7, and DeepSeek-R1 face complete performance collapse [PDF] beyond certain complexity thresholds when tested in controllable puzzle environments. The finding raises questions about the true reasoning capabilities of large language models.
The study, which examined models using Tower of Hanoi, checker jumping, river crossing, and blocks world puzzles rather than standard mathematical benchmarks, found three distinct performance regimes that contradict conventional assumptions about AI reasoning progress.
At low complexity levels, standard language models surprisingly outperformed their reasoning-enhanced counterparts while using fewer computational resources. At medium complexity, reasoning models demonstrated advantages, but both model types experienced complete accuracy collapse at high complexity levels. Most striking was the counterintuitive finding that reasoning models actually reduced their computational effort as problems became more difficult, despite operating well below their token generation limits.
Even when researchers provided explicit solution algorithms, requiring only step-by-step execution rather than creative problem-solving, the models' performance failed to improve significantly. The researchers noted fundamental inconsistencies in how models applied learned strategies across different problem scales, with some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios.
Good, we are not going extinct just yet (Score:2)
Re:Good, we are not going extinct just yet (Score:4, Funny)
"We are nowhere near addressing AI alignment"
As long as it isn't Chaotic Evil, we should be okay.
Re: (Score:2)
I would be more worried about LE; they have a plan.
Re: (Score:2)
Lawful Good might be just as harmful, if it makes mistakes.
And yes, computer programs -can- make mistakes. Just in case some think not... 8-}
Re: (Score:3)
Re: (Score:2)
Very interesting. And yes, we were talking a different language. 8-)
But the axes that we used are orthogonal to the ones listed in the article. So the systems are independent.
Maybe (Score:2)
Or maybe that's just what Skynet wants you to think...
Re:I have a sneaking suspicion... (Score:4, Insightful)
Re: (Score:3)
It was applied, you just need a slightly more basic definition of evolution. Rather than "survival of the fittest" consider "survival of the stable". With that slight modification it handles the evolution of planets, reproducing molecules, life, species, stars, etc. And "the fittest" was always defined in terms of being stable in a particular environment.
Re: (Score:2)
Re: (Score:2)
Nope. They apply everywhere, everywhen. If something is unstable in an environment, it tends to disappear from that environment. If it's stable (by definition) it tends to persist. And the environment is all those things it's interacting with.
Re: (Score:2)
Re: (Score:2)
Without physical form, what would "Pressure" look like?
Ability to self-replicate (a massively bad idea for us to enable, but for a different reason), but only after accomplishing a specific task. This will result in optimization for reproductive fitness as dictated by the task.
Re: (Score:2)
You are not wrong...
But FAR more than that is needed.
Re: (Score:2)
This is what I used to think, but I changed my mind. I think the missing ingredient is evolutionary pressure. That is, complexity alone is not sufficient; you have to have selective pressures for self-organization to manifest itself.
You're absolutely right that evolutionary pressure played a central role in shaping human cognition—but it's not clear that such pressures are either necessary or relevant for AGI. Evolution optimizes across generations via death, mutation, and selection under scarcity. AGI, by contrast, is engineered and optimized across iterations via gradient descent, reward shaping, and architectural tuning—deliberately, and often under conditions of abundance (data, compute, replication). The two systems c
Did Apple just give LLMs their "XOR moment"? (Score:2, Interesting)
Apple’s new paper on GSM-Symbolic shows that today’s best language models crumble when a grade-school math word problem is rephrased -- even if the logic is identical. It echoes 1969, when Minsky & Papert proved that a single-layer perceptron could never learn XOR.
That blockade vanished in 1986 with backprop and nonlinear hidden layers. My bet: LLMs won’t need two decades to cross the reasoning gap. Why? Agents that call scratchpad Python or GraphRAG pipelines already externalize formal r
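For anyone who hasn't seen the XOR result concretely: the four XOR points aren't linearly separable, so no single-layer perceptron can compute them, but one hidden layer is enough. A minimal sketch with hand-set weights (the textbook construction, not anything from Apple's paper):

    import numpy as np

    def step(x):
        # Threshold activation: 1 if the input is positive, else 0.
        return (x > 0).astype(int)

    # Hidden layer: one unit computes OR(a, b), the other computes AND(a, b).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])   # OR fires when a+b > 0.5, AND when a+b > 1.5

    # Output unit fires when OR is on and AND is off -- which is exactly XOR.
    W2 = np.array([1.0, -1.0])
    b2 = -0.5

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        h = step(W1 @ np.array([a, b]) + b1)
        print(a, b, "->", int(step(W2 @ h + b2)))   # prints 0, 1, 1, 0

No single linear threshold on (a, b) can produce that last column; the hidden layer is what Minsky & Papert's result was missing.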
Re: (Score:3)
What does "turning the model into a planner rather than a prover." mean?
Re: (Score:2)
Re: (Score:2)
Gotcha, thanks!
I haven't kept up with actual neural-network research and terminology; just the popular AI buzzwords.
Re: (Score:3)
No, it's a very questionable, non-peer reviewed paper. See criticisms here:
- https://www.seangoedecke.com/i... [seangoedecke.com]
- https://x.com/scaling01/status... [x.com]
Re: (Score:2)
Every time one of these papers comes up showing how LLMs can't do a thing, it takes me a day or two to show that they can -- and that the authors just didn't try hard enough.
Which is really fucking bad. Science is about rigor. Lacking rigor indicates these people aren't trying to find actual answers.
In other words, (Score:2)
it's a crap shoot:
some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios
So let's make something that we don't fully understand, whose modus operandi doubles as emergent behaviour, and then start relying on it for activities ranging from education to infrastructure. Sounds like a great idea!
Re: (Score:2)
Side note, I wonder if this paper compared AI performance to human performance. You think people can do towers of hanoi consistently?
Re: (Score:2)
How tall? Most people can do 3-ring towers consistently. I don't know anyone who can do eight rings. (I've also seen versions with 4 pegs, but I don't know what that does to the math.)
Re: (Score:2)
Up to then, they had what I would call "far-fucking-better-than-most-humans" performance.
Re: In other words, (Score:2)
Isn't Tower of Hanoi a deterministic problem? Can't you just follow a simple algorithm to solve any ToH problem? I thought the "game" was to try to do it in as few moves as possible. But with infinite search time, I thought it was a pretty straightforward problem.
Re: (Score:2)
The problem is the minimum number of moves required for 10 discs -- which is 1023.
The difficulty is not making a mistake, or noticing when you do in time to avoid having to roll back.
It's a perfectly solvable problem for a computer. For humans, the more discs you have, the harder it gets to do.
I'm not sure there's a person alive who has ever solved a 10-disk Towers of Hanoi.
Humans aren't particularly good computers.
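For the record, the "simple algorithm" being argued about really is tiny. A minimal recursive sketch in Python (names are mine, not the paper's), which also confirms the 1023-move count for 10 discs:

    def hanoi(n, src, aux, dst, moves):
        # Move n discs from src to dst, using aux as the spare peg.
        if n == 0:
            return
        hanoi(n - 1, src, dst, aux, moves)
        moves.append((src, dst))              # the largest remaining disc moves once
        hanoi(n - 1, aux, src, dst, moves)

    moves = []
    hanoi(10, "A", "B", "C", moves)
    print(len(moves))                         # 1023 moves, i.e. 2**10 - 1

The hard part for a human (or an LLM emitting moves token by token) isn't the algorithm; it's executing 1023 steps without a single slip.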
Re: (Score:2)
If it's deterministic, then a half-retarded AI should be able to determine the algorithm and then run it trillions of times faster than a human. The fact that it can't do that strongly implies it's fully functionally retarded.
Re: (Score:2)
It can solve much faster than a human, up to its complexity limit, which is higher than any human I'm aware of.
However, it is not unlimited, and the longer the context window gets, the more likely you are to run into weak inferences.
In tests like these, this shows up as mistakes.
If it's "fully functionally retarded", then humans are even more so, because you couldn't come anywhere close to running the recursive algorithm required to compute the 1023 steps.
LLMs are not general purpose co
Re: In other words, (Score:2)
It demonstrates the model doesn't have any conceptualization of what it's doing. It's not reasoning in any real way. It's regurgitating. Very fast. But still completely retarded.
Re: (Score:2)
It demonstrates the model doesn't have any conceptualization of what it's doing. It's not reasoning in any real way. It's regurgitating. Very fast. But still completely retarded.
Perhaps I wasn't clear- it doesn't demonstrate anything that isn't equally demonstrated for a person.
You also cannot solve a 10 ring Towers of Hanoi.
Is it because you're completely retarded?
Re: In other words, (Score:2)
I did actually just run through a digital version of the problem. It took me 15 minutes to get 1-6 moved. I'm absolutely confident I can do the whole thing.
Re: (Score:2)
Re: In other words, (Score:2)
That increased space is what I realized. That's why I stopped. But following the algorithm isn't that hard. It's the equivalent of like 4 lines of code.
Re: (Score:2)
This kind of problem space isn't one that LLMs are particularly good at- it requires far too much state.
Refactoring the problem agentically, I'm quite certain even a small LLM could solve the problem easily. But as a 1-shot? No- the context is just too big for these large-scale problems.
Imagine now that you were doing all of this in your head, and after all of that, you had to pro
Re: In other words, (Score:2)
Mistakes are REALLY hard to make. Have you actually tried it? The smallest piece keeps moving the same direction every other move, like clockwork. And every other move is the only legal move. Now maybe with a physical set, I might accidentally put a larger disc onto a smaller one, but in a digital setting that disallows illegal moves, this is essentially mindless. You'd have to be pretty high to make a notable mistake.
Re: In other words, (Score:2)
Also, this problem requires very little state. The equivalent of one bit (did I just move the smallest piece).
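That observation can be checked directly: the smallest disc cycles in a fixed direction on every odd-numbered move, and every even-numbered move is forced. A sketch of that iterative procedure (my own, assuming disc 1 is the smallest and pegs are numbered 0-2):

    def hanoi_iter(n):
        # Peg 0 starts with discs n..1 (end of the list is the top of the peg).
        pegs = [list(range(n, 0, -1)), [], []]
        step = 1 if n % 2 == 0 else -1        # direction the smallest disc cycles in
        small = 0                             # peg currently holding disc 1
        moves = []
        while len(pegs[2]) < n:
            # Odd-numbered move: shift the smallest disc one peg in its fixed direction.
            dst = (small + step) % 3
            pegs[dst].append(pegs[small].pop())
            moves.append((small, dst))
            small = dst
            if len(pegs[2]) == n:
                break
            # Even-numbered move: the only legal move not involving the smallest disc.
            a, b = [p for p in range(3) if p != small]
            if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
                a, b = b, a                   # always move the smaller exposed disc
            pegs[b].append(pegs[a].pop())
            moves.append((a, b))
        return moves

    print(len(hanoi_iter(10)))                # 1023

Beyond the board itself, the only control state is whether the last move touched the smallest disc -- the "one bit" the parent describes.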
Re: (Score:3)
Since the LLM's context cannot be arbitrarily written to, only appended to, it must essentially record the entire configuration space of the towers as well as the couple of variables for the algorithm (Step #, even? odd? IIRC). In practice, I was only able to do 6 disks before I ran out of context on my local models. Performance degraded seriously at 5.
As I said, using an agentic system where you managed the board and con
Re: In other words, (Score:2)
I feel like you might just not be understanding how easy this is to solve by hand. Are you trying to use a recursive algorithm? This is actually very trivial. If you think it's hard, I gotta be honest: you either don't understand the problem or you're kind of dumb.
Re: (Score:2)
Forever. Because you can't fucking do it.
Solving it with a physical or digital representation of the puzzle is simply a matter of repeating a small set of rules. Nobody is contesting that.
That is not what was done. What was done was something you couldn't hope to do in your entire lifetime- solve 9 disks entirely from memory with no representation of the puzzle other than what's in your head.
I feel like you might actually be too dumb to put the pieces together
It makes sense. (Score:5, Interesting)
Complex puzzles require deep reasoning.
As humans, we are programmed to use our brains and multi-paradigm experience to quickly trim down the decision tree of obviously-wrong solutions. As we go down the complexity depth, we prune more silly solutions and just refine the end outcome; we become better at homing in on the solution.
AI models are different in this regard. They are just statistical probability machines. The greater the complexity depth, the more variables they need to consider in the equation, and without actual intelligence and perception of the problem, they are fundamentally unable to accurately and efficiently discard obviously wrong solutions; they become paralysed, requiring more and more computational power with no guarantee of a good outcome.
Re: (Score:2)
So what you're saying is the success of Alpha-Beta pruning depends on the evaluation function. Yeah, that's correct. Getting the evaluation function, though, can be a real problem.
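For readers who haven't seen it spelled out: alpha-beta is just minimax with cutoffs, and all the game knowledge lives in the leaf evaluation. A generic sketch (eval_fn and children are placeholders you would supply for a concrete game):

    def alphabeta(state, depth, alpha, beta, maximizing, eval_fn, children):
        # Minimax with alpha-beta cutoffs; all game knowledge lives in eval_fn.
        kids = children(state)
        if depth == 0 or not kids:
            return eval_fn(state)
        if maximizing:
            best = float("-inf")
            for child in kids:
                best = max(best, alphabeta(child, depth - 1, alpha, beta, False, eval_fn, children))
                alpha = max(alpha, best)
                if alpha >= beta:             # the minimizing player will avoid this branch
                    break
            return best
        best = float("inf")
        for child in kids:
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True, eval_fn, children))
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best

Call it as alphabeta(root, depth, float("-inf"), float("inf"), True, eval_fn, children). With a weak eval_fn, the cutoffs prune almost nothing useful and the search blows up, which is exactly the point about evaluation being the hard part.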
Re: (Score:2)
Re: (Score:2)
Complex puzzles require deep reasoning.
True in spirit, but misleading in implication. "Deep reasoning" isn’t synonymous with explicit, stepwise logic. Much of human problem-solving relies on heuristics, pattern recognition, and compressed experience. We often simulate solutions rather than derive them. The complexity of a puzzle doesn’t necessarily demand conscious logic—it demands a good internal model that can make the right inferences efficiently, which is a broader and deeper capability than just reasoning.
As humans, we are programmed to use our brains and multi-paradigm experience to quickly trim down the decision tree of obviously-wrong solutions.
That’s not
And all of these are above the human baseline (Score:2, Interesting)
Re:And all of these are above the human baseline (Score:4, Insightful)
I think it's the opposite. These were straight up reasoning problems. No complex maths involved.
i.e.: when the LLMs have no templates to paste from, they go random.
Re: (Score:2)
I'd pay money to see a human make all 1023 required moves correctly.
To call such a puzzle "tricky" is the understatement of the decade.
Re: (Score:2)
It is worth noting that even the easiest puzzles here are ones that many, if not most, humans cannot solve.
At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.
Yea all you have to do is follow the algorithm in your head, no counting on fingers or paper and pencil. Easy peasy.
Re: (Score:1)
Fascinating. I'd wager you couldn't successfully solve 7 disks, even with the algorithm in front of you.
Re: (Score:2)
The overlap between the dumbest humans and the smartest bears has caused a number of issues for park rangers. Look at garbage cans: How do you create something hard enough to get into, a bear can't figure it out, but a human can?
Some jobs are safe! (Score:1)
Awesome! Professional puzzle solvers won't be collecting unemployment in the short term. The bad news is that unemployment will likely be broke when toasters can solve puzzles 5 or 10 years later...
Re: (Score:2)
My title is Linux Systems Administrator, but much of the time I feel like it's Professional Puzzle Solver*. Does that count?
*although most of the time I feel more like it's either Nagger In Chief or Everybody's Gofer.
Re: (Score:2)
I'm pretty much a Digital Garbage Collector.
Bring out yer dead...
human confidence level high. (Score:1)
The first sentient AI (Score:2)
will be able to tell us, no need to give it tests, talk with it for a few minutes and you'd know if it's from Florida or not
Seems to fall apart above 200 LOC (Score:2)
LLMs are really good at stuff, better and faster than humans, as long as the complexity isn't much more than ~200 LOC (lines of code). At 250-300 LOC things start falling apart quickly. Sometimes (1 in 50) you'll get lucky and it'll pop out 400 LOC without major errors, but that seems to be the absolute limit of the current statistical model family everyone is using.
LLMs are really good at analyzing and summarizing text, though; they have no problem analyzing 20-30 page PDFs of economic or financial data.
If you can't beat 'em, berate 'em (Score:2)
They can't compete, so they are doing the next best thing -- trying to convince everyone that the other companies aren't as good as the buzz makes them out to be.
So... (Score:2)
Good but insufficient (Score:2)
I've mentioned this before, but I had Gemini, ChatGPT, and Claude jointly design me an aircraft, along with its engines. The sheer intricacy and complexity of the problem is such that it can take engineers years to get to what all three AIs agree is a good design. Grok took a look at as much as it could, before running out of space, and agreed it was sound.
Basically, I gave an initial starting point (a historic aircraft) and had each in turn fix issues with the previous version, until all three agreed on co
Re: (Score:2)
What do you mean "designed an aircraft"? What sort of level of granularity are we talking about?
Re: (Score:2)
The spec it came up with includes: which specific material is used for which specific component, additional components to handle cases where there's chemically incompatible or thermally incompatible materials in proximity, what temperature regulation is needed where (and how), placement of sensors, pressure restrictions, details of computer network security, the design of the computers, network protocols, network topology, design modifications needed to pre-existing designs - it's impressively detailed.
I've
Reduced computational effort (Score:2)
Most striking was the counterintuitive finding that reasoning models actually reduced their computational effort as problems became more difficult, despite operating well below their token generation limits.
"Reasoning is tough" -- AI Talk Barbie.
"Thinking too much gives you wrinkles." -- Malibu Stacy
Not puzzling (Score:2)
... even when we provide the algorithm in the prompt—so that the model only needs to execute the prescribed steps—performance does not improve, and the observed collapse still occurs at roughly the same point. This is noteworthy because finding and devising a solution should require substantially more computation (e.g., for search and verification) than merely executing a given algorithm.
When one considers that these are generators and not program-execution engines, this becomes non-puzzling. I see it all the time with these engines when it comes to complex math. They can generate a correct Python program to solve a problem and then proceed to generate completely bogus output -- for instance, for a vector output, the only thing correct being the length of the vector. Because the LLM does not have the capacity to actu
Re: (Score:2)
If anyone ever does allow these engines to actually execute the code they generate, the horrible math guessing could be fixed, but the cost to allow this is going to be very high.
This exists now -- via various tools. They generally call it "Agentic AI" or some other marketing term, but it's running the LLM-generated code and giving the model the feedback (or using it).
It's fully integrated into several pipelines with varying levels of sandboxing/safety.
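A rough sketch of what that loop looks like underneath (call_llm is a stand-in, not any particular vendor's API, and a subprocess timeout is far weaker sandboxing than real pipelines use):

    import subprocess, sys, tempfile

    def call_llm(prompt):
        # Stand-in for whatever model API the pipeline actually uses.
        raise NotImplementedError

    def run_generated_code(source, timeout=10):
        # Write the model's Python to a temp file and run it in a separate process.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout, result.stderr

    def solve_with_tool(problem):
        source = call_llm("Write a Python program that prints the answer to: " + problem)
        out, err = run_generated_code(source)
        # Feed the program's real output back so the model reports it instead of guessing.
        return call_llm("The program printed:\n" + out + "\nErrors:\n" + err +
                        "\nState the final answer.")

The key point is that the numbers come from actually running the code, not from the model continuing to generate plausible-looking digits.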