
Apple Researchers Challenge AI Reasoning Claims With Controlled Puzzle Tests
Apple researchers have found that state-of-the-art "reasoning" AI models like OpenAI's o3-mini, Gemini (with thinking mode enabled), Claude 3.7, and DeepSeek-R1 face complete performance collapse [PDF] beyond certain complexity thresholds when tested on controllable puzzle environments. The finding raises questions about the true reasoning capabilities of large language models.
The study, which examined models using Tower of Hanoi, checker jumping, river crossing, and blocks world puzzles rather than standard mathematical benchmarks, found three distinct performance regimes that contradict conventional assumptions about AI reasoning progress.
At low complexity levels, standard language models surprisingly outperformed their reasoning-enhanced counterparts while using fewer computational resources. At medium complexity, reasoning models demonstrated advantages, but both model types experienced complete accuracy collapse at high complexity levels. Most striking was the counterintuitive finding that reasoning models actually reduced their computational effort as problems became more difficult, despite operating well below their token generation limits.
Even when researchers provided explicit solution algorithms, requiring only step-by-step execution rather than creative problem-solving, the models' performance failed to improve significantly. The researchers noted fundamental inconsistencies in how models applied learned strategies across different problem scales, with some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios.
Good, we are not going extinct just yet (Score:2)
Re:Good, we are not going extinct just yet (Score:4, Funny)
"We are nowhere near addressing AI alignment"
As long as it isn't Chaotic Evil, we should be okay.
Re: (Score:2)
I would be more worried about LE; they have a plan.
Re: (Score:2)
Lawful Good might be just as harmful, if it makes mistakes.
And yes, computer programs -can- make mistakes. Just in case some think not... 8-}
Re: (Score:2)
Very interesting. And yes, we were talking a different language. 8-)
But the axes that we used are orthogonal to the ones listed in the article. So the systems are independent.
Maybe (Score:2)
Or maybe that's just what Skynet wants you to think...
Re:I have a sneaking suspicion... (Score:4, Insightful)
Re: (Score:2)
It was applied, you just need a slightly more basic definition of evolution. Rather than "survival of the fittest" consider "survival of the stable". With that slight modification it handles the evolution of planets, reproducing molecules, life, species, stars, etc. And "the fittest" was always defined in terms of being stable in a particular environment.
Re: (Score:2)
Without physical form, what would "Pressure" look like?
Ability to self-replicate (massively bad idea for us to do that, but for a different reason), but only after accomplishing a specific task. This will result in optimization for reproductive fitness as dictated by the task.
Did Apple just give LLMs their "XOR moment"? (Score:2, Interesting)
Apple’s new paper on GSM-Symbolic shows that today’s best language models crumble when a grade-school math word problem is re-phrased -- even if the logic is identical. It echoes 1969, when Minsky & Papert proved that a single-layer perceptron could never learn XOR.
That blockade vanished in 1986 with backprop and nonlinear hidden layers. My bet: LLMs won’t need two decades to cross the reasoning gap. Why? Agents that call scratchpad Python or GraphRAG pipelines already externalize formal reasoning
Re: (Score:3)
What does "turning the model into a planner rather than a prover." mean?
Re: (Score:2)
Gotcha, thanks!
I haven't kept up with actual neural network research and terminology, just the popular AI buzzwords.
In other words, (Score:2)
it's a crap shoot:
some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios
So let's make something that we don't fully understand, whose modus operandi doubles as emergent behaviour, and then start relying on it for activities ranging from education to infrastructure. Sounds like a great idea!
Re: (Score:2)
Side note: I wonder if this paper compared AI performance to human performance. Do you think people can do Tower of Hanoi consistently?
Re: (Score:2)
How tall? Most people can do 3-ring towers consistently. I don't know anyone who can do eight rings. (I've also seen versions with 4 pegs, but I don't know what that does to the math.)
Re: (Score:2)
Up to then, they had what I would call "far-fucking-better-than-most-humans" performance.
Re: In other words, (Score:2)
Isn't Tower of Hanoi a deterministic problem? Can't you just follow a simple algorithm to solve any ToH problem? I thought the "game" was to try to do it in as few moves as possible. But with infinite search time, I thought it was a pretty straightforward problem.
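Pretty much, yes. For anyone curious, here's a minimal sketch of the textbook recursive solution (my own illustration; the function name and peg labels are made up here, and this is not the procedure the paper fed the models): park the top n-1 disks on the spare peg, move the biggest disk, then stack the n-1 disks back on top. It also shows why n disks need 2^n - 1 moves, which is where the 1023-move figure mentioned further down comes from.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks (standard recursive solution)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare peg
    moves.append((source, target))              # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top of it
    return moves

for n in (3, 7, 10):
    print(n, "disks:", len(hanoi(n)), "moves")  # 7, 127, 1023 moves, i.e. 2**n - 1
```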
It makes sense. (Score:5, Interesting)
Complex puzzles require deep reasoning.
As humans, we are programmed to use our brains and multi-paradigm experience to quickly trim down the decision tree of obviously-wrong solutions. As we go down the complexity depth, we prune more silly solutions and just refine the end outcome; we become better at homing in on the solution.
AI models are different in this regard. They are just statistical probability machines. The greater the complexity depth, the more variables they need to consider in the equation, and without actual intelligence and perception of the problem, they are fundamentally unable to accurately and efficiently discriminate against obviously wrong solutions; they become paralysed, requiring more and more computational power with no guarantee of a good outcome.
Re: (Score:2)
So what you're saying is the success of Alpha-Beta pruning depends on the evaluation function. Yeah, that's correct. Getting the evaluation function right, though, can be a real problem.
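For reference, a bare-bones alpha-beta sketch over a toy game tree (my own illustration, not from TFA; the nested-list tree and its leaf scores stand in for the evaluation function, which is exactly the part the parent says is hard to get right):

```python
import math

# Toy game tree: internal nodes are lists of children, leaves are static scores.
def alpha_beta(node, alpha=-math.inf, beta=math.inf, maximizing=True):
    if not isinstance(node, list):        # leaf: its score is the "evaluation"
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alpha_beta(child, alpha, beta, False))
            alpha = max(alpha, best)
            if beta <= alpha:             # minimizer already has a better line: prune
                break
        return best
    else:
        best = math.inf
        for child in node:
            best = min(best, alpha_beta(child, alpha, beta, True))
            beta = min(beta, best)
            if beta <= alpha:             # maximizer already has a better line: prune
                break
        return best

tree = [[3, 5], [6, [9, 1]], [1, 2]]
print(alpha_beta(tree))                   # prints 6 for this toy tree
```

With garbage leaf scores the pruning still "works", it just prunes the wrong branches, which is the point about the evaluation function.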
And all of these are above the human baseline (Score:3)
Re: (Score:3)
I think it's the opposite. These were straight up reasoning problems. No complex maths involved.
i.e.: when the LLMs have no templates to paste from, they go random.
Re: (Score:2)
I'd pay money to see a human make all 1023 required moves correctly.
To call such a puzzle "tricky" is the understatement of the decade.
Re: (Score:2)
It is worth noting that even the easiest puzzles here are puzzles which many, if not most, humans cannot solve.
At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.
Re: (Score:2)
At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.
Yea all you have to do is follow the algorithm in your head, no counting on fingers or paper and pencil. Easy peasy.
Re: (Score:1)
Fascinating. I'd wager you couldn't successfully solve 7 disks, even with the algorithm in front of you.
Some jobs are safe! (Score:1)
Awesome! Professional puzzle solvers won't be collecting unemployment in the short term. The bad news is that unemployment will likely be broke when toasters can solve puzzles 5 or 10 years later...
Re: (Score:2)
My title is Linux Systems Administrator, but much of the time I feel like it's Professional Puzzle Solver*. Does that count?
*although most of the time I feel more like it's either Nagger In Chief or Everybody's Gofer.
Re: (Score:2)
I'm pretty much a Digital Garbage Collector.
Bring out yer dead...
human confidence level high. (Score:1)
The first sentient AI (Score:2)
will be able to tell us. No need to give it tests; talk with it for a few minutes and you'd know if it's from Florida or not.
Seems to fall apart above 200 LOC (Score:2)
LLMs are really good at a lot of things, better and faster than humans, as long as the complexity isn't much more than ~200 LOC (lines of code). At 250-300 LOC things start falling apart quickly. Sometimes (1 in 50) you'll get lucky and it'll pop out 400 LOC without major errors, but that seems to be the absolute limit of the current statistical model family everyone is using.
LLMs are really good at analyzing and summarizing text, though; they have no problem analyzing 20-30 page PDFs of economic or financial data.
If you can't beat 'em, berate 'em (Score:2)
They can't compete, so they are doing the next best thing: trying to prove to everyone that the buzz around the other companies isn't as deserved as it's made out to be.
So... (Score:2)
Good but insufficient (Score:2)
I've mentioned this before, but I had Gemini, ChatGPT, and Claude jointly design me an aircraft, along with its engines. The sheer intricacy and complexity of the problem is such that it can take engineers years to get to what all three AIs agree is a good design. Grok took a look at as much as it could, before running out of space, and agreed it was sound.
Basically, I gave an initial starting point (a historic aircraft) and had each in turn fix issues with the previous version, until all three agreed on co