Apple Researchers Challenge AI Reasoning Claims With Controlled Puzzle Tests

Apple researchers have found that state-of-the-art "reasoning" AI models like OpenAI's o3-mini, Gemini (with thinking mode enabled), Claude 3.7, and DeepSeek-R1 face complete performance collapse [PDF] beyond certain complexity thresholds when tested in controllable puzzle environments. The finding raises questions about the true reasoning capabilities of large language models.

The study, which examined models using Tower of Hanoi, checker jumping, river crossing, and blocks world puzzles rather than standard mathematical benchmarks, found three distinct performance regimes that contradict conventional assumptions about AI reasoning progress.

At low complexity levels, standard language models surprisingly outperformed their reasoning-enhanced counterparts while using fewer computational resources. At medium complexity, reasoning models demonstrated advantages, but both model types experienced complete accuracy collapse at high complexity levels. Most striking was the counterintuitive finding that reasoning models actually reduced their computational effort as problems became more difficult, despite operating well below their token generation limits.

Even when researchers provided explicit solution algorithms, requiring only step-by-step execution rather than creative problem-solving, the models' performance failed to improve significantly. The researchers noted fundamental inconsistencies in how models applied learned strategies across different problem scales, with some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios.


Comments Filter:
  • We are nowhere near addressing AI alignment; this means that humanity still has time to find a solution.
  • Apple's new paper on GSM-Symbolic shows that today's best language models crumble when a grade-school math word problem is re-phrased -- even if the logic is identical. It echoes 1969, when Minsky & Papert proved that a single-layer perceptron could never learn XOR.

    That blockade vanished in 1986 with backprop and nonlinear hidden layers. My bet: LLMs won’t need two decades to cross the reasoning gap. Why? Agents that call scratchpad Python or GraphRAG pipelines already externalize formal r

    • by Gilmoure ( 18428 )

      What does "turning the model into a planner rather than a prover" mean?

      • By "prover", I meant that Agentic AI is not a single-shot execution engine, like LLMs of today or theorem provers of yore. By "planner" I meant externalizing the logic/reasoning. Perhaps I was aiming too much for alliteration.
        • by Gilmoure ( 18428 )

          Gotcha, thanks!

          I haven't kept up with actual neural network research and terminology; just the popular AI buzzwords.

  • it's a crap shoot:

    some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios

    So let's make something that we don't fully understand, whose modus operandi doubles as emergent behaviour, and then start relying on it for activities ranging from education to infrastructure. Sounds like a great idea!

    • A more typical use case for now would be using AI to generate some code, and then testing/fixing the code. Not running the AI every time to solve an instance of the problem.

      Side note, I wonder if this paper compared AI performance to human performance. You think people can do Tower of Hanoi consistently?

      • by HiThere ( 15173 )

        How tall? Most people can do 3-ring towers consistently. I don't know anyone who can do eight rings. (I've also seen versions with 4 pegs, but I don't know what that does to the math.)

        • The model performance really started collapsing around 10.
          Up to then, they had what I would call "far-fucking-better-than-most-humans" performance.
          • Isn't Tower of Hanoi a deterministic problem? Can't you just follow a simple algorithm to solve any ToH problem? I thought the "game" was to try to do it in as few moves as possible. But with infinite search time, I thought it was a pretty straightforward problem.

            • It indeed is, and yes, you can.
              The problem is the minimum number of moves required for 10 discs, which is 1023.
              The difficulty is not making a mistake, or noticing when you do in time to avoid having to roll back.

              It's a perfectly solvable problem for a computer. For humans, the more discs you have, the harder it gets to do.
              I'm not sure there's a person alive who has ever solved a 10-disk Towers of Hanoi.

              Humans aren't particularly good computers.
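
              For reference, the "simple algorithm" discussed above is the textbook recursion; a minimal Python sketch (function and variable names here are purely illustrative) looks like this:

                  def hanoi(n, source, target, spare, moves):
                      """Move n discs from source to target using one spare peg."""
                      if n == 0:
                          return
                      hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller discs on the spare peg
                      moves.append((source, target))              # move the largest remaining disc
                      hanoi(n - 1, spare, target, source, moves)  # stack the smaller discs back on top

                  moves = []
                  hanoi(10, "A", "C", "B", moves)
                  print(len(moves))  # 2**10 - 1 = 1023, the minimum move count cited above

              The recursion never makes a wrong move, which is exactly the discipline that is hard for a human to sustain across a thousand-plus steps.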
  • It makes sense. (Score:5, Interesting)

    by devslash0 ( 4203435 ) on Monday June 09, 2025 @10:26AM (#65437475)

    Complex puzzles require deep reasoning.

    As humans, we are programmed to use our brains and multi-paradigm experience to quickly trim down the decision tree of obviously-wrong solutions. As we go down the complexity depth, we prune more silly solutions and just refine the end outcome; we become better at homing in on the solution.

    AI models are different in this regard. They are just statistical probability machines. The greater the complexity depth, the more variables they need to consider in the equation, and without actual intelligence and perception of the problem, they are fundamentally unable to accurately and efficiently discriminate against obviously wrong solutions; they become paralysed and require more and more computational power with no guarantee of a good outcome.

    • by HiThere ( 15173 )

      So what you're saying is that the success of Alpha-Beta pruning depends on the evaluation function. Yeah, that's correct. Getting the evaluation function, though, can be a real problem.
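
      For context, a generic alpha-beta search is sketched below in Python; evaluate and children are placeholder callbacks, and the quality of the pruning hinges entirely on that evaluate function:

          import math

          def alphabeta(state, depth, alpha, beta, maximizing, evaluate, children):
              """Minimax with alpha-beta pruning; evaluate() scores leaf positions."""
              kids = children(state)
              if depth == 0 or not kids:
                  return evaluate(state)  # everything downstream depends on this heuristic
              if maximizing:
                  value = -math.inf
                  for child in kids:
                      value = max(value, alphabeta(child, depth - 1, alpha, beta, False, evaluate, children))
                      alpha = max(alpha, value)
                      if alpha >= beta:
                          break  # beta cutoff: the minimizing player will never allow this branch
                  return value
              value = math.inf
              for child in kids:
                  value = min(value, alphabeta(child, depth - 1, alpha, beta, True, evaluate, children))
                  beta = min(beta, value)
                  if beta <= alpha:
                      break  # alpha cutoff: the maximizing player already has a better option
              return value

      A good evaluate() ranks positions well and triggers early cutoffs; a poor one prunes almost nothing, which is the "real problem" the parent comment points at.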

  • It is worth noting that even the easiest puzzles here are puzzles which many, if not most, humans cannot solve. The fact that we're now evaluating AI reasoning based on puzzles above the human baseline should itself be pretty alarming. But instead we've moved the goalposts and are reassuring ourselves that the AIs cannot easily solve genuinely tricky puzzles.
    • by evanh ( 627108 )

      I think it's the opposite. These were straight-up reasoning problems. No complex maths involved.

      i.e.: When the LLMs have no templates to paste from, they go random.

      • No. The models successfully solved Towers of Hanoi up to ~10 disks.
        I'd pay money to see a human make all 1023 required moves correctly.

        To call such a puzzle "tricky" is the understatement of the decade.
    • It is worth noting that even the easiest puzzles here are puzzles which many, if not most, humans cannot solve.

      At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.

      • The large majority of humans in the US put someone in office who has no interest in keeping his story straight. I think we all need to re-evaluate our expectations of the general world population.
      • At one point they gave it the algorithm and it still failed. My guess is that the large majority of humans could do it if you showed them how.

        Yeah, all you have to do is follow the algorithm in your head, no counting on fingers or paper and pencil. Easy peasy.

      • You think a large number of humans could solve a 10-disk Towers of Hanoi? All 1023 moves, without mistake?

        Fascinating. I'd wager you couldn't successfully solve 7 disks, even with the algorithm in front of you.
  • Awesome! Professional puzzle solvers won't be collecting unemployment in the short term. The bad news is that unemployment will likely be broke when toasters can solve puzzles 5 or 10 years later...

    • My title is Linux Systems Administrator, but much of the time I feel like it's Professional Puzzle Solver*. Does that count?

      *although most of the time I feel more like it's either Nagger In Chief or Everybody's Gofer.

  • I failed puzzles because I don't waste my time on them. I like real-life puzzles. Maybe the AIs are just bored.
  • will be able to tell us, no need to give it tests, talk with it for a few minutes and you'd know if it's from Florida or not

  • LLMs are really good at stuff, better and faster than humans, as long as the complexity isn't much more than ~200 LOC (lines of code). At 250-300 LOC, things start falling apart quickly. Sometimes (1/50) you'll get lucky and it'll pop out 400 LOC without major errors, but that seems to be the absolute limit of the current statistical model family everyone is using.

    LLMs are really good at analyzing and summarizing text, though; they have no problem analyzing 20-30 page PDFs of economic or financial data.

  • They can't compete, so they are doing the next best thing - trying to prove to everyone that the buzz around the other companies isn't as big as it's made out to be.

  • So exactly like real humans? A lot of humans cannot pass those puzzles either. But then again, you expect AI to pass those puzzles easily. In contrast to humans, though, AI can actually get better, whereas we regular humans cannot. Yeah, there are and always will be Wunderkinder...
  • I've mentioned this before, but I had Gemini, ChatGPT, and Claude jointly design me an aircraft, along with its engines. The sheer intricacy and complexity of the problem is such that it can take engineers years to get to what all three AIs agree is a good design. Grok took a look at as much as it could, before running out of space, and agreed it was sound.

    Basically, I gave an initial starting point (a historic aircraft) and had each in turn fix issues with the previous version, until all three agreed on co
