Published under the title The Illusion of Thinking, the paper from Apple’s machine learning team provides a comprehensive critique of reasoning-enhanced large language models (LLMs). These are versions of AI that attempt to solve problems by simulating multi-step logic. But when put through their paces, the study suggests, these models often collapse under pressure.
Thinking Isn’t Always Smarter
Apple’s researchers tested various models on puzzle-based tasks, where the difficulty could be precisely increased. The puzzles were chosen because they demand logical thinking and can’t be solved simply by guessing what “looks right.”
Performance was grouped into three categories:
- Simple Tasks: Basic models, without added reasoning chains, often performed better. They solved easier problems faster and more accurately.
- Intermediate Tasks: This is the sweet spot for reasoning models. They showed a clear improvement over standard LLMs, but only within this narrow band of difficulty.
- Complex Tasks: Once tasks became genuinely difficult, all models failed. Crucially, the ones built to reason did not just struggle; they regressed, producing weaker output and fewer steps of thought.
What stood out most was the unexpected drop in performance as complexity increased. Rather than ramping up their reasoning to meet the challenge, models began to offer less reasoning altogether. This suggests that their thinking ability may be more cosmetic than structural.
A Closer Look at the Cracks
In one test, models were asked to solve the Tower of Hanoi puzzle—an old logic game familiar to anyone who’s studied basic computer science. With just a few discs, most models performed fine. But as the number of discs increased, success rates plunged. Even when fed the exact algorithm, models couldn’t reliably carry it out.
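For context, the Tower of Hanoi has a well-known recursive solution that a few lines of code can carry out perfectly. The sketch below is a minimal Python version of that standard algorithm, not the exact prompt Apple used, and it also shows why the puzzle is a convenient dial for difficulty: the number of required moves grows exponentially with the number of discs.

```python
def hanoi(n, source, target, spare, moves):
    """Append the moves that transfer n discs from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the smaller discs out of the way
    moves.append((source, target))              # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller discs on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves for 3 discs; in general 2**n - 1
```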
The issue wasn’t memory or compute limits. These AI systems had room to process more steps. They simply didn’t. In many cases, they stopped reasoning halfway through, abandoned logic midstream, or returned to earlier incorrect ideas. Even a slight change in the way a question was worded could break their performance.
This behaviour hints at a deeper limitation. Rather than working through a problem logically, as humans might, these models appear to rely on surface-level pattern matching. The reasoning traces they output may look convincing—but Apple’s paper found they often have little to do with how the final answer is reached.
Models That Overthink and Underdeliver
One striking insight was how often models veered off course after finding a correct solution. Instead of stopping there, they kept going—sometimes reversing their own correct steps. This phenomenon, known as "overthinking," resulted in answers being buried under layers of unnecessary logic.
Researchers noted a sharp divide depending on problem complexity. In easier puzzles, models often solved the task early, then talked themselves out of it. In mid-range puzzles, they wandered through trial and error before arriving at a partial answer. On the hardest tasks, correct solutions vanished entirely.
In short, the more you ask these systems to think, the less helpful their thinking becomes.
Not a Death Knell—But a Warning Bell
Apple is not calling for the end of reasoning models. Instead, the report urges a rethink about what these systems actually do. It also warns developers and businesses not to assume that “more thought” from a model automatically equals better results.
“Reasoning models aren’t useless,” one senior researcher said, “but their strengths are very specific—and their weaknesses are too often overlooked.”
This message lands at a sensitive moment. As rivals like Google, Microsoft and OpenAI rush to build general-purpose AI systems that reason, plan and even argue, Apple is taking a different path. Its AI focus remains grounded in efficient, privacy-first features that run on users’ devices.
The company’s new Apple Intelligence tools—rolled into iOS 18 and macOS Sequoia—emphasise helpfulness over generality. Features like writing suggestions, notification summaries and image generation aim to make day-to-day tasks smoother, not to mimic human reasoning.
Between Promise and Premature Praise
While some in the AI community have welcomed Apple’s study as a much-needed reality check, others see it as overly cautious. Critics point out that reasoning models are still in their infancy and argue that dismissing their progress so early could limit innovation.
But Apple’s data is hard to ignore. The company’s researchers did not rely on abstract benchmarks or vague tasks. They built controlled environments, adjusted puzzle complexity step by step, and measured not only answers but the logic behind them.
The result is a study that asks one key question: if today’s AI still stumbles on problems that children—or even simple algorithms—can solve, how close are we really to machines that think?
For now, it seems, the illusion of reasoning may be just that—an illusion.
Image: DIW-Aigen