Apple’s AI Critique Faces Pushback Over Flawed Testing Methods

A recent research paper from Apple raised eyebrows in the AI community after suggesting that today’s most advanced language models fail dramatically when faced with complex reasoning tasks. But that conclusion is now being challenged, not because the tasks were too difficult, but because, critics argue, the experiments weren’t fairly designed to begin with.

Alex Lawsen, a researcher at Open Philanthropy, has responded with a counter-study questioning the foundations of Apple’s claims. His assessment, published this week, argues that the models under scrutiny (including Claude, Gemini, and OpenAI’s latest systems) weren’t breaking down due to cognitive limits. Instead, he says they were tripped up by evaluation methods that didn’t account for key technical constraints.

One of the main flashpoints in the debate is the Tower of Hanoi, a well-known puzzle often used to test logical reasoning. Apple’s paper reported that models consistently failed once the puzzle became more complex - typically at eight disks or more. But Lawsen points out a critical issue: the models weren’t failing to work out the solution. They were often simply stopping short of writing out the full answer because they were nearing their maximum token limit - a built-in cap on how much text they can produce in one response.

In several cases, the models even stated they were cutting themselves off to conserve output space. Rather than interpreting this as a practical limitation, Apple’s evaluation counted it as a failure to reason.
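To see why the output budget matters, consider the arithmetic: the optimal Tower of Hanoi solution for n disks takes 2^n - 1 moves, so the written-out transcript doubles with every added disk. The short sketch below illustrates the growth; the tokens-per-move figure is an assumption chosen for illustration, not a measured value.

```python
# Rough estimate of how quickly a fully written-out Hanoi solution grows.
# TOKENS_PER_MOVE is an assumed figure for illustration only.
TOKENS_PER_MOVE = 10  # e.g. the cost of writing "Move disk 3 from peg A to peg C."

for disks in (8, 10, 12, 15):
    moves = 2 ** disks - 1  # minimum number of moves for this many disks
    print(f"{disks} disks: {moves} moves, ~{moves * TOKENS_PER_MOVE} output tokens")
```

At that rate of doubling, a complete move list soon collides with the fixed output budgets the models were given, regardless of whether they understand the puzzle.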

A second issue arose in the so-called River Crossing test, where models are asked to solve a version of the Missionaries and Cannibals puzzle. Apple included setups that were mathematically unsolvable - for example, asking the model to ferry six or more actor-and-agent pairs across the river in a boat that could only carry three people at a time. When the models recognized that the task couldn’t be completed under the given rules and refused to attempt it, they were still marked wrong.
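That kind of dead end can be confirmed mechanically before any model is scored. Below is a minimal sketch (not Apple’s or Lawsen’s code) that brute-forces the puzzle’s state space with breadth-first search, assuming the constraint from the actor-and-agent framing: an actor may never share a bank with another actor’s agent unless their own agent is also present. Constraints inside the boat itself are ignored in this simplified version.

```python
from collections import deque
from itertools import combinations

def solvable(pairs: int, boat_capacity: int) -> bool:
    """Brute-force check: can every actor/agent pair reach the far bank?"""
    n_people = 2 * pairs  # indices 0..pairs-1 are actors, pairs..2*pairs-1 their agents

    def valid(positions) -> bool:
        for side in (0, 1):
            agents_here = {i for i in range(pairs) if positions[pairs + i] == side}
            for i in range(pairs):
                # An actor sharing a bank with other agents while their own is absent.
                if positions[i] == side and agents_here and i not in agents_here:
                    return False
        return True

    start = (tuple([0] * n_people), 0)  # everyone, and the boat, on the starting bank
    goal = tuple([1] * n_people)
    seen, queue = {start}, deque([start])

    while queue:
        positions, boat = queue.popleft()
        if positions == goal:
            return True
        here = [p for p in range(n_people) if positions[p] == boat]
        for size in range(1, boat_capacity + 1):  # the boat never travels empty
            for group in combinations(here, size):
                moved = list(positions)
                for p in group:
                    moved[p] = 1 - boat           # ferry this group across
                state = (tuple(moved), 1 - boat)
                if state not in seen and valid(state[0]):
                    seen.add(state)
                    queue.append(state)
    return False

print(solvable(3, 2))  # True: the classic three-pair puzzle can be solved
print(solvable(6, 3))  # False: six pairs with a three-seat boat has no valid solution
```

A check like this, run before grading, would flag the impossible instances instead of counting a model’s refusal as a reasoning failure.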

A third problem involved how Apple’s system judged the responses. It relied on automatic scripts to evaluate output strictly against full, exhaustive solutions. If a model produced a correct but partial answer (or took a strategic shortcut) it still received a failing score. No credit was given for recognizing patterns, applying recursive logic, or even identifying the task’s limitations.
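To make the scoring issue concrete, here is a hypothetical illustration (not Apple’s actual evaluation script) of the gap between all-or-nothing exact matching and giving credit for an answer that is correct up to the point where it was cut off.

```python
# Hypothetical contrast between strict exact-match scoring and prefix credit.
def strict_score(predicted, reference):
    """All-or-nothing: a truncated or abbreviated answer scores zero."""
    return 1.0 if predicted == reference else 0.0

def prefix_credit(predicted, reference):
    """Credit for the longest correct prefix, e.g. an answer stopped by a token limit."""
    correct = 0
    for pred, ref in zip(predicted, reference):
        if pred != ref:
            break
        correct += 1
    return correct / len(reference)

full = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C")]  # toy reference move list
cut_short = full[:3]                                     # correct so far, then truncated

print(strict_score(cut_short, full))   # 0.0
print(prefix_credit(cut_short, full))  # 0.75
```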

To illustrate how these issues can distort results, Lawsen ran a variation of the Hanoi test with a different prompt. Instead of asking the models to list every move, he instructed them to write a small program (in this case, a Lua function) that could solve the puzzle when executed. Freed from the burden of listing hundreds of steps, the models delivered accurate, scalable solutions, even with 15 disks - well beyond the point where Apple’s paper claimed they failed entirely.
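A minimal Python sketch of the kind of program Lawsen prompted for (his version asked for a Lua function) looks like the following; the point is that a few lines of recursion cover any disk count without spelling out the moves one by one.

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks using the standard recursion."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks onto the spare peg
    moves.append((source, target))              # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)  # bring the smaller disks back on top
    return moves

print(len(hanoi(15)))  # 32767 moves, i.e. 2**15 - 1
```

Executing it for 15 disks yields 32,767 moves - far more than a model could plausibly list verbatim in one response, even though the program that generates them is trivial to state.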

The implications go beyond academic nitpicking. Apple’s conclusions have already been cited by others as evidence that large AI models lack the kind of reasoning needed for more ambitious tasks. But if Lawsen’s analysis holds up, it suggests the story is more complicated. The models may struggle with long-form answers under tight output limits, but their ability to think through a problem algorithmically remains intact.

Of course, none of this means large reasoning models are problem-free. Even Lawsen acknowledges that designing systems that can reliably generalize across unfamiliar problems remains a long-term challenge. His paper calls for more careful experimental design: tests should check whether puzzles are actually solvable, track when models are being truncated by their token budgets, and accept solutions in multiple formats, from plain text to structured code.
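As a sketch of what the second recommendation could look like in practice, the snippet below records whether a response was cut off by its token budget before it is scored. It assumes an OpenAI-style chat API, where finish_reason is reported as "length" when the output cap is hit; the model name and token budget are placeholders.

```python
# Minimal sketch: flag truncated answers before scoring them as reasoning failures.
# Assumes an OpenAI-style chat API; model name and token budget are placeholders.
from openai import OpenAI

client = OpenAI()

def ask_with_truncation_flag(prompt: str, max_tokens: int = 4096) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    choice = resp.choices[0]
    return {
        "text": choice.message.content,
        "truncated": choice.finish_reason == "length",  # the output cap was hit
    }
```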

The debate boils down to a deeper question: are we really measuring how well machines think, or just how well they can type within a fixed character limit?

