Introduction
One of the hottest debates in AI right now is about reasoning: can artificial intelligence truly reason, or is it just a fancy autocomplete machine?
Apple recently released a study, “The Illusion of Thinking,” arguing that the apparent reasoning of large language models collapses as problem complexity grows—and concluding that these models are simply not reasoning. Anthropic, the company behind Claude, disagreed. It responded with a paper titled “The Illusion of the Illusion of Thinking,” asserting that Apple’s tests were flawed by design.
This Apple vs. Anthropic debate is more than a scientific quarrel – it is about how we assess machine intelligence.
Apple’s Claim: AI “Thinking” Is Nothing More Than Autocomplete
Apple ran logic puzzles through some of the most advanced AI reasoning models:
- On easy puzzles, performance was excellent.
- On more complicated ones, such as the Tower of Hanoi with 10 discs (which requires over 1,000 moves), accuracy collapsed.
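The “over 1,000 moves” figure follows directly from the puzzle’s arithmetic: an n-disc Tower of Hanoi needs at least 2^n − 1 moves, so the required move list grows exponentially. A quick sketch:

```python
# Minimal number of moves for an n-disc Tower of Hanoi is 2**n - 1,
# so the move list grows exponentially with the number of discs.
for n in (3, 10, 15):
    print(n, "discs ->", 2**n - 1, "moves")
# 10 discs -> 1023 moves, which is the "over 1,000 moves" in Apple's test.
```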
Apple concluded that AI reasoning does not really exist; instead, the models behave like a sophisticated autocomplete that does not hold up under pressure.
Anthropic’s Counterclaim: Wrong Test, Wrong Conclusion
Anthropic reacted by calling Apple’s results misleading. Their most relevant points are:
- Apple wasn’t measuring reasoning; it was measuring how long an AI can type.
- Models have token limits. When they hit the cap, they often abbreviated responses like: “The pattern continues, but I’ll stop here.”
- Apple counted these as failures—even though the AI had demonstrated it understood the logic.
Basically, Apple mistook technical constraints for reasoning failure.
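To see how an output cap, rather than logic, can cut a transcript short, here is a back-of-the-envelope sketch. The tokens-per-move figure is an illustrative assumption, not a measured value:

```python
# Illustrative only: assume each written-out move ("move disc 3 from A to C")
# costs roughly 5 output tokens. The exact figure varies by tokenizer.
TOKENS_PER_MOVE = 5                  # assumed for illustration
for discs in (10, 12, 15):
    moves = 2**discs - 1             # minimal Tower of Hanoi solution length
    print(discs, "discs:", moves * TOKENS_PER_MOVE, "tokens to type out")
```

Even under this rough estimate, larger instances demand more output tokens than many models are allowed to emit—so a truncated answer says nothing about whether the model grasped the algorithm.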
Flawed Puzzles, Flawed Judgments
Anthropic also pointed out that some of the problems in Apple’s tests were mathematically impossible.
- Example: asking six people to cross a river in a three-person boat with safety constraints that made the task unsolvable.
- When an AI correctly answered that the puzzle could not be solved, Apple scored it as wrong.
Under Apple’s setup, models were punished not for flawed reasoning but for giving the right answer.
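The impossibility here is a classical result (the “jealous couples” river crossing: with a three-person boat, six or more actor–agent pairs cannot cross). The sketch below does not reproduce Apple’s exact benchmark; it is a brute-force search under the assumed safety rule that an actor may never share a bank (or, by assumption, the boat) with another agent unless their own agent is present:

```python
from collections import deque
from itertools import combinations

def solvable(n, capacity):
    # People 0..n-1 are agents; n..2n-1 are the matching actors.
    everyone = frozenset(range(2 * n))

    def safe(group):
        # Safe if no agents are present, or every actor in the group
        # is accompanied by their own agent.
        agents = {p for p in group if p < n}
        actors = {p - n for p in group if p >= n}
        return not agents or actors <= agents

    start = (everyone, 0)              # (people on left bank, boat side)
    seen = {start}
    queue = deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:
            return True                # everyone reached the right bank
        here = left if boat == 0 else everyone - left
        for size in range(1, capacity + 1):
            for crew in combinations(here, size):
                crew = frozenset(crew)
                if not safe(crew):     # assumed: rule applies in the boat too
                    continue
                new_left = left - crew if boat == 0 else left | crew
                if not (safe(new_left) and safe(everyone - new_left)):
                    continue
                state = (new_left, 1 - boat)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False                       # search exhausted: unsolvable

print(solvable(3, 2), solvable(6, 3))
```

Exhaustive search confirms the classical result: three pairs cross with a two-person boat, but six pairs with a three-person boat cannot—so “this puzzle cannot be solved” is the correct answer.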
Anthropic’s Test: Code Over Words
To address this, Anthropic took a different approach:
- Instead of making the AI type out thousands of moves, they asked it to write code that solves the puzzle.
- Under this setup, the same models that Apple said had “collapsed” scored almost perfectly.
The conclusion: the problem lay not with AI’s ability to reason, but with the evaluation process.
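As an illustration of the idea (not Anthropic’s actual prompt): instead of typing 1,023 moves, a model can express the entire 10-disc solution as a few lines of recursive code.

```python
def hanoi(n, src="A", dst="C", aux="B"):
    """Optimal move list for n discs: move n-1 discs aside,
    move the largest disc, then stack the n-1 discs back on top."""
    if n == 0:
        return []
    return (hanoi(n - 1, src, aux, dst)
            + [(src, dst)]
            + hanoi(n - 1, aux, dst, src))

moves = hanoi(10)
print(len(moves))   # 1023 moves, produced without typing them out
```

The program demonstrates the same algorithmic understanding as the full move listing, but in a form that never collides with an output-token limit.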
Why This Debate Matters
This debate touches a central question in AI research: how do we measure reasoning?
- Apple’s stance: AI thinking is fake; it is just autocomplete with no deeper logic.
- Anthropic’s standpoint: flawed tests misrepresent AI reasoning capacities.
For AI researchers, investors, and strategists, this goes beyond theory – it comes down to designing fair benchmarks. If our tests cannot differentiate between reasoning and typing, we may underestimate what these systems are really capable of.
Contextual Perspectives from Recent Research
To put the Apple vs. Anthropic clash in context, newer studies suggest that “reasoning collapse” can be a testing artifact rather than a fundamental limitation. A July 2025 paper finds that when models are allowed to use external tools—like a writable scratchpad or a Python interpreter—they consistently outperform standard LLMs across puzzle complexity levels, including tasks similar to Tower of Hanoi and River Crossing. In other words, once you remove output-window bottlenecks and let models compute (rather than type out thousands of steps), performance rebounds sharply.
This dovetails with Anthropic’s observation that many “failures” traced back to rigid evaluation choices: fixed token limits, penalizing concise continuations (e.g., “the pattern continues”), and even counting mathematically impossible puzzles as incorrect when models flagged them as unsolvable. Together, these findings argue that breakdowns often reflect benchmark design—not an inability to reason.
What This Means for AI Development
As models get more capable, how we measure reasoning matters as much as the architectures themselves. For builders and decision-makers reading AIIndexes.com, the takeaway is practical:
- Prefer tool-augmented evaluations (code generation, scratchpads, solvers) over “type-every-step” formats.
- Distinguish between reasoning failure and output/formatting constraints (token windows, enforced step listings).
- Scrutinize benchmarks for unsatisfiable constraints before turning scores into strategy.
Bottom line: AI reasoning isn’t “dead”—but some of our tests might be. Design evaluations that let systems think with tools, not just type within limits.
Also read – https://aiindexes.com/google-ai-overviews/