Apple’s New Research Questions AI Reasoning Capabilities

Apple’s latest study shows that Large Reasoning Models struggle as tasks get more complex, exposing clear limits in their ability to reason. Despite sounding smart, these models often miss the mark when real thinking is required.

Apple & AI Reasoning


Key takeaways:

  • AI Fails Under Complexity: Even top-tier language models break down when tasks become more structurally complex, revealing limits in their reasoning depth.
  • More Reasoning Isn’t Always Better: Generating extra reasoning steps doesn’t guarantee smarter outcomes.
  • Benchmarks Are Evolving: New tools like ScholarBench and LogiEval are shifting the focus from answer accuracy to how well models think through problems.
  • Global Findings, Shared Concern: Researchers around the world are reaching the same conclusion: AI can sound smart, but it still struggles with real reasoning.

Apple’s machine learning division has released a worrying study on Large Reasoning Models (LRMs), a term for large language models built to work through problems with explicit, step-by-step reasoning before producing an answer.

In its study, titled “The Illusion of Thinking,” Apple introduced a custom-designed set of logic puzzles to evaluate how large language models approach reasoning tasks. These puzzles preserved consistent logical structures while varying in complexity, allowing researchers to observe not only final answers but also the intermediate reasoning steps used to reach them.
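To make that setup concrete, here is a minimal sketch of how a controlled-complexity evaluation of this kind could be wired up, using Tower of Hanoi as an illustrative puzzle whose size can be dialed up while its logical structure stays fixed. The puzzle choice, harness, and checks below are illustrative assumptions, not a reproduction of Apple’s actual test suite.

```python
# Illustrative sketch of a controlled-complexity puzzle harness (assumed setup,
# not the paper's code). Tower of Hanoi: the disk count n scales difficulty
# while the rules stay identical, and every intermediate move can be checked.

from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg)

def hanoi_solution(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Reference solution: 2**n - 1 moves; n controls task complexity."""
    if n == 0:
        return []
    return (hanoi_solution(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_solution(n - 1, aux, src, dst))

def validate_moves(n: int, moves: List[Move]) -> Tuple[bool, int]:
    """Replay a proposed move list; return (solved, index of first illegal move or -1).
    This checks the intermediate steps, not just the final answer."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for i, (a, b) in enumerate(moves):
        if not pegs[a]:
            return False, i                  # moving from an empty peg
        disk = pegs[a][-1]
        if pegs[b] and pegs[b][-1] < disk:
            return False, i                  # larger disk placed on a smaller one
        pegs[b].append(pegs[a].pop())
    return len(pegs[2]) == n, -1

if __name__ == "__main__":
    # Sweep complexity: same logical structure, growing problem size.
    for n in range(1, 6):
        proposed = hanoi_solution(n)         # stand-in for a model's proposed answer
        solved, first_bad = validate_moves(n, proposed)
        print(f"n={n}: moves={len(proposed)}, solved={solved}, first_illegal={first_bad}")
```

In a real evaluation, the `proposed` move list would come from the model’s own output rather than the reference solver, which is what lets researchers inspect where in the reasoning chain things go wrong as complexity grows.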

Using this controlled framework, the study uncovered five key limitations in Large Reasoning Models (LRMs): inconsistent scaling, early reasoning collapse, failure on highly complex tasks, inconsistent algorithmic behavior, and unexpected underperformance on simpler problems.

According to the report, the results showed a consistent pattern. As task complexity increased, the models initially produced more elaborate reasoning. However, beyond a certain point, their responses became noticeably shorter, less precise, and disconnected from the task’s underlying logic, even when sufficient computational resources were available. This regression suggested that complexity alone can destabilize the internal reasoning processes of large AI models.

Additionally, when tested against standard Large Language Models (LLMs) using equal compute resources, LRMs showed three distinct performance patterns. On simple tasks, LLMs often outperformed LRMs by delivering direct answers more efficiently. For tasks of moderate complexity, LRMs had the advantage thanks to their extended reasoning steps. However, on highly complex problems, both models struggled significantly, pointing to deeper architectural limitations.


Apple Calls for Rethinking AI Evaluation Metrics

The study concludes with a call to rethink how AI performance is measured. Rather than focusing solely on whether a model produces the correct answer, the researchers emphasize the importance of evaluating the quality and structure of a model’s reasoning process.

Apple’s findings challenge the notion that more reasoning steps equal greater intelligence, highlighting instead the importance of reasoning that is not only accurate but also logically structured, context-aware, and properly scaled to the complexity of the task.
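As a rough illustration of that shift, the sketch below contrasts a conventional answer-only score with a process-aware score that also credits how many intermediate steps were valid before the first mistake. The weighting and formula are illustrative assumptions, not a metric proposed in the study.

```python
# Illustrative comparison of answer-only scoring vs. a process-aware score
# (assumed metric for illustration, not the paper's evaluation method).

from typing import List

def answer_score(final_correct: bool) -> float:
    """Conventional metric: 1.0 if the final answer is right, else 0.0."""
    return 1.0 if final_correct else 0.0

def process_score(step_valid: List[bool], final_correct: bool) -> float:
    """Blend final correctness with the fraction of reasoning steps that were
    valid up to the first error (the 50/50 weighting here is arbitrary)."""
    if not step_valid:
        return answer_score(final_correct)
    valid_prefix = 0
    for ok in step_valid:
        if not ok:
            break
        valid_prefix += 1
    step_fraction = valid_prefix / len(step_valid)
    return 0.5 * answer_score(final_correct) + 0.5 * step_fraction

# A model that lands on the right answer through flawed intermediate steps
# scores 1.0 on answer accuracy but noticeably lower on the process-aware metric.
print(answer_score(True))                              # 1.0
print(process_score([True, True, False, True], True))  # 0.75
```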

New Tests, Same Result: AI Reasoning Still Falls Short

Studies from MIT and top Korean institutions, such as Hanyang University and KAIST, are echoing Apple’s concerns about AI’s reasoning limitations.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) found that even minor modifications to familiar tasks caused large language models to fail, highlighting their limited capacity for adaptable reasoning.

Meanwhile, ScholarBench, a multilingual academic benchmark developed by Hanyang University and KAIST, tested state-of-the-art AI models across eight disciplines and found they achieved an average score of just 0.5. The results exposed significant weaknesses in abstract logic and domain-specific reasoning.

Similarly, the newly introduced LogiEval benchmark assessed models on multiple forms of logical reasoning: deductive, inductive, analogical, and abductive. While some models performed adequately on basic argument formats, they consistently broke down on complex problems, exposing fragile logic and a lack of deeper reasoning ability.


