AI Models Still Struggle with Basic Reasoning, Scoring a Meager 0.43% on Key Benchmark
The latest AI models from OpenAI and Anthropic have failed to crack the 1% mark on the ARC-AGI-3 benchmark, with GPT-5.5 scoring 0.43% and Opus 4.7 managing just 0.18%, highlighting significant gaps in their reasoning capabilities. This poor performance has major implications for developers and businesses relying on these models for critical tasks.
The AI community has been abuzz with the latest results from the ARC-AGI-3 benchmark, which puts the world's most advanced AI models to the test in interactive, turn-based game environments. The benchmark requires AI agents to explore an environment, form hypotheses about how it works, and carry out action plans without any instructions, making it a test of reasoning rather than recall. The latest models from OpenAI and Anthropic have failed to impress: GPT-5.5 scored a mere 0.43%, and Opus 4.7 managed a paltry 0.18%.
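For readers unfamiliar with this style of evaluation, the interaction pattern is roughly an observe-act-observe loop: the agent receives a grid observation, picks one of a small set of discrete actions, and sees the result, repeating until the level is solved or the budget runs out. The sketch below illustrates that loop in simplified form; the environment class, action names, and dynamics are hypothetical stand-ins, not the actual ARC-AGI-3 API.

```python
from dataclasses import dataclass
from typing import List
import random

# Hypothetical stand-in for an ARC-AGI-3-style environment. The real
# benchmark exposes its own interface; this only illustrates the
# observe -> act -> observe loop described in the article.
@dataclass
class Observation:
    frame: List[List[int]]   # grid of color indices
    score: int
    done: bool

class ToyEnvironment:
    """Minimal turn-based environment with a handful of discrete actions."""

    ACTIONS = ["ACTION1", "ACTION2", "ACTION3", "ACTION4", "ACTION5"]

    def __init__(self):
        self.steps = 0

    def reset(self) -> Observation:
        self.steps = 0
        return Observation(frame=[[0] * 8 for _ in range(8)], score=0, done=False)

    def step(self, action: str) -> Observation:
        # Placeholder dynamics: in the real games each action has
        # level-specific effects the agent must discover on its own.
        self.steps += 1
        return Observation(frame=[[0] * 8 for _ in range(8)],
                           score=0,
                           done=self.steps >= 50)

def run_episode(env: ToyEnvironment, choose_action) -> int:
    """Drive one episode: the agent gets no instructions, only observations."""
    obs = env.reset()
    while not obs.done:
        action = choose_action(obs)   # agent's policy, e.g. an LLM call
        obs = env.step(action)
    return obs.score

if __name__ == "__main__":
    env = ToyEnvironment()
    # A random policy standing in for a model-driven agent.
    print("episode score:", run_episode(env, lambda obs: random.choice(ToyEnvironment.ACTIONS)))
```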
The ARC-AGI-3 benchmark is particularly noteworthy because it evaluates models in a more nuanced way than traditional pass/fail benchmarks: it produces detailed reasoning traces that let developers see exactly where an agent goes wrong. In the case of GPT-5.5 and Opus 4.7, the analysis revealed three systematic error patterns holding them back. The most common is a failure to turn local effects into a working world model: the models can recognize what individual actions do, but never assemble those observations into an understanding of how the game as a whole works.
This limitation is starkly illustrated in the game cd82. Opus 4.7 correctly observes that ACTION3 rotates a container and that ACTION5 pours paint, but it never connects the two to realize that it must first align the bucket and then dip it to reproduce the target image. This inability to see the bigger picture should concern anyone deploying these models on open-ended tasks, because it suggests they may not generalize well to unfamiliar situations.
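To make the distinction concrete, here is a deliberately simplified sketch of the gap the trace describes: an agent that only knows what each action does in isolation versus one that keeps a small world model and plans the two-step sequence (align, then pour). The state fields, game logic, and action semantics below are invented for illustration and are not taken from cd82 itself.

```python
from dataclasses import dataclass

# Invented toy state loosely inspired by the cd82 description:
# a container must be rotated into the target orientation before pouring.
@dataclass
class ToyState:
    orientation: int        # 0-3, quarter turns
    target_orientation: int
    poured: bool = False

def step(state: ToyState, action: str) -> ToyState:
    if action == "ACTION3":   # rotate the container
        return ToyState((state.orientation + 1) % 4,
                        state.target_orientation, state.poured)
    if action == "ACTION5":   # pour the paint; only counts when aligned
        aligned = state.orientation == state.target_orientation
        return ToyState(state.orientation, state.target_orientation,
                        poured=state.poured or aligned)
    return state

def local_effects_policy(state: ToyState) -> str:
    # Knows each action's effect in isolation but never sequences them:
    # it pours immediately, regardless of alignment.
    return "ACTION5"

def world_model_policy(state: ToyState) -> str:
    # Connects the observations: rotate until aligned, then pour.
    if state.orientation != state.target_orientation:
        return "ACTION3"
    return "ACTION5"

def run(policy, state: ToyState, max_steps: int = 10) -> bool:
    for _ in range(max_steps):
        if state.poured:
            return True
        state = step(state, policy(state))
    return state.poured

if __name__ == "__main__":
    start = ToyState(orientation=0, target_orientation=2)
    print("local-effects agent solves it:", run(local_effects_policy, start))  # False
    print("world-model agent solves it:  ", run(world_model_policy, start))    # True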
The economics make the results harder to excuse. Running GPT-5.5 through the benchmark cost around $10,000 to achieve its lackluster score, which raises questions about the value proposition of these models, particularly when humans can solve the same tasks with no prior knowledge. The fact that no frontier model has cracked the 1% mark on the ARC-AGI-3 leaderboard is a sobering reminder of the challenges that still stand between today's systems and artificial general intelligence.
The implications are far-reaching for developers, businesses, and everyday users alike. For developers, the results underline the need for evaluation metrics that expose a model's specific weaknesses rather than reporting a single pass/fail score. For businesses, the combination of high cost and limited reasoning ability raises doubts about entrusting these models with decision-making and problem-solving. And for everyday users, the takeaway is simpler: tasks that demand complex, multi-step reasoning may still be beyond what these models can reliably handle.