AI Benchmarks Understate Agent Capabilities: Capped Compute Budgets Hide Gains of About 25%
A recent study finds that standard AI benchmarks systematically underestimate the capabilities of AI agents: on some tasks, success rates rose by about 25 percent when the agents were given a larger compute budget. The finding has implications for developers, businesses, and everyday users who rely on AI models and on benchmark scores to judge what those models can do.
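To make the mechanism concrete, here is a minimal sketch of how a capped budget can hide capability. It is not the study's methodology: the function `success_rate`, the list `TOKENS_NEEDED`, and every number in it are invented for illustration. The only assumption is that each task has some token cost at which the agent would solve it, so any task costing more than the cap is scored as a failure.

```python
def success_rate(tokens_needed, budget):
    """Return the fraction of tasks whose token cost fits within the budget.

    tokens_needed: one entry per task; None means the agent never solves it.
    """
    solved = sum(1 for need in tokens_needed if need is not None and need <= budget)
    return solved / len(tokens_needed)


# Hypothetical per-task token costs (illustrative values, not measured data).
TOKENS_NEEDED = [40_000, 90_000, 150_000, 300_000, 700_000, 900_000, None, None]

capped = success_rate(TOKENS_NEEDED, 100_000)    # evaluation with a capped budget
raised = success_rate(TOKENS_NEEDED, 1_000_000)  # same tasks, tenfold budget

print(f"capped budget: {capped:.0%}, tenfold budget: {raised:.0%}")
```

In this toy setup the agent's underlying ability is identical in both runs; only the cap changes, which is why a score measured at a single fixed budget can understate what the agent can do.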
In a study covering seven benchmarks, the UK's AI Security Institute (AISI) shows that standard AI evaluations systematically underestimate agent capabilities because they cap the compute budget. On software engineering tasks, success rates jumped by about 25 percent when the token budget was increased tenfold. Newer models benefit the most. Depending on the token budget, actual progress at the frontier is about 60 percent steeper than previous measurements suggested, according to AISI. The article "UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do" appeared first on The Decoder.