AI Model Reaches New Heights: 19-Day Nonstop Coding Session Costs $2,600
A recent benchmark has pushed the limits of AI coding capabilities, with a single run on one of its largest tasks lasting 19 days and costing $2,600. The test, known as MirrorCode, evaluates AI models' ability to recreate entire programs from scratch without access to the original source code.
The MirrorCode benchmark was developed by Epoch AI and METR. It consists of 25 target programs that cover a wide range of computer science domains, including Unix utilities, data serialization, bioinformatics, and cryptography. The AI models must reimplement each program so that it behaves exactly like the original, and their work is checked against hidden end-to-end tests that they have never seen. The top-performing model, Claude Opus 4.7, achieved a 56 percent solve rate and reimplemented a bioinformatics toolkit with 16,000 lines of code in just 14 hours. A human engineer would need between 2 and 17 weeks for the same job, which shows the potential of AI in software development.
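To make the idea of a hidden end-to-end test concrete, here is a minimal Python sketch. It assumes each test is a set of command-line arguments and standard input, paired with the output recorded from the original program. The names EndToEndCase, passes, and pass_fraction are hypothetical and chosen for illustration; the actual MirrorCode harness may work differently.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class EndToEndCase:
    """One hidden test: inputs plus the output recorded from the original program.

    Hypothetical structure for illustration, not the benchmark's real format.
    """
    stdin: str
    expected_stdout: str
    args: list = field(default_factory=list)
    expected_exit_code: int = 0


def passes(candidate_cmd, case, timeout_s=10.0):
    """Run the candidate on one case and compare its observable behavior."""
    try:
        result = subprocess.run(
            list(candidate_cmd) + list(case.args),
            input=case.stdin,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang counts as a failure
    return (result.stdout == case.expected_stdout
            and result.returncode == case.expected_exit_code)


def pass_fraction(candidate_cmd, cases):
    """Fraction of cases on which the candidate matches the recorded behavior."""
    if not cases:
        return 0.0
    return sum(passes(candidate_cmd, c) for c in cases) / len(cases)


# Toy stand-in for a reimplemented program: it upper-cases its standard input.
demo_candidate = [sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read().upper())"]
demo_cases = [
    EndToEndCase(stdin="abc\n", expected_stdout="ABC\n"),
    EndToEndCase(stdin="x", expected_stdout="X"),
    EndToEndCase(stdin="y", expected_stdout="y"),  # candidate gets this one wrong
]
fraction = pass_fraction(demo_candidate, demo_cases)
print(f"passed {fraction:.0%} of cases")
```

The candidate is treated as a black box: only its output and exit code are compared, which is why a reimplementation can be written in a different style or language from the original and still pass.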
The cost of running these tests is substantial, with one of the largest tasks in MirrorCode costing $2,600 for a single run. That is far above the $1 to $10 per-task cost cap of many other benchmarks and reflects the complexity and scope of MirrorCode. Not every success is that expensive, however: Claude Opus 4.7 finished the bioinformatics toolkit reimplementation for $251. On overall solve rate, GPT-5.5 and Gemini 3.1 Pro Preview trailed Claude Opus 4.7's 56 percent with 44 percent and 32 percent, respectively.
The MirrorCode benchmark has far-reaching implications for the field of software development. As AI models become more capable of handling complex coding tasks, they could free up human developers to focus on higher-level work that requires creativity and problem-solving skills, which may bring productivity gains and better software quality. The ability to reimplement entire programs from scratch could also help surface bugs, since any difference in behavior between the original and the reimplementation points to a defect in one of them.
The progress made by Claude Opus 4.7 and other models is a clear step forward from previous benchmarks. Earlier tests focused on smaller, more specialized tasks, whereas MirrorCode evaluates the ability of AI models to handle complex, real-world programming work. The results show that AI can now tackle tasks that were previously thought to be the exclusive domain of human developers.
Despite the impressive results, there is still room for improvement. The most complex tasks in the MirrorCode benchmark remain unsolved, and even the top-performing models struggle to fully reimplement certain programs. However, the fact that these models can pass 90 percent or more of the tests even when they fail to fully reimplement a program demonstrates their potential for handling demanding programming tasks.
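The gap between a solve rate and a per-test pass rate can be shown with a short Python sketch. It assumes that a task counts as solved only when every hidden test passes, which the article implies but does not state outright. The task names and test counts below are made up for illustration and are not MirrorCode results.

```python
def summarize(results, near_miss_threshold=0.9):
    """Summarize per-task test results.

    results maps a task name to (tests_passed, tests_total).
    Assumption: a task is "solved" only if all of its tests pass.
    """
    solved = []
    near_misses = []
    for name, (passed, total) in results.items():
        if total <= 0:
            continue  # no tests recorded, so the task cannot count as solved
        if passed == total:
            solved.append(name)
        elif passed / total >= near_miss_threshold:
            near_misses.append(name)
    solve_rate = len(solved) / len(results) if results else 0.0
    return {
        "solve_rate": solve_rate,
        "solved": sorted(solved),
        "near_misses": sorted(near_misses),
    }


# Made-up numbers for illustration only.
example_results = {
    "task_a": (200, 200),  # every test passes: solved
    "task_b": (188, 200),  # 94 percent of tests pass: unsolved near miss
    "task_c": (95, 200),   # well short of the threshold
    "task_d": (50, 50),    # solved
}
summary = summarize(example_results)
print(summary)
```

Under this all-or-nothing scoring, task_b contributes nothing to the solve rate even though it passes 94 percent of its tests, which is why a headline solve rate can understate how close a model came.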
The MirrorCode benchmark matters for AI model users and developers because it measures coding models on whole programs rather than isolated snippets. As the technology evolves, benchmarks like MirrorCode will help drive improvement by showing developers where models still fall short on complex tasks. For businesses and everyday users, the potential benefits range from improved software quality to increased productivity and reduced costs. MirrorCode is an important step forward in evaluating AI coding capabilities, and its results will shape expectations for the future of software development.