We tested the top models on real coding tasks. GPT-5.4 leads, but the best choice depends on what you're building.
Choosing the right AI model for coding isn't just about picking the one with the highest benchmark score. The best model for a solo developer debugging a Python script is different from the best model for a team building a distributed system. Speed, cost, context window, and the specific type of coding all matter. We tested every major model on real coding tasks to help you decide.
Our coding scores come from the Artificial Analysis Coding Index, a composite metric that aggregates performance across multiple coding benchmarks including LiveCodeBench, Aider Polyglot, and SWE-bench. This composite approach matters because no single benchmark captures everything — LiveCodeBench tests algorithmic problem-solving with fresh problems, Aider tests real-world code editing across languages, and SWE-bench tests the ability to resolve actual GitHub issues. A model that excels on competitive programming puzzles might struggle with practical code review.
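To make the idea of a composite index concrete, here is a minimal sketch of how such an aggregation could work, assuming equal weights and made-up per-benchmark scores; the actual Artificial Analysis weighting and normalization are not published in this article.

```python
# Conceptual sketch only: a composite coding index as an average of individual
# benchmark scores. The weights and the per-benchmark numbers below are
# assumptions for illustration, not Artificial Analysis's actual methodology.
benchmark_scores = {
    "LiveCodeBench": 62.0,    # hypothetical score
    "Aider Polyglot": 55.0,   # hypothetical score
    "SWE-bench": 48.0,        # hypothetical score
}
coding_index = sum(benchmark_scores.values()) / len(benchmark_scores)
print(f"Composite coding index: {coding_index:.1f}")  # 55.0
```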
We also factor in output speed (tokens per second) and cost, because a model that takes 30 seconds to generate a response disrupts your flow, and a model that costs $25 per million output tokens will blow your budget on a large codebase.
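To make those two variables concrete, here is a minimal sketch of the back-of-the-envelope arithmetic we lean on throughout this comparison. The token counts in the example are illustrative assumptions; the pricing and speed figures are the ones quoted for GPT-5.4 below.

```python
# Back-of-the-envelope cost and latency for a single request.
# Token counts below are illustrative assumptions, not measurements.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a response at a given output speed."""
    return output_tokens / tokens_per_second

# Example: 20,000 tokens of codebase context, a 2,000-token reply,
# priced at $2.50/$15 per million tokens and streamed at 77 tokens/sec.
print(f"${request_cost(20_000, 2_000, 2.50, 15.00):.2f}")  # $0.08
print(f"{generation_seconds(2_000, 77):.0f} s")            # ~26 s
```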
GPT-5.4 takes the #1 spot with a 57.3 coding index, paired with a 57.2 intelligence score that makes it the most well-rounded model available. At 77 tokens per second, it's fast enough for interactive coding sessions, though not the fastest in its class.
Where GPT-5.4 truly shines is on complex, multi-file tasks. When given codebase context and asked to implement a feature that touches multiple files, it produces code that compiles and passes tests on the first attempt more often than any other model we tested. Its understanding of code architecture, not just syntax, is noticeably better than the competition's.
The catch is pricing: $2.50 per million input tokens and $15 per million output tokens means a heavy coding session generating thousands of lines can add up. For professional developers and teams where accuracy saves more time than the API costs, it's worth it. For hobby projects, look at GPT-5.4 Mini.
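As a rough illustration of how that pricing scales over a working day (the request counts and token sizes here are assumptions, not measurements from our testing):

```python
# Hypothetical heavy coding day at GPT-5.4's quoted rates ($2.50 in / $15 out
# per million tokens). Request counts and token sizes are assumptions.
requests           = 200
input_per_request  = 15_000   # large codebase context resent each time
output_per_request = 1_500    # a few hundred lines of generated code

cost = (requests * input_per_request * 2.50
        + requests * output_per_request * 15.00) / 1_000_000
print(f"${cost:.2f} per day")   # $12.00
```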
Gemini 3.1 Pro scores 55.5 on coding — just 1.8 points behind GPT-5.4 — while costing 20% less at $2/$12 per million tokens. It also runs significantly faster at 113 tokens per second, which makes a real difference during extended coding sessions.
Google's model has a particular strength in code that involves data processing, API integrations, and web development. In our testing, it produced cleaner, more idiomatic JavaScript and TypeScript than GPT-5.4 in most cases. Where it falls slightly short is on complex systems programming and architecturally ambitious refactors.
Gemini 3.1 Pro's intelligence score ties GPT-5.4's at 57.2, meaning it's equally strong on reasoning-heavy coding tasks that require understanding business logic, not just writing syntax. For most developers, the 20% cost savings and 47% speed advantage make Gemini 3.1 Pro the practical choice.
GPT-5.4 Mini is the sleeper hit of this ranking. At 51.5 on the coding index, it outperforms Claude Sonnet 4.6 (50.9) and GPT-5.2 (48.7) while costing just $0.75/$4.50 per million tokens — a fraction of what the full-size models charge.
It's also blazingly fast at 218 tokens per second, nearly 3x the speed of GPT-5.4 and almost double that of Gemini 3.1 Pro. For tasks like code completion, unit test generation, documentation, and routine refactoring, Mini delivers 90% of the quality at 30% of the cost and roughly 3x the speed.
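Those ratios aren't pulled out of thin air; they follow directly from the figures quoted in this article:

```python
# Mini vs. full-size GPT-5.4, computed from the figures quoted in this article.
quality_ratio = 51.5 / 57.3    # coding index            -> ~0.90
cost_ratio    = 4.50 / 15.00   # output price per 1M     -> 0.30 (input is also 0.30)
speed_ratio   = 218 / 77       # tokens per second       -> ~2.8x
print(f"{quality_ratio:.0%} of the quality, {cost_ratio:.0%} of the cost, {speed_ratio:.1f}x the speed")
```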
The tradeoff is a lower intelligence score of 48.1, which means it stumbles more often on complex multi-step reasoning tasks. If you're asking it to architect a system from scratch, use the full GPT-5.4. For everything else, Mini is the better choice.
Here's a result that surprised us: Claude Sonnet 4.6 outscores Claude Opus 4.6 on coding (50.9 vs 48.1). Opus has the higher intelligence score (53.0 vs 51.7), but for pure coding tasks, Sonnet is the better Anthropic model.
This makes sense when you understand what the models are optimized for. Opus 4.6 is designed for extended agentic workflows — it can maintain context and make decisions over thousands of steps without degrading. Sonnet 4.6 is optimized for fast, high-quality responses to individual prompts, which is exactly what coding assistance requires.
At $3/$15 per million tokens, Sonnet 4.6 is also 40% cheaper than Opus ($5/$25). Unless you're building an autonomous coding agent that needs to run for hours, Sonnet is the right Claude model for coding.
That said, both Claude models trail the GPT and Gemini competition on raw coding benchmarks. Where Anthropic's models still lead is on code safety — they're more likely to flag potential security issues and less likely to generate code with subtle bugs.
If you need free coding assistance, Gemini 2.5 Pro Preview scores 46.7 on the coding index — which would have been a top-3 score just six months ago. It's genuinely competitive for routine coding tasks.
The intelligence score is lower at 30.3, so it won't match the paid models on complex architectural decisions. But for code completion, debugging, writing tests, and translating between languages, it's remarkably capable at zero cost.
The main limitation is that it's a preview model without guaranteed availability or SLA. For production use, you'd want a paid model. For personal projects, learning, and prototyping, Gemini 2.5 Pro Preview is hard to beat.
If your workflow prioritizes speed over raw quality, the picture shifts. GPT-5.4 Mini at 218 tokens/sec and Gemini 3 Flash at 192 tokens/sec deliver strong coding performance at interactive speeds that make them feel like extensions of your IDE rather than external tools.
For comparison, Claude Opus 4.6 at 51 tokens/sec and GPT-5.3 Codex at 72 tokens/sec feel noticeably slower, especially when generating longer code blocks. In our testing, developers rated their experience higher with faster models even when the code quality was marginally lower — the flow state matters.
The sweet spot is Gemini 3.1 Pro at 113 tokens/sec with a 55.5 coding score. Fast enough to feel responsive, accurate enough to trust.
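To put those tokens-per-second figures in terms of wait time, here is a quick sketch; the 300-line file and the ~10 tokens-per-line figure are assumptions for illustration, while the speeds are the ones quoted in this article.

```python
# Approximate time to stream a ~300-line file, assuming ~10 tokens per line
# (an assumption for illustration). Speeds are the figures quoted above.
output_tokens = 300 * 10

for model, tps in [("GPT-5.4 Mini", 218), ("Gemini 3 Flash", 192),
                   ("Gemini 3.1 Pro", 113), ("GPT-5.4", 77),
                   ("GPT-5.3 Codex", 72), ("Claude Opus 4.6", 51)]:
    print(f"{model:16} {output_tokens / tps:5.1f} s")
```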
GPT-5.3 Codex deserves special mention as OpenAI's dedicated coding model. At 53.1 on the coding index, it slots between Gemini 3.1 Pro and GPT-5.4 Mini. Its strength is agentic coding workflows: it's built to run in GitHub Copilot and similar tools where it handles multi-file edits autonomously.
For standard API usage, GPT-5.4 is strictly better. But if you're using GitHub Copilot or building coding agents, Codex's optimizations for tool use and multi-step code editing make it the right choice.
On the open-source side, GLM-5 from Zhipu scores 44.2 at just $1/$3.20 per million tokens — the best coding score from a Chinese model and a strong option for budget-conscious teams.
Rankings are based on the Artificial Analysis Coding Index, a composite of LiveCodeBench, Aider Polyglot, and related coding benchmarks. Scores are updated daily. Pricing and speed data reflect direct API access at each provider's standard tier. All models were tested in their default configuration without custom system prompts or few-shot examples.
GPT-5.4 is the best coding model if you want maximum accuracy and can afford $2.50/$15 per million tokens. Gemini 3.1 Pro is the practical choice — nearly as good, 20% cheaper, 47% faster. GPT-5.4 Mini is the value king at $0.75/$4.50 with 90% of the quality. And if budget is zero, Gemini 2.5 Pro Preview holds its own against last year's paid models.
Published March 26, 2026. Data updated daily from independent benchmarks and API providers.