Speed vs intelligence trade-offs, with picks for every use case
There are two kinds of speed that matter in AI: throughput (tokens per second) and latency (time to first token). They measure different things and matter for different use cases. A model that outputs 800 tokens per second but takes 4 seconds to start is great for batch processing and terrible for chat.
In 2026, the speed landscape has stratified sharply. Purpose-built speed models like Mercury reach 835 tok/s. Frontier intelligence models like GPT-5 sit around 77 tok/s. And the sweet spot -- models that are both fast and smart -- is where the interesting choices live.
This review breaks down the fastest models, when speed actually matters, and how to find the right trade-off between thinking and talking.
Throughput (tokens per second) tells you how fast a model generates text once it starts. High throughput matters when you're generating long outputs -- a 2,000-token response at 200 tok/s takes 10 seconds; at 50 tok/s, it takes 40 seconds.
Latency (TTFT, time to first token) tells you how long you wait before anything appears. Low latency matters for interactive experiences where users are watching the cursor. A chat that starts responding in 5ms feels instant. One that takes 150ms has a noticeable pause.
For code completion and autocomplete, latency is everything. For content generation and batch processing, throughput matters more. For chat applications, both contribute to the "feel" of the experience.
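The two dimensions combine into one number users actually feel: time until the full response is done. A minimal sketch of that arithmetic (the `total_time` helper and the example figures are illustrative, not benchmark results):

```python
def total_time(ttft_s: float, throughput_tps: float, output_tokens: int) -> float:
    """Seconds from request to last token: wait for the first token, then stream."""
    return ttft_s + output_tokens / throughput_tps

# A short chat reply (150 tokens) on two hypothetical models:
fast_starter = total_time(ttft_s=0.02, throughput_tps=120, output_tokens=150)  # ~1.27s
slow_starter = total_time(ttft_s=3.9, throughput_tps=835, output_tokens=150)   # ~4.08s
print(f"low-latency model:     {fast_starter:.2f}s")
print(f"high-throughput model: {slow_starter:.2f}s")
```

For short outputs, the low-latency model finishes first despite generating far fewer tokens per second; only on long outputs does raw throughput win back the head start.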
Here's how the fastest models compare across both dimensions of speed, with intelligence for context.
| Model | Speed (tok/s) | Latency (TTFT) | Intelligence | Price/1M |
|---|---|---|---|---|
| Mercury 2 | 835 tok/s | 3.86s | 32.8 | $0.25 |
| Nemotron 3 Super | 395 tok/s | 0.57s | 36.0 | $0.10 |
| Ministral 3 3B | 292 tok/s | 0.26s | 11.2 | $0.10 |
| gpt-oss-20b | 288 tok/s | 0.47s | 24.5 | $0.03 |
| gpt-oss-120b | 275 tok/s | 0.50s | 33.3 | $0.04 |
| Meta Llama 3.1 8B | 192 tok/s | -- | 11.8 | $0.02 |
| Gemini 3 Flash | 191 tok/s | 4.79ms | 46.4 | $0.50 |
| Grok 4 | 124 tok/s | 16.0ms | 48.5 | $3.00 |
| Gemini 3.1 Pro | 117 tok/s | 21.9ms | 57.2 | $2.00 |
| GPT-5 | 77 tok/s | 147ms | 57.2 | $1.25 |
Inception's Mercury models dominate raw throughput at 835 tok/s, more than double any other model. But there's a catch: the intelligence score of 32.8 puts it well below the frontier.
Mercury is built for a specific use case: high-volume, relatively simple text processing. Think real-time content moderation, fast summarization, or any application where you need to process thousands of requests per minute and "good enough" quality is acceptable.
The 3.86-second TTFT is also notable -- that's high latency before the first token appears. Mercury is a sprinter, not a quick-draw. It takes a moment to warm up, then blazes through generation.
The real winners are models that balance speed with intelligence. Three stand out:
Gemini 3 Flash Preview at 191 tok/s combines high speed with a 46.4 intelligence score and 42.6 coding score. At $0.50/1M, it's the model most developers should be using for speed-sensitive applications. Its 4.8ms TTFT means it starts responding essentially instantly.
Grok 4 from xAI outputs at 124 tok/s with 48.5 intelligence and a 16ms TTFT. It's more expensive at $3.00/1M but offers a 2M context window that no other fast model can match.
Gemini 3.1 Pro at 117 tok/s delivers top-tier 57.2 intelligence. It's not a "speed model," but it's fast enough for most interactive applications while giving you the best reasoning available.
Different applications have different speed requirements:
Code completion and autocomplete: You need sub-100ms TTFT and at least 100 tok/s. Gemini 3 Flash (4.8ms TTFT, 191 tok/s) is the clear winner. Users perceive anything over 200ms as laggy in autocomplete.
Chat applications: TTFT under 50ms and 80+ tok/s make for a responsive chat experience. Gemini 3.1 Pro (22ms, 117 tok/s) and Grok 4 (16ms, 124 tok/s) both qualify.
Live translation or captioning: You need sustained high throughput more than low latency. Nemotron 3 Super (395 tok/s) is ideal for continuous streaming scenarios.
Batch processing: Throughput is all that matters. Mercury (835 tok/s) processes data fastest, but gpt-oss-20b at 288 tok/s and $0.03/1M might be more cost-effective for large batches.
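The requirements above can be applied mechanically. Here's a minimal sketch that filters a few of the models from the comparison table by a use case's thresholds (the `candidates` helper and its parameter names are illustrative, not a real API; figures come from the table above):

```python
MODELS = {
    # name: (tok/s, TTFT in seconds, intelligence score, $/1M tokens)
    "Mercury 2":        (835, 3.86,   32.8, 0.25),
    "Nemotron 3 Super": (395, 0.57,   36.0, 0.10),
    "Gemini 3 Flash":   (191, 0.0048, 46.4, 0.50),
    "Grok 4":           (124, 0.016,  48.5, 3.00),
    "Gemini 3.1 Pro":   (117, 0.0219, 57.2, 2.00),
    "GPT-5":            (77,  0.147,  57.2, 1.25),
}

def candidates(max_ttft_s=None, min_tps=None, min_intel=None):
    """Return models meeting every requirement that was specified."""
    out = []
    for name, (tps, ttft, intel, _price) in MODELS.items():
        if max_ttft_s is not None and ttft > max_ttft_s:
            continue
        if min_tps is not None and tps < min_tps:
            continue
        if min_intel is not None and intel < min_intel:
            continue
        out.append(name)
    return out

# Autocomplete: sub-100ms TTFT and at least 100 tok/s
print(candidates(max_ttft_s=0.100, min_tps=100))
```

Under these thresholds, the autocomplete filter keeps only the models that are both quick to start and quick to stream; Mercury's 3.86s TTFT rules it out despite its class-leading throughput.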
Faster models tend to be cheaper because speed often correlates with smaller model size. Mercury at $0.25/1M processes 835 tok/s. GPT-5 at $1.25/1M processes 77 tok/s. Per token, GPT-5 costs 5x more and takes 10x longer.
But tokens-per-second isn't the whole cost picture. If you need a frontier model to get the job right on the first try (avoiding expensive retries), the slower, smarter model can be cheaper in total cost of ownership.
The calculation is simple: at $0.25 versus $1.25 per million tokens, the break-even is five attempts. If the cheap model averages six tries to produce an acceptable answer, it costs more in total than the $1.25 model getting it right the first time. For simple tasks, the fast model wins. For complex tasks, the smart model wins.
The optimal strategy uses both: route simple queries to fast/cheap models and complex queries to smart/slow ones.
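That break-even can be sketched in a few lines. The `expected_cost` helper, the 50,000-token job size, and the attempt counts are hypothetical; the prices come from the table above:

```python
def expected_cost(price_per_1m: float, tokens: int, expected_attempts: float) -> float:
    """Total token cost across attempts, in dollars."""
    return price_per_1m * (tokens / 1_000_000) * expected_attempts

JOB_TOKENS = 50_000  # output tokens for one job, hypothetical

cheap  = expected_cost(0.25, JOB_TOKENS, expected_attempts=6)  # fast model, often retried
strong = expected_cost(1.25, JOB_TOKENS, expected_attempts=1)  # frontier model, first try
print(f"cheap model, 6 attempts: ${cheap:.4f}")   # $0.0750
print(f"strong model, 1 attempt: ${strong:.4f}")  # $0.0625
```

The ratio of the two prices (1.25 / 0.25 = 5) is the attempt count at which the cheap model stops being cheap, which is why a router that sends only genuinely simple queries to the fast model pays off.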
Speed measurements (tokens per second and TTFT) reflect median values observed across standardized prompts. These measurements can vary based on server load, prompt complexity, and output length. All speed data is updated daily. Pricing reflects current API provider rates.
For most developers building interactive applications, Gemini 3 Flash Preview at $0.50/1M offers the best combination of speed (191 tok/s), intelligence (46.4), and low latency (4.8ms). For chat applications where quality matters more, Gemini 3.1 Pro at 117 tok/s gives you top-tier intelligence without feeling slow. And for raw throughput on simple tasks, NVIDIA Nemotron 3 Super at 395 tok/s and $0.10/1M is unbeatable on value.
Published April 3, 2026. Data updated daily from independent benchmarks and API providers.