Mercury 2 hits 894 tok/s. But the fastest model isn't always the best choice.
Speed matters more than most people think. A 200ms response feels instant. A 2-second response feels slow. A 10-second response feels broken. For interactive applications — chatbots, code assistants, real-time search — output speed directly determines user experience. We ranked every model by tokens per second to find the fastest options at every quality level.
Tokens per second (tok/s) measures how fast the model generates output after it starts responding. A 200 tok/s model producing a 400-token response takes 2 seconds. A 50 tok/s model takes 8 seconds for the same response.
There's also latency (time to first token / TTFT) — how long you wait before the model starts responding at all. A model with great tok/s but 3-second TTFT still feels slow for the first few seconds.
For chat interfaces, tok/s matters most because users see text streaming in. For API batch processing, total completion time (latency + generation) matters more.
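The two-component timing above can be sketched in a few lines. The TTFT and token counts here are illustrative assumptions, not measured values:

```python
def total_response_seconds(ttft_s: float, tokens: int, tok_per_s: float) -> float:
    """Wall-clock response time: wait for the first token, then stream the rest."""
    return ttft_s + tokens / tok_per_s

# Illustrative figures: a 400-token reply at 200 tok/s vs 50 tok/s,
# both with an assumed 0.3 s time-to-first-token.
fast = total_response_seconds(0.3, 400, 200)  # 2.3 s
slow = total_response_seconds(0.3, 400, 50)   # 8.3 s
```

For batch workloads, this total is the number that matters; for streaming chat, the `tokens / tok_per_s` term dominates perceived speed once the first token arrives.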
Tier 1 — Blazing (200+ tok/s): Mercury 2 (894), Granite 3.3 8B (402), Gemini 2.5 Flash-Lite (389), NVIDIA Nemotron 3 Super (365), GPT-5.4 Mini (218), GPT-5.4 Nano (216).
Tier 2 — Fast (100-200 tok/s): Gemini 3 Flash (192), Gemini 3.1 Pro (113), GPT-5.2 Codex (99), GPT-5.1 (92).
Tier 3 — Moderate (60-100 tok/s): GPT-5.4 (77), GPT-5.3 Codex (72), Claude Sonnet 4.6 (71), GPT-5.2 (70), GLM-5 (66).
Tier 4 — Slow (<60 tok/s): Claude Opus 4.5 (57), Claude Opus 4.6 (51).
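As a sketch, the tiers reduce to a simple threshold function. 60 tok/s is used as the Moderate/Slow boundary here, since both Opus models measure just above 50:

```python
def speed_tier(tok_per_s: float) -> str:
    """Bucket a measured output speed into the four tiers used in this ranking."""
    if tok_per_s >= 200:
        return "Blazing"
    if tok_per_s >= 100:
        return "Fast"
    if tok_per_s >= 60:
        return "Moderate"
    return "Slow"

speed_tier(894)  # "Blazing" (Mercury 2)
speed_tier(71)   # "Moderate" (Claude Sonnet 4.6)
speed_tier(51)   # "Slow" (Claude Opus 4.6)
```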
The pattern is clear: smaller and newer models are faster. Reasoning models (which think before answering) tend to be slower but more accurate.
The sweet spots where you get both speed and quality:
GPT-5.4 Mini (218 tok/s, 48.1 intel, 51.5 coding) — the best balance overall. Fast enough for real-time, smart enough for serious work.
Gemini 3.1 Pro (113 tok/s, 57.2 intel) — the fastest frontier model. If you need maximum intelligence without sacrificing speed.
Gemini 3 Flash (192 tok/s, 46.4 intel) — Google's speed-optimized model. Trades some intelligence for near-instant responses.
GPT-5.4 Nano (216 tok/s, 44.4 intel) — OpenAI's lightweight model. Best for high-volume, low-complexity tasks where every millisecond counts.
Claude Opus 4.6 at 51 tok/s is the slowest frontier model. But speed isn't everything. Opus excels at tasks where thinking time is an asset: complex code review, lengthy document analysis, multi-step planning.
For agentic workflows where the model runs autonomously for minutes or hours, the difference between 51 tok/s and 218 tok/s is negligible — the bottleneck is reasoning quality, not generation speed.
Rule of thumb: if your task involves real-time user interaction, prioritize speed. If your task runs in the background, prioritize quality.
Speed measured as median output tokens per second (P50) over the past 72 hours by Artificial Analysis. Measurements taken on standard API endpoints, not optimized inference setups. Your actual speed may vary based on provider load and request parameters.
GPT-5.4 Mini offers the best speed-quality balance at 218 tok/s with strong benchmarks. Gemini 3.1 Pro is the fastest frontier model. For pure speed with decent quality, Mercury 2 is untouchable at 894 tok/s. For most applications, anything above 100 tok/s is fast enough.
Published April 3, 2026. Data updated daily from independent benchmarks and API providers.