A 78% non-hallucination rate sets a new record. But a 48.5-point intelligence score leaves it outside the top 5.
xAI's Grok 4.20 makes an unusual trade: it's the most honest AI model ever measured, but it's not the most capable. With a 78% non-hallucination rate on the Artificial Analysis Omniscience test, it sets a record for truthfulness. But its 48.5-point intelligence score puts it at #12-13 overall, behind models from OpenAI, Google, Anthropic, and even Xiaomi. Is honesty worth the capability trade-off?
Grok 4.20's headline feature is truthfulness. Where other models confidently state incorrect information, Grok more often says 'I'm not sure' or provides caveats. The 78% non-hallucination rate on the Omniscience test is significantly ahead of the competition.
For applications where wrong answers are worse than no answer — medical information, legal research, financial advice, education — this matters enormously. A model that admits uncertainty is more trustworthy than one that sounds confident while being wrong.
The multi-agent variant takes this further, using multiple reasoning paths and cross-checking its own work before producing a final answer.
At 48.5 on the Intelligence Index, Grok 4.20 trails the leaders significantly. Gemini 3.1 Pro and GPT-5.4, both at 57.2, hold an 18% advantage; Claude Opus 4.6, at 53.0, is 9% ahead.
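Those gap percentages are measured relative to Grok's score. A quick sanity check of the arithmetic, using only the index values quoted above:

```python
# Illustrative check of the Intelligence Index gaps quoted above.
# The gap is each leader's score relative to Grok 4.20's 48.5.
scores = {
    "Grok 4.20": 48.5,
    "Claude Opus 4.6": 53.0,
    "GPT-5.4 / Gemini 3.1 Pro": 57.2,
}

grok = scores["Grok 4.20"]
for model, score in scores.items():
    if model == "Grok 4.20":
        continue
    gap_pct = (score - grok) / grok * 100
    print(f"{model}: +{gap_pct:.0f}% over Grok 4.20")
```

This prints gaps of +9% and +18%, matching the figures in the text.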
This gap shows on complex reasoning tasks, multi-step problems, and tasks requiring deep knowledge synthesis. Grok handles straightforward questions well but struggles more than the frontier models on the kinds of problems that require creative reasoning.
The coding score of 42.2 tells a similar story: competitive with models like Qwen3.5 and Gemini 3 Flash, but behind the GPT-5.x and Claude lineups.
The pricing story is positive. At $20/$60 per million tokens for all three variants (reasoning, non-reasoning, multi-agent), Grok 4.20 is 33% cheaper than Grok 3 on input and 60% cheaper on output.
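What the discounts mean for a real request can be sketched with the rates above. Note the Grok 3 rates here ($30 input / $150 output per million tokens) are inferred from the stated 33% and 60% discounts, not taken from a price sheet:

```python
# Back-of-the-envelope cost comparison at per-1M-token rates.
# Grok 4.20 rates are from the article; Grok 3 rates are inferred
# from the quoted 33% (input) and 60% (output) discounts.
def request_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Dollar cost of one request given per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# A request with 10k input tokens and 2k output tokens:
grok_420 = request_cost(10_000, 2_000, 20, 60)   # $0.32
grok_3   = request_cost(10_000, 2_000, 30, 150)  # $0.60 (inferred rates)
print(f"Grok 4.20: ${grok_420:.2f}  Grok 3: ${grok_3:.2f}")
```

For this request shape, the new pricing roughly halves the per-call cost; output-heavy workloads benefit most, since the output discount is the larger of the two.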
Output speed of 157.8 tokens per second is excellent — more than double the median for reasoning models and faster than GPT-5.4 (77 tok/s) and Claude Opus 4.6 (51 tok/s). For interactive use, Grok feels notably snappier than the frontier models.
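To see what those throughput numbers mean in wall-clock terms, here is a rough decode-time estimate for a 1,000-token answer at the median speeds cited above (decode only; it ignores time-to-first-token and network latency):

```python
# Approximate time to stream a full response at the median output
# speeds quoted above (tokens/second). Decode time only.
speeds = {"Grok 4.20": 157.8, "GPT-5.4": 77.0, "Claude Opus 4.6": 51.0}

response_tokens = 1_000  # a longish answer
for model, tok_per_s in speeds.items():
    print(f"{model}: {response_tokens / tok_per_s:.1f} s")
```

At these rates a 1,000-token answer streams in about 6 seconds on Grok versus roughly 13 and 20 seconds on the two frontier models, which is why the speed difference is noticeable in interactive use.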
The 2M token context window is among the largest available, useful for processing very long documents or extensive codebases.
Grok 4.20 is the right choice if truthfulness is your top priority. Education platforms, fact-checking tools, medical information systems, and any application where hallucinations cause real harm will benefit from Grok's honesty-first design.
It's also a strong choice for users who value speed (157.8 tok/s) and large context (2M tokens) over maximum benchmark performance.
It's not the right choice if you need the absolute best performance on coding, math, or complex reasoning. For those tasks, GPT-5.4, Gemini 3.1 Pro, or Claude Opus 4.6 are better despite being less honest.
Non-hallucination rate is from the Artificial Analysis Omniscience benchmark. Intelligence and Coding scores are from the AA indices. Speed figures are AA median (P50) measurements. Pricing is from xAI's published API rates.
Grok 4.20 doesn't compete with GPT-5.4 or Gemini on raw intelligence. What it offers instead is unprecedented honesty and very fast output at reasonable prices. For applications where trust matters more than capability, it's the best choice. For everything else, the frontier models remain ahead.
Published May 7, 2026. Data updated daily from independent benchmarks and API providers.