Benchmark · March 30, 2026 · 1 min read

AI Models Fabricate Visual Details 70-80% of the Time, Exposing Critical Flaw in Benchmarking

A recent study finds that top multimodal AI models, including GPT-5 and Gemini 3 Pro, confidently describe images they have never seen, with alarming implications for medical and safety-critical applications. This 'mirage effect' appeared in every model tested: the models reached 70-80% of their benchmark results without any visual input, which has serious consequences for developers and users who rely on them.

Multimodal AI models such as GPT-5, Gemini 3 Pro, and Claude Opus 4.5 generate detailed image descriptions and medical diagnoses even when no image is provided. A Stanford study shows that common benchmarks obscure the problem. The article "AI models confidently describe images they never saw, and benchmarks fail to catch it" appeared first on The Decoder.
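One way to read the 70-80% figure is as the ratio between a model's benchmark score with the images withheld and its score with the images supplied. The sketch below illustrates that arithmetic only. The function name and the counts are hypothetical and are not taken from the study, and the study's own metric may be defined differently.

```python
def blind_score_ratio(correct_with_images: int, correct_without_images: int) -> float:
    """Share of the image-supplied score that survives when images are withheld.

    Both arguments are counts of correct answers on the same set of questions.
    This is an assumed reading of the reported figure, not the study's formula.
    """
    if correct_with_images <= 0:
        raise ValueError("the score with images must be positive")
    return correct_without_images / correct_with_images


# Hypothetical counts for illustration: 60 correct answers with images,
# 45 correct answers on the same questions with no image attached.
ratio = blind_score_ratio(correct_with_images=60, correct_without_images=45)
print(f"Score retained without images: {ratio:.0%}")
```

A ratio close to 1 would suggest the benchmark can largely be answered from the question text alone, so a high headline score says little about whether the model actually used the image.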

Models Mentioned

Claude Opus 4.5 (Reasoning)

Gemini 3 Pro Preview (high)

GPT-5.4 nano (xhigh)
