Benchmark | March 30, 2026 | 1 min read
AI Models Reach 70-80% of Benchmark Scores Without Seeing the Image, Exposing Critical Flaw in Benchmarking
A recent study finds that top multimodal AI models, including GPT-5 and Gemini 3 Pro, confidently describe images they were never shown, with alarming implications for medical and safety-critical applications. This 'mirage effect' appeared in every model tested: the models reached 70-80% of their benchmark scores without any visual input, which has serious consequences for developers and users who rely on them.
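To make the 70-80% figure concrete, here is a minimal sketch of how such a ratio could be computed, assuming it means the share of a model's with-image score that survives when the images are withheld. The `BenchmarkRun` type, the `blind_score_ratio` function, and the counts in the usage example are hypothetical illustrations, not the study's method or data.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    """Hypothetical container for one model's results on one benchmark."""
    total: int                  # number of benchmark questions
    correct_with_image: int     # correct answers when the image is supplied
    correct_without_image: int  # correct answers when the image is withheld


def blind_score_ratio(run: BenchmarkRun) -> float:
    """Share of the with-image score that is retained without any image.

    Assumption: this is one plausible reading of "70-80% of benchmark
    results without visual input"; the study may define it differently.
    """
    if run.total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= run.correct_without_image <= run.total:
        raise ValueError("correct_without_image out of range")
    if not 0 < run.correct_with_image <= run.total:
        raise ValueError("correct_with_image must be in 1..total")
    # Both accuracies share the same denominator, so it cancels out.
    return run.correct_without_image / run.correct_with_image


if __name__ == "__main__":
    # Made-up counts for illustration only.
    example = BenchmarkRun(total=200, correct_with_image=160,
                           correct_without_image=120)
    print(f"Retained without images: {blind_score_ratio(example):.0%}")
```

A ratio close to 1 would suggest the benchmark can largely be solved from the text alone, which is one way a benchmark could hide the fact that a model is not actually using the image.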
Multimodal AI models like GPT-5, Gemini 3 Pro, and Claude Opus 4.5 generate detailed image descriptions and medical diagnoses even when no image is provided. A Stanford study shows that common benchmarks obscure the problem. The article "AI models confidently describe images they never saw, and benchmarks fail to catch it" appeared first on The Decoder.