A recent intelligence benchmark has placed today’s most advanced AI models under the same kind of cognitive scrutiny used to assess exceptional human thinkers, and the outcome tells a story of contrasts between raw verbal reasoning and multimodal complexity.
The data comes from the Mensa Norway IQ test, a well-known measure of high-level reasoning, where scores above 130 often mark out genius-level ability. Although the test was designed for people, researchers have begun using it to compare how artificial intelligence systems perform when asked to solve the same kinds of abstract problems humans struggle with.
At the top of the current rankings sits OpenAI’s o3, which scored 133, comfortably above the 130 threshold the test associates with genius-level ability. Not far behind is Gemini Thinking, Google’s language-focused model, which reached 128. These results suggest that, at least in abstract problem solving through words and logic, some AI systems are not just matching human performance but quietly exceeding it.
The upper tier includes OpenAI’s o4-mini with a score of 126, Gemini Pro at 124, and both Claude-4 Opus and Claude-4 Sonnet tied at 118. Even models just below this line, like Grok-3 Think (111), Llama-4 (107), and DeepSeek-R1 (105), are operating within or above the average human range.
But the drop-off begins sharply as models shift from text-only processing to visual capabilities. Systems like Claude-4 Sonnet Vision, GPT-4.5, Grok-3, and DeepSeek-V3, all scoring 97, sit just below the human average of 100. Beneath them, Gemini Pro Vision landed at 96, while GPT-4 Omni (Verbal) trailed at 91, despite its verbal focus.
OpenAI’s o4-mini-high reached 90, but the decline continues. Visual variants such as o3-vision and Bing’s AI scored 86, followed by Mistral (85) and Claude-4 Opus Vision (80). Further down the list, models like OpenAI o1-pro Vision (79) and Llama-3 Vision (70) show a widening gap between multimodal ambition and actual performance on reasoning tasks.
At the lowest end sit GPT-4 Omni Vision and Grok-3 Think Vision, managing only 63 and 62 respectively — scores that, in human terms, would reflect severe limitations in pattern recognition and logic.
What becomes clear through this ranking is that text-based reasoning remains AI’s strong suit. Models trained purely on language continue to outperform their multimodal counterparts when faced with symbol-based puzzles and logic problems. While vision-enabled AIs might be better suited for real-world perception, they appear less capable when reasoning is abstracted from context and stripped to logic alone.
These findings underscore a split in the development arc of artificial intelligence. Verbal models are now working at, and sometimes above, human cognitive levels. But giving machines the ability to “see” doesn’t yet mean they understand, at least not in the ways intelligence is traditionally measured.
Model | Mensa Norway IQ Test Score |
---|---|
OpenAI o3 | 133 |
Gemini Thinking | 128 |
OpenAI o4-mini | 126 |
Gemini Pro | 124 |
Claude-4-Opus | 118 |
Claude-4-Sonnet | 118 |
Grok-3-Think | 111 |
Llama-4 | 107 |
DeepSeek-R1 | 105 |
OpenAI o1-pro | 102 |
Average Human | 100 |
Claude-4-Sonnet-Vision | 97 |
gpt-4.5 | 97 |
deepseek-v3 | 97 |
Grok-3 | 97 |
Gemini Pro (Vision) | 96 |
GPT4 Omni (Verbal) | 91 |
OpenAI o4-mini-high | 90 |
OpenAI o3-vision | 86 |
Bing | 86 |
Mistral | 85 |
Claude-4-Opus-Vision | 80 |
OpenAI o1-pro-vision | 79 |
Llama-3 (Vision) | 70 |
GPT4 Omni (Vision) | 63 |
Grok-3-Think-Vision | 62 |
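
To put the text-versus-vision gap in sharper relief, the short Python sketch below pairs each model in the table with its vision-enabled variant (using only the pairs that actually appear in the table) and computes the score drop. The pairing is our own illustration of the published figures, not an analysis provided by the benchmark itself.

```python
# Minimal sketch (not part of the original report): pair each text model in
# the table above with its vision-enabled variant and compute the score drop,
# to make the reported text-to-vision gap concrete.

from statistics import mean

# (text-only score, vision-variant score), copied from the published table
pairs = {
    "OpenAI o3":       (133, 86),  # o3 vs o3-vision
    "Gemini Pro":      (124, 96),  # Gemini Pro vs Gemini Pro (Vision)
    "Claude-4-Opus":   (118, 80),
    "Claude-4-Sonnet": (118, 97),
    "Grok-3-Think":    (111, 62),
    "OpenAI o1-pro":   (102, 79),
    "GPT4 Omni":       (91, 63),   # Verbal vs Vision
}

# Score drop per model when switching to the vision variant.
drops = {name: text - vision for name, (text, vision) in pairs.items()}

for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    text, vision = pairs[name]
    print(f"{name:16s} text {text:>3} -> vision {vision:>3}  (drop {drop})")

print(f"\nAverage drop across paired models: {mean(drops.values()):.1f} IQ points")
```

Every vision counterpart in the table scores lower than its text-only sibling, which is the pattern the ranking discussion above describes.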
H/T: Trackingai.
Read next: Context, Emotion, and Biology: What AI Misses in Language Comprehension