A recent intelligence benchmark has placed today’s most advanced AI models under the same kind of cognitive scrutiny used to assess exceptional human thinkers, and the outcome tells a story of contrasts between raw verbal reasoning and multimodal complexity.
The data comes from the Mensa Norway IQ test, a well-known measure of high-level reasoning, where scores above 130 often mark out genius-level ability. Although the test was designed for people, researchers have begun using it to compare how artificial intelligence systems perform when asked to solve the same kinds of abstract problems humans struggle with.
At the top of the current rankings sits OpenAI’s o3, which scored 133, comfortably above the 130 threshold the test associates with genius-level ability. Not far behind is Gemini Thinking, Google’s language-focused model, which reached 128. These results suggest that, at least in abstract problem solving through words and logic, some AI systems are not just matching human performance but quietly exceeding it.
The upper tier includes OpenAI’s o4-mini with a score of 126, Gemini Pro at 124, and both Claude-4 Opus and Claude-4 Sonnet tied at 118. Even models just below this line, like Grok-3 Think (111), Llama-4 (107), and DeepSeek-R1 (105), are operating within or above the average human range.
But the drop-off begins sharply as models shift from text-only processing to visual capabilities. Systems like Claude-4 Sonnet Vision, GPT-4.5, Grok-3, and DeepSeek-V3, all scoring 97, sit just below the human average of 100. Beneath them, Gemini Pro Vision landed at 96, while GPT-4 Omni (Verbal) trailed at 91, despite its verbal focus.
OpenAI’s o4-mini-high reached 90, but the decline continues. Visual variants such as o3-vision and Bing’s AI scored 86, followed by Mistral (85) and Claude-4 Opus Vision (80). Further down the list, models like OpenAI o1-pro Vision (79) and Llama-3 Vision (70) show a widening gap between multimodal ambition and actual performance on reasoning tasks.
At the lowest end sit GPT-4 Omni Vision and Grok-3 Think Vision, managing only 63 and 62 respectively — scores that, in human terms, would reflect severe limitations in pattern recognition and logic.
What becomes clear through this ranking is that text-based reasoning remains AI’s strong suit. Models trained purely on language continue to outperform their multimodal counterparts when faced with symbol-based puzzles and logic problems. While vision-enabled AIs might be better suited for real-world perception, they appear less capable when reasoning is abstracted from context and stripped to logic alone.
These findings underscore a split in the development arc of artificial intelligence. Verbal models are now working at, and sometimes above, human cognitive levels. But giving machines the ability to “see” doesn’t yet mean they understand, at least not in the ways intelligence is traditionally measured.
Model | Mensa Norway IQ Test Score |
---|---|
OpenAI o3 | 133 |
Gemini Thinking | 128 |
OpenAI o4-mini | 126 |
Gemini Pro | 124 |
Claude-4-Opus | 118 |
Claude-4-Sonnet | 118 |
Grok-3-Think | 111 |
Llama-4 | 107 |
DeepSeek-R1 | 105 |
OpenAI o1-pro | 102 |
Average Human | 100 |
Claude-4-Sonnet-Vision | 97 |
gpt-4.5 | 97 |
deepseek-v3 | 97 |
Grok-3 | 97 |
Gemini Pro (Vision) | 96 |
GPT4 Omni (Verbal) | 91 |
OpenAI o4-mini-high | 90 |
OpenAI o3-vision | 86 |
Bing | 86 |
Mistral | 85 |
Claude-4-Opus-Vision | 80 |
OpenAI o1-pro-vision | 79 |
Llama-3 (Vision) | 70 |
GPT4 Omni (Vision) | 63 |
Grok-3-Think-Vision | 62 |
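
To put the text-versus-vision gap in sharper relief, the short Python sketch below pairs each model in the table with its vision-enabled variant (using only the pairs that actually appear in the table) and computes the score drop. The pairing is our own illustration of the published figures, not an analysis provided by the benchmark itself.

```python
# Minimal sketch (not part of the original report): pair each text model in
# the table above with its vision-enabled variant and compute the score drop,
# to make the reported text-to-vision gap concrete.

from statistics import mean

# (text-only score, vision-variant score), copied from the published table
pairs = {
    "OpenAI o3":       (133, 86),  # o3 vs o3-vision
    "Gemini Pro":      (124, 96),  # Gemini Pro vs Gemini Pro (Vision)
    "Claude-4-Opus":   (118, 80),
    "Claude-4-Sonnet": (118, 97),
    "Grok-3-Think":    (111, 62),
    "OpenAI o1-pro":   (102, 79),
    "GPT4 Omni":       (91, 63),   # Verbal vs Vision
}

# Score drop per model when switching to the vision variant.
drops = {name: text - vision for name, (text, vision) in pairs.items()}

for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    text, vision = pairs[name]
    print(f"{name:16s} text {text:>3} -> vision {vision:>3}  (drop {drop})")

print(f"\nAverage drop across paired models: {mean(drops.values()):.1f} IQ points")
```

Every vision counterpart in the table scores lower than its text-only sibling, which is the pattern the ranking discussion above describes.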
H/T: Trackingai.
Read next: Context, Emotion, and Biology: What AI Misses in Language Comprehension