Recent findings from the European Broadcasting Union show that AI assistants misrepresent news content in 45 percent of test cases, regardless of language or region. That result underscores why model accuracy and reliability remain central concerns. Fresh rankings from Artificial Analysis, based on real-world endpoint testing as of December 1, 2025, give a clear picture of how today’s leading systems perform when answering direct questions.
Measuring Accuracy and Hallucination Rates
Artificial Analysis evaluates both proprietary and open-weights models through live API endpoints. Their measurements reflect what users experience in actual deployments rather than theoretical performance. Accuracy shows how often a model produces a correct answer. Hallucination rate captures how often it responds incorrectly when it should refuse or indicate uncertainty. Since new models launch frequently and providers adjust endpoints, these results can change over time, but the current snapshot still reveals clear trends.
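To make those two definitions concrete, here is a minimal sketch of how such scores could be computed from graded responses. The three grading labels, the `score_responses` helper, and the reading of hallucination rate as a share of non-correct answers are illustrative assumptions, not Artificial Analysis's published methodology.

```python
from collections import Counter

def score_responses(graded):
    """Score a list of graded responses.

    Each item carries one of three labels (an assumed grading scheme,
    not Artificial Analysis's actual rubric):
      "correct"   - the model answered correctly
      "incorrect" - the model answered, but wrongly
      "abstained" - the model refused or flagged uncertainty
    """
    counts = Counter(graded)
    total = sum(counts.values())

    # Accuracy: share of all questions answered correctly.
    accuracy = counts["correct"] / total

    # Hallucination rate, as described above: of the questions the model
    # did not get right, how often it answered anyway instead of refusing.
    not_correct = counts["incorrect"] + counts["abstained"]
    hallucination = counts["incorrect"] / not_correct if not_correct else 0.0

    return accuracy, hallucination

# Toy split chosen so the outputs mirror Claude 4.5 Haiku's published
# pairing of 16% accuracy and a 26% hallucination rate under this reading.
acc, hall = score_responses(
    ["correct"] * 16 + ["incorrect"] * 22 + ["abstained"] * 62
)
print(f"accuracy={acc:.0%}, hallucination rate={hall:.0%}")
```

Notably, under this reading a model can lower its hallucination rate simply by refusing more often, which is why the two metrics in the tables below move independently.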
Models With the Highest Hallucination Rates
| Model | Hallucination Rate |
|---|---|
| Claude 4.5 Haiku | 26% |
| Claude 4.5 Sonnet | 48% |
| GPT-5.1 (High) | 51% |
| Claude Opus 4.5 | 58% |
| Magistral Medium 1.2 | 60% |
| Grok 4 | 64% |
| Kimi K2 0905 | 69% |
| Grok 4.1 Fast | 72% |
| Kimi K2 Thinking | 74% |
| Llama Nemotron Super 49B v1.5 | 76% |
| DeepSeek V3.2 Exp | 81% |
| DeepSeek R1 0528 | 83% |
| EXAONE 4.0 32B | 86% |
| Llama 4 Maverick | 87.58% |
| Gemini 3 Pro Preview (High) | 87.99% |
| Gemini 2.5 Flash (Sep) | 88.31% |
| Gemini 2.5 Pro | 88.57% |
| MiniMax-M2 | 88.88% |
| GPT-5.1 | 89.17% |
| Qwen3 235B A22B 2507 | 89.64% |
| gpt-oss-120B (High) | 89.96% |
| GLM-4.6 | 93.09% |
| gpt-oss-20B (High) | 93.20% |
On hallucination, the gap between models is striking. Claude 4.5 Haiku has the lowest rate in the group at 26 percent, yet even that figure means roughly one in four questions it cannot answer correctly draws a confident wrong answer rather than a refusal. Rates climb sharply from there: Claude 4.5 Sonnet reaches 48 percent, GPT-5.1 (High) 51 percent, and Claude Opus 4.5 58 percent. Grok 4 hallucinates 64 percent of the time, and Kimi K2 0905 rises to 69 percent. The spread shows that while some models are relatively restrained, many answer confidently when they should not, making hallucination a major challenge for today’s AI systems.
Top Performers in Accuracy
| Model | Accuracy |
|---|---|
| Gemini 3 Pro Preview (High) | 54% |
| Claude Opus 4.5 | 43% |
| Grok 4 | 40% |
| Gemini 2.5 Pro | 37% |
| GPT-5.1 (High) | 35% |
| Claude 4.5 Sonnet | 31% |
| DeepSeek R1 0528 | 29.28% |
| Kimi K2 Thinking | 29.23% |
| GPT-5.1 | 28% |
| Gemini 2.5 Flash (Sep) | 27% |
| DeepSeek V3.2 Exp | 27% |
| GLM-4.6 | 25% |
| Kimi K2 0905 | 24% |
| Llama 4 Maverick | 24% |
| Grok 4.1 Fast | 23.50% |
| Qwen3 235B A22B 2507 | 22% |
| MiniMax-M2 | 21% |
| Magistral Medium 1.2 | 20% |
| gpt-oss-120B (High) | 20% |
| Claude 4.5 Haiku | 16% |
| Llama Nemotron Super 49B v1.5 | 16% |
| gpt-oss-20B (High) | 15% |
Accuracy presents a different picture. Gemini 3 Pro Preview (High) leads the pack at 54 percent, meaning it correctly answers just over half of all questions, followed by Claude Opus 4.5 at 43 percent and Grok 4 at 40 percent. Gemini 2.5 Pro comes next at 37 percent, while GPT-5.1 (High) reaches 35 percent and Claude 4.5 Sonnet 31 percent. The spread highlights that even the top-performing models answer fewer than six out of ten questions correctly, underscoring how hard it remains for AI to deliver consistently reliable responses across a broad set of prompts.
Clear Trade-offs
The contrast between the hallucination and accuracy tables shows that strong accuracy does not guarantee low hallucination. Some models that rank high on accuracy still produce incorrect answers at significant rates, while others deliver lower accuracy yet avoid the worst hallucination levels. These gaps illustrate how unpredictable model behavior remains, even as systems improve.
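One way to see that independence is to line the two tables up side by side. The short sketch below does this for a handful of entries, using the figures quoted above; sorting by accuracy makes plain that the leaderboard order says little about hallucination.

```python
# Values copied from the two tables above for a handful of models.
hallucination = {
    "Gemini 3 Pro Preview (High)": 87.99,
    "Claude Opus 4.5": 58.0,
    "Grok 4": 64.0,
    "Gemini 2.5 Pro": 88.57,
    "GPT-5.1 (High)": 51.0,
    "Claude 4.5 Haiku": 26.0,
}
accuracy = {
    "Gemini 3 Pro Preview (High)": 54.0,
    "Claude Opus 4.5": 43.0,
    "Grok 4": 40.0,
    "Gemini 2.5 Pro": 37.0,
    "GPT-5.1 (High)": 35.0,
    "Claude 4.5 Haiku": 16.0,
}

# Walk the models from most to least accurate and show both metrics:
# the accuracy ranking clearly does not track the hallucination ranking.
for model in sorted(accuracy, key=accuracy.get, reverse=True):
    print(f"{model:28}  accuracy {accuracy[model]:5.1f}%  "
          f"hallucination {hallucination[model]:5.1f}%")
```

The printout mirrors the point above: Gemini 3 Pro Preview tops accuracy yet carries one of the highest hallucination rates, while Claude 4.5 Haiku sits last on accuracy but hallucinates the least.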
Updated on December 11, 2025 to add an AI disclosure and trim two repetitive paragraphs.
Notes: This post was drafted with the assistance of AI tools and reviewed, edited, and published by humans.
Read next: ChatGPT Doubles Usage as Google Gemini Reaches 40 Percent

