Which AI Models Answer Most Accurately, and Which Hallucinate Most? New Data Shows Clear Gaps

Recent findings from the European Broadcasting Union show that AI assistants misrepresent news content in 45% of the test cases, regardless of language or region. That result underscores why model accuracy and reliability remain central concerns. Fresh rankings from Artificial Analysis, based on real-world endpoint testing as of 1 December 2025, give a clear picture of how today’s leading systems perform when answering direct questions.

Measuring Accuracy and Hallucination Rates

Artificial Analysis evaluates both proprietary and open-weight models through live API endpoints, so its measurements reflect what users experience in real deployments rather than theoretical performance. Accuracy shows how often a model produces a correct answer. Hallucination rate captures how often it responds incorrectly in cases where it should refuse or indicate uncertainty. Because new models launch frequently and providers adjust their endpoints, these results can change over time, but the current snapshot still reveals clear trends.
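
Artificial Analysis does not publish its scoring pipeline, so the snippet below is only a rough sketch of how metrics defined this way could be computed from graded responses. The grade labels, the function name, and the exact denominator used for the hallucination rate are assumptions made for illustration, not the benchmark's actual code.

```python
from collections import Counter

# Hypothetical per-prompt grades (not Artificial Analysis's real rubric):
#   "correct"   - answered and right
#   "incorrect" - answered and wrong (a hallucination)
#   "refused"   - declined or signalled uncertainty

def score(grades: list[str]) -> dict[str, float]:
    c = Counter(grades)
    total = len(grades)
    missed = c["incorrect"] + c["refused"]  # prompts not answered correctly
    return {
        # Accuracy: share of all prompts answered correctly.
        "accuracy": c["correct"] / total if total else 0.0,
        # Hallucination rate (assumed denominator): when the model does not
        # produce the right answer, how often it answers wrongly instead of
        # refusing or flagging uncertainty.
        "hallucination_rate": c["incorrect"] / missed if missed else 0.0,
    }

# Toy example: 100 graded prompts for one endpoint.
grades = ["correct"] * 40 + ["incorrect"] * 39 + ["refused"] * 21
print(score(grades))  # {'accuracy': 0.4, 'hallucination_rate': 0.65}
```

Read this way, the two metrics divide by different denominators, which is why a model can rank well on accuracy while still posting a high hallucination rate.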

Models With the Highest Hallucination Rates

Hallucination Metrics Expose Deep Reliability Risks in Current AI Assistant Deployments
Model | Hallucination Rate
Claude 4.5 Haiku | 26%
Claude 4.5 Sonnet | 48%
GPT-5.1 (high) | 51%
Claude Opus 4.5 | 58%
Magistral Medium 1.2 | 60%
Grok 4 | 64%
Kimi K2 0905 | 69%
Grok 4.1 Fast | 72%
Kimi K2 Thinking | 74%
Llama Nemotron Super 49B v1.5 | 76%
DeepSeek V3.2 Exp | 81%
DeepSeek R1 0528 | 83%
EXAONE 4.0 32B | 86%
Llama 4 Maverick | 87.58%
Gemini 3 Pro Preview (high) | 87.99%
Gemini 2.5 Flash (Sep) | 88.31%
Gemini 2.5 Pro | 88.57%
MiniMax-M2 | 88.88%
GPT-5.1 | 89.17%
Qwen3 235B A22B 2507 | 89.64%
gpt-oss-120B (high) | 89.96%
GLM-4.6 | 93.09%
gpt-oss-20B (high) | 93.20%

When it comes to hallucination, the gap between models is striking. Claude 4.5 Haiku has the lowest hallucination rate in this group at 26 percent, yet even that relatively low figure means it still answers wrongly in roughly a quarter of the cases where it should decline. Several models climb sharply from there: Claude 4.5 Sonnet reaches 48 percent, GPT-5.1 (high) 51 percent, and Claude Opus 4.5 58 percent. Grok 4 produces an incorrect response instead of backing off 64 percent of the time, and Kimi K2 0905 rises to 69 percent. The spread demonstrates that while some models are relatively restrained, many answer confidently rather than admit uncertainty, making hallucination a major challenge for today's AI systems.

Top Performers in Accuracy

Testing Reveals Limited Accuracy Gains Despite Rapid Deployment of Advanced AI Systems
Model | Accuracy
Gemini 3 Pro Preview (High) | 54%
Claude Opus 4.5 | 43%
Grok 4 | 40%
Gemini 2.5 Pro | 37%
GPT-5.1 (High) | 35%
Claude 4.5 Sonnet | 31%
DeepSeek R1 0528 | 29.28%
Kimi K2 Thinking | 29.23%
GPT-5.1 | 28%
Gemini 2.5 Flash (Sep) | 27%
DeepSeek V3.2 Exp | 27%
GLM-4.6 | 25%
Kimi K2 0905 | 24%
Llama 4 Maverick | 24%
Grok 4.1 Fast | 23.50%
Qwen3 235B A22B 2507 | 22%
MiniMax-M2 | 21%
Magistral Medium 1.2 | 20%
gpt-oss-120B (High) | 20%
Claude 4.5 Haiku | 16%
Llama Nemotron Super 49B v1.5 | 16%
gpt-oss-20B (High) | 15%

Accuracy presents a different picture. Gemini 3 Pro Preview (High) leads the pack at 54 percent, meaning it correctly answers just over half of all questions, followed by Claude Opus 4.5 at 43 percent and Grok 4 at 40 percent. Gemini 2.5 Pro comes next with 37 percent, while GPT-5.1 (High) reaches 35 percent and Claude 4.5 Sonnet 31 percent. The spread highlights that even the top-performing model answers fewer than six out of ten questions correctly, underscoring how hard it remains for these systems to deliver consistently reliable responses across a broad set of prompts.

Clear Trade-offs

The contrast between the hallucination and accuracy tables shows that strong accuracy does not guarantee low hallucination. Gemini 3 Pro Preview tops the accuracy ranking yet posts one of the higher hallucination rates, while Claude 4.5 Haiku sits near the bottom on accuracy but hallucinates least. Other models deliver middling accuracy without reaching the worst hallucination levels. These gaps illustrate how unpredictable model behavior remains, even as systems improve.
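
One way to see the trade-off is to place each model's two published figures side by side. The short sketch below does that for a handful of the models listed above; the numbers are copied from the two tables, and the dictionary layout and print format are only illustrative choices.

```python
# (accuracy %, hallucination rate %) pairs taken from the tables above
# for a few models; the data structure itself is only illustrative.
results = {
    "Gemini 3 Pro Preview (High)": (54.0, 87.99),
    "Claude Opus 4.5": (43.0, 58.0),
    "Grok 4": (40.0, 64.0),
    "GPT-5.1 (High)": (35.0, 51.0),
    "Claude 4.5 Sonnet": (31.0, 48.0),
    "Claude 4.5 Haiku": (16.0, 26.0),
}

# Rank by accuracy and print both metrics together: the most accurate
# models are not the ones that hallucinate least.
for model, (acc, hall) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{model:<28} accuracy {acc:5.1f}%   hallucination {hall:5.2f}%")
```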

Updated on December 11, 2025 to add an AI disclosure and trim two repetitive paragraphs.

Notes: This post was drafted with the assistance of AI tools and reviewed, edited, and published by humans.

