OpenAI’s Latest o3 and o4-mini AI Models Disappoint Due to More Hallucinations than Older Models

AI models have long struggled with hallucinations, and many would expect newer systems to steer clear of such problems. However, that's not the case with OpenAI's newly launched o3 and o4-mini AI models.

They've been accused of hallucinating more than older variants, a significant setback for the systems. The fact that they're making up information and presenting it as reality has people talking.

Every new model is expected to improve on its predecessor, or at least hallucinate less, so seeing that pattern break in the latest lineup is worrisome. According to the company's own internal tests, the new reasoning models hallucinate more frequently than previous reasoning models as well as non-reasoning models like GPT-4o.

What is even more concerning is that OpenAI itself doesn't fully understand what's going on or what the solution might be. The company's internal report says more research is needed to determine why hallucinations are getting worse as its models become more advanced. While o3 and o4-mini do perform better at other tasks such as math and coding, this negative aspect cannot be overlooked.

The company found that o3 hallucinated in 33% of its replies on PersonQA, the firm's own benchmark for measuring how accurately its models answer questions about people. That's nearly double the hallucination rate observed in earlier models.

Third-party testing carried out by Transluce, a nonprofit AI research lab, found that o3 tends to make up actions it supposedly took while reaching an answer. In other words, it fabricates accounts of steps it never performed and tools it cannot even access, which again is alarming.

The rate at which o3 hallucinates means it could amplify issues that are usually ironed out during training, leaving the model less useful than it's intended to be.

While some experts do call the model a step above the competition, the fact that it even hallucinates broken website links is the worrisome part: it offers up links that simply don't work when clicked.

These hallucinations can help models reach interesting conclusions and creative ideas in their thinking process. At the same time, they make the models a tough sell for firms where accuracy is essential. For instance, no law firm would be keen on using a model that inserts factual errors into contracts served to clients.

One technique that could help boost the accuracy of these models is giving them web search capabilities. GPT-4o with web search attains 90% accuracy on one of OpenAI's accuracy benchmarks, so the same approach could help the reasoning models too, at least where users are willing to expose their prompts to a third-party search provider.
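To illustrate the general idea only, and not OpenAI's actual search integration, here is a minimal sketch of search-grounded prompting. The search_web and ask_model helpers are hypothetical stand-ins for whichever search provider and model API a developer actually uses.

```python
# Minimal sketch of search-grounded prompting.
# search_web and ask_model are hypothetical placeholders, not real APIs.

def search_web(query: str, max_results: int = 3) -> list[str]:
    """Placeholder for a third-party search provider; returns text snippets."""
    return [f"[snippet {i} for: {query}]" for i in range(1, max_results + 1)]

def ask_model(prompt: str) -> str:
    """Placeholder for a call to whatever language model is being used."""
    return f"(model answer based on the provided sources for: {prompt[:40]}...)"

def grounded_answer(question: str) -> str:
    """Fetch search snippets first, then ask the model to answer only from them."""
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return ask_model(prompt)

if __name__ == "__main__":
    print(grounded_answer("When were OpenAI's o3 and o4-mini models released?"))
```

The point of the pattern is simply that the model is asked to answer from retrieved sources rather than from memory alone, which is why it can reduce made-up answers; the trade-off, as noted above, is that the user's prompt is shared with the search provider.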

OpenAI did share a statement on this front, saying that addressing hallucinations across all of its models is an ongoing area of research and that it is continually working to make them more accurate and reliable.


Image: DIW-Aigen

Read next: 

• Marketers Embrace AI: 85% Use Tools, 70% See Efficiency Gains, and 50% Save Over 5 Hours Weekly

• Is Meta’s Threads the Underdog Ready to Outpace Elon’s X, or Will Bluesky Steal the Spotlight?