Concerns Raised as OpenAI’s o3 AI Model Shows Major Discrepancy Between First- and Third-Party Benchmark Results

Many people are raising questions about OpenAI’s o3 AI model after serious discrepancies emerged between its first-party and third-party benchmark results.

The model was first announced in December last year, and at the time the AI giant claimed it could answer a little more than a fourth of the questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next best model could answer only about 2% of FrontierMath problems correctly.

Today, the released model delivers less than what the company first advertised, and the AI giant knows it. The scores above 25% were reached only with aggressive test-time compute settings that the public version does not use.

That figure was most likely an upper bound, meaning it was attained by a version of o3 with more compute behind it than the model OpenAI rolled out last week.

According to Epoch AI, the organization behind FrontierMath, o3 scored only around 10% in its independent tests, well below the figure OpenAI claimed. The company rolled out o3 as its most capable reasoning model to date, alongside o4-mini, so these are surprising results.

Now, does this mean the makers of ChatGPT blatantly lied? Not exactly: the benchmark results the firm published in December also included a lower-bound score that matches the one Epoch reported. Epoch also noted that its testing setup differs from OpenAI’s and that it used a newer release of FrontierMath for its tests.

The difference between the results may come down to OpenAI evaluating the model with a more powerful internal scaffold, or to Epoch AI running its tests on a different subset of FrontierMath.

Meanwhile, another post on X pointed out that the benchmarked version was a pre-release variant of o3 that differs from the product now in use, which could also account for some of the discrepancy.

So far, all the released o3 compute tiers are smaller in scale than the variant that was benchmarked, and larger compute tiers are expected to earn higher benchmark scores. Re-testing o3 would take a day or more, since today’s release is quite different, which is why the previously reported results are now being relabeled as a preview.

Remember, o3 is currently in production and more optimized for real-world use, which could again give rise to disparities. The company says it makes such optimizations to keep the model affordable and useful overall. Still, it hopes the model outperforms its predecessors while sparing users the long wait times of past models.

Given that the public release of o3 falls well short of OpenAI’s testing promises, experts say AI benchmarks should not be taken at face value, especially when the numbers come from the company selling the model. Benchmarking controversies like this are also becoming more common as vendors race to make headlines with the newest, most capable models.


Image: DIW-Aigen
