AI chatbots often claim confidence in their answers, even when those answers turn out wrong. A two-year study from researchers at Carnegie Mellon University examined how four leading language models performed when asked to judge their own accuracy. The research team compared them with human participants across different tasks involving predictions, knowledge, and image recognition.
The researchers asked each model and each person to give answers and then report how confident they were, both before and after the task. The tasks included NFL game outcomes, Oscar winners, a Pictionary-style guessing game, general trivia, and questions about university life. Although humans and chatbots both made confident guesses, people adjusted their expectations when they got things wrong. The AI systems did not. Some became more confident even after poor results.
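To make that comparison concrete, here is a minimal, hypothetical sketch in Python (not code from the study) of how a pre-task estimate, a post-task estimate, and an actual score could be turned into simple overconfidence and revision measures. The field names and numbers below are illustrative assumptions, not the researchers' data or method.

```python
from dataclasses import dataclass

@dataclass
class TrialResult:
    """One participant (human or chatbot) on one task, with illustrative fields."""
    predicted_correct: float   # pre-task estimate of items they will get right
    estimated_correct: float   # post-task estimate of items they got right
    actual_correct: float      # items actually answered correctly
    total_items: int

    def overconfidence_before(self) -> float:
        """Pre-task estimate minus actual score; positive means overconfident."""
        return self.predicted_correct - self.actual_correct

    def overconfidence_after(self) -> float:
        """Post-task estimate minus actual score; positive means overconfident."""
        return self.estimated_correct - self.actual_correct

    def confidence_shift(self) -> float:
        """How much the estimate moved after doing the task (negative = revised down)."""
        return self.estimated_correct - self.predicted_correct


# Illustrative values only: an overconfident model that barely revises its
# estimate, versus a human who revises toward their actual result.
model = TrialResult(predicted_correct=15, estimated_correct=14, actual_correct=1, total_items=20)
human = TrialResult(predicted_correct=12, estimated_correct=9, actual_correct=8, total_items=20)

for name, r in [("model", model), ("human", human)]:
    print(f"{name}: before={r.overconfidence_before():+.1f}, "
          f"after={r.overconfidence_after():+.1f}, shift={r.confidence_shift():+.1f}")
```

On these made-up numbers, the model stays badly overconfident after the task while the human's post-task estimate lands close to reality, which is the kind of gap the study's feedback comparison is meant to surface.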
In the football and Oscar tasks, the chatbots did reasonably well. ChatGPT, for example, predicted game outcomes with slightly better calibration than human participants. Gemini, while accurate on Oscar picks, failed to match its confidence to its real results. Bard showed marginal overconfidence across both tasks.
When tested on identifying hand-drawn images, ChatGPT correctly interpreted around twelve sketches out of twenty. Gemini, by contrast, averaged less than one correct answer. Yet it believed it had guessed more than fourteen correctly, and even after the task it raised its estimated score. This pointed to a lack of self-monitoring. Human participants, by comparison, adjusted their estimates slightly and came closer to their actual performance.
The difference appeared more clearly in how participants handled feedback. Humans tended to shift their expectations after seeing how they performed. The chatbots did not. In some cases, their confidence increased regardless of performance. This pattern was more pronounced in visual and subjective tasks than in text-based ones.
The researchers found that Sonnet made more cautious predictions than the others. In trivia rounds, Sonnet often underestimated its ability, which made its confidence align better with its actual results. Haiku showed moderate task performance, but its confidence levels did not always match accuracy.
Across all tasks, humans showed more signs of learning from feedback. They improved their confidence ratings after experience. The language models lacked this adjustment. While they could express confidence, they did not revise their estimates in response to their own mistakes. This limited their ability to track their own reliability.
The study covered both aleatory tasks (where outcomes depend on chance and can’t be known in advance) and epistemic ones (where a correct answer exists but may not be known). In both types, chatbots struggled with metacognition. They often produced output with strong confidence, but that confidence did not reflect accuracy. Even when they failed, their estimates stayed high or rose further.
Each chatbot handled tasks differently. Some models performed well but expressed mismatched confidence. Others performed poorly and still reported high certainty. The contrast between performance and confidence was most visible in Gemini’s image recognition trial, where it performed the worst and yet remained the most sure of itself.
For users, the study highlights a key point. AI systems may appear confident, but that confidence often lacks internal correction. Without better self-monitoring, their certainty cannot be taken at face value. Users should approach AI-generated answers with caution, especially when accuracy matters.
The researchers suggest that AI models might learn to calibrate confidence more effectively if trained with more extensive feedback on their own performance. Until then, the gap between what these systems say and how well they perform remains a concern. People can often recognize uncertainty in others through behavior or hesitation. AI lacks those cues, and without clear signals, its confidence can be misleading.
The findings show that AI models can match human performance in some areas, but they still fall short in tracking how well they understand the task. This limitation affects how much trust people should place in chatbot responses, especially in unfamiliar or complex situations.
Notes: This post was edited/created using GenAI tools. Image: DIW-Aigen.
Read next: Google's AI Overviews Reduce Engagement With Traditional Links, Pew Data Shows
