Study Shows Human Behavior Undermines AI’s Medical Accuracy Outside Test Settings

AI tools like GPT-4 have been making headlines for passing medical exams and even outperforming licensed doctors on test questions. But new research from the University of Oxford suggests that while AI shines under exam conditions, it often stumbles when actual people rely on it for real health decisions.

A Big Gap Between Test Scores and Real Use

When asked directly, GPT-4 could identify the right diagnosis nearly 95% of the time. But when everyday people tried to use the same tools to figure out what was wrong with them, the success rate dropped to just under 35%. Oddly enough, people who didn’t use AI at all were more accurate: they were about 76% more likely to name the correct condition than those using the AI.

How the Study Worked

Oxford researchers brought in 1,298 people to play the role of patients. Each person was given a short medical scenario: a story with symptoms, personal background, and sometimes misleading details. Their task was to decide what might be wrong and what level of care to seek, ranging from home remedies to calling an ambulance.
Participants could use one of three AI models: GPT-4o, Llama 3, or Command R+. A panel of doctors had already decided on the correct diagnosis and action plan for each case. One example involved a student who got a sudden, intense headache while out with friends. The right call was a brain scan: he was having a type of brain bleed.
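For readers curious how answers in a setup like this can be checked against the doctors’ answer key, here is a rough Python sketch. The data structures, field names, and example answers are our own illustration under assumed conventions, not the study’s actual materials or scoring code.

```python
# Hypothetical sketch of scoring participant answers against a doctor-defined
# gold standard (diagnosis plus level of care). All names and data below are
# illustrative, not taken from the Oxford study.
from dataclasses import dataclass


@dataclass
class GoldStandard:
    condition: str   # diagnosis the doctor panel agreed on
    care_level: str  # e.g. "home remedies", "see a GP", "call an ambulance"


@dataclass
class ParticipantAnswer:
    condition: str
    care_level: str


def score(answers: list[ParticipantAnswer], gold: GoldStandard) -> dict:
    """Return the share of participants matching the gold diagnosis and triage level."""
    n = len(answers)
    right_condition = sum(a.condition.lower() == gold.condition.lower() for a in answers)
    right_care = sum(a.care_level.lower() == gold.care_level.lower() for a in answers)
    return {
        "condition_accuracy": right_condition / n,
        "care_level_accuracy": right_care / n,
    }


# Example: the headache vignette described above, with made-up participant answers.
gold = GoldStandard(condition="brain bleed", care_level="go to hospital for a scan")
answers = [
    ParticipantAnswer("migraine", "home remedies"),
    ParticipantAnswer("brain bleed", "go to hospital for a scan"),
]
print(score(answers, gold))
```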

Where Things Went Off Track

When people used the AI tools, many left out important details. Others misunderstood what the AI told them or ignored it altogether. In one case, a person with symptoms of gallstones said they had severe stomach pain after eating takeout but didn’t explain where the pain was or how often it happened. The AI assumed it was indigestion, and the person agreed.
Even when the AI offered helpful information, users didn’t always act on it. GPT-4o brought up a correct diagnosis in about two-thirds of cases, but fewer than 35% of users included that condition in their final decision.

How Human Behavior Changes the Outcome

Experts say this result isn’t shocking. AI needs clear, detailed input to do its job well. But someone who feels sick or panicked often can’t explain their symptoms clearly. Unlike trained doctors who know how to ask the right follow-up questions, an AI can only respond to what it's told.

Also, trust plays a role. People might not believe the AI’s advice or fully understand what it says. These human factors can limit how useful AI is in real life.

Why Test Scores Can Be Misleading

One lesson from the study is that high scores on standard tests don’t mean a model is ready for the real world. Most of these exams are made for humans, not machines. They don’t test how well an AI handles unclear input, emotional responses, or vague wording.

Think of a chatbot trained to answer customer service questions. It might do well on practice quizzes, but struggle with real users who type casually or express frustration. Without live testing with real people, those perfect scores don’t mean much.

AI Talking to AI Isn’t the Same

Oxford researchers also tried letting one AI act as the patient while another gave the advice. These AI-to-AI conversations did better: about 61% of the time, the “patient” AI arrived at the right problem. But this success is a bit of a trick. It shows that AI tools work well with each other, not necessarily with humans.
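To make that setup concrete, here is a minimal sketch of how a simulated-patient loop of this kind could be wired up. It assumes the OpenAI Python SDK and uses an illustrative vignette, prompts, and turn count of our own; it is not the researchers’ actual harness.

```python
# Minimal AI-to-AI "simulated patient" sketch: one model role-plays the patient
# from a vignette, another plays the triage assistant. Requires the OpenAI
# Python SDK (pip install openai) and an OPENAI_API_KEY in the environment.
# Vignette text and prompts are illustrative, not the study's materials.
from openai import OpenAI

client = OpenAI()

VIGNETTE = ("You are a 20-year-old student who developed a sudden, severe "
            "headache while out with friends. Describe your symptoms naturally; "
            "do not name any diagnosis yourself.")


def chat(system: str, messages: list[dict]) -> str:
    """One completion call with a role-defining system prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system}] + messages,
    )
    return resp.choices[0].message.content


def simulate(turns: int = 3) -> str:
    # The patient's view of the chat: the assistant's questions arrive as
    # "user" messages, and the patient's own replies are "assistant" messages.
    patient_view = [{"role": "user", "content": "Hi, what brings you in today?"}]
    assistant_view: list[dict] = []

    for _ in range(turns):
        patient_msg = chat(VIGNETTE, patient_view)
        patient_view.append({"role": "assistant", "content": patient_msg})
        assistant_view.append({"role": "user", "content": patient_msg})

        advice = chat("You are a careful medical triage assistant. "
                      "Ask follow-up questions and suggest a level of care.",
                      assistant_view)
        assistant_view.append({"role": "assistant", "content": advice})
        patient_view.append({"role": "user", "content": advice})

    # Finally, ask the "patient" model what it now believes is wrong.
    return chat(VIGNETTE, patient_view + [{"role": "user", "content":
                "Based on this conversation, what condition do you think you "
                "have, and what level of care would you seek?"}])


if __name__ == "__main__":
    print(simulate())
```

The key point the sketch illustrates is that both sides of the conversation are models, so the “patient” always supplies clean, complete, on-topic answers, which is exactly what real users in the study often did not do.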

It’s Not the User’s Fault

Some might think users are to blame for the AI failures. But user experience experts say the real problem lies in design. If people can’t get the right help, it’s a sign the system isn’t built to match how people think or behave.

The study offers a clear warning: strong performance in a quiet lab doesn’t equal success in the messiness of real life. For any AI meant to work with people, testing with people is essential. Otherwise, we risk building smart tools that fall flat when it matters most.

Image: DIW-Aigen
