Artificial intelligence tools designed to simplify scientific literature are increasingly being used by researchers, writers, and curious readers alike. But a recent investigation has raised concerns that these systems may be introducing serious distortions rather than delivering clarity.
In a peer-reviewed study published in Royal Society Open Science, a group of researchers analyzed how today’s leading language models interpret and rewrite complex scientific texts. Their findings were troubling: models widely praised for summarization, including well-known systems such as ChatGPT, Claude, DeepSeek, and LLaMA, frequently altered the meaning of the original studies. Instead of simply translating dense technical material into digestible summaries, they often injected unwarranted certainty or broadened narrow findings into sweeping claims.
The research, led by Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University and the University of Cambridge, came in response to a growing dependence on AI for science communication. Their team wanted to find out whether these automated summaries help or hinder accurate understanding, especially when shared beyond academic audiences.
To explore that question, the researchers assembled a large set of scientific material—including 200 research abstracts and 100 full-length articles—from top medical and scientific journals such as Nature, Science, The Lancet, and The New England Journal of Medicine. Using ten major language models, including GPT-4 Turbo, Claude 3.7 Sonnet, ChatGPT-4o, and DeepSeek, they generated nearly 5,000 summaries and examined how faithfully each version reflected the original evidence.
What they discovered was a consistent pattern. In many cases, the AI-produced summaries shifted the tone and meaning of the original texts. Instead of reporting that a particular drug “showed potential benefits in some patients,” the summary might claim that the drug “improves outcomes,” implying broader effectiveness than the study had shown. These overstatements were especially common among the newest generation of AI models.
Even more surprisingly, the problem got worse when the models were told to be careful. Prompts asking the systems to “avoid errors” or “summarize accurately” tended to backfire. Instead of improving, the models produced more assertive, generalized conclusions. The researchers suggested that such prompts might unintentionally steer the models toward sounding more authoritative, even if that means ignoring nuance or uncertainty in the original material.
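To make that comparison concrete, here is a minimal sketch of how the two prompt styles might be tested against a public chat API. The prompt wording, the model name, and the use of the OpenAI Python client are illustrative assumptions rather than the study's actual protocol.

```python
from openai import OpenAI  # assumes the official OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

abstract = "..."  # placeholder: the abstract of the study being summarized

# Two prompt styles: a plain request versus one that stresses accuracy.
# The study reported that accuracy-stressing prompts tended to produce
# more overgeneralized summaries, not fewer.
prompts = {
    "neutral": f"Summarize this abstract for a general audience:\n\n{abstract}",
    "cautious": f"Summarize this abstract accurately and avoid errors:\n\n{abstract}",
}

for label, prompt in prompts.items():
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} prompt ---")
    print(response.choices[0].message.content)
```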
Older models, including GPT-3.5 and earlier versions of Claude, showed fewer of these issues. Their summaries tended to stay closer to the original text in both tone and scope. By contrast, newer systems like ChatGPT-4o and LLaMA 3.3 were up to 73% more likely to introduce overgeneralizations.
The study also compared machine-written summaries with those crafted by humans. Using expert-written entries from NEJM Journal Watch as a benchmark, the team found that AI-generated summaries were almost five times more likely to misrepresent the original findings. The human versions, written by professionals trained to interpret scientific nuance, kept conclusions grounded in the data.
The team went a step further, exploring how different technical settings influenced the output. When the models were accessed through an API with the “temperature” parameter set to zero, a setting that limits randomness, the risk of distortion dropped significantly. But most users interact with chatbots through public-facing apps that do not expose such settings, so this safeguard is not available to everyone.
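For readers who do have API access, pinning the temperature is a small change to a call like the one sketched earlier; again, the client and model name are assumptions chosen for illustration, not a detail from the paper.

```python
from openai import OpenAI  # same illustrative client as in the earlier sketch

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",   # illustrative model choice
    temperature=0,    # suppresses sampling randomness, the setting the study linked to fewer distortions
    messages=[{"role": "user", "content": "Summarize this abstract: ..."}],
)
print(response.choices[0].message.content)
```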
Importantly, the study clarified that not all simplifications are harmful. In some cases, turning technical jargon into clear language helps non-experts grasp essential ideas. But when simplification becomes exaggeration—especially in areas like medicine—it creates real risks. Misinterpreted findings can mislead the public, influence policy debates, or affect healthcare decisions.
While the focus of the research was on overgeneralization, the authors also acknowledged that the reverse can happen: models may understate findings, softening clear conclusions into vague summaries. In this analysis, however, that kind of error occurred far less frequently.
In closing, the authors encouraged developers and users to rethink how they deploy AI for summarizing science. Strategies like avoiding overly cautious prompts, opting for more conservative models, and allowing better customization of technical parameters might reduce these issues. Still, the researchers noted that more work is needed, especially across scientific fields beyond medicine.
Their study—titled Generalization Bias in Large Language Model Summarization of Scientific Research—offers one of the most detailed evaluations yet of how AI interacts with the boundaries of scientific evidence. It’s a wake-up call for a world increasingly looking to machines for understanding, not just information.