A new look at GPT-4o shows how easily citation trouble grows when the topic drifts into areas with thin research trails. The warning signs appear clearly in a controlled test where the model produced six mental health literature reviews and filled them with 176 citations. A closer inspection showed that 35 of those references did not exist and many of the remaining ones carried mistakes hidden in the details. The numbers set the tone for a pattern that shifts sharply depending on how familiar the topic is in the research world.
Researchers at Deakin University built the experiment around three disorders. Depression sat at the top of the visibility ladder, followed by binge eating disorder, then body dysmorphic disorder at the bottom. This mix created a natural gradient in research volume. Depression carries decades of trials and thousands of papers. The other two conditions occupy smaller footprints and offer far fewer studies on digital interventions. That uneven landscape became the test bed for the model’s strengths and misses.
Each disorder received two review requests. One prompt asked for a broad overview that covered causes, impacts and treatments. The other request drilled into digital interventions. The team wanted to see how topic familiarity and prompt depth shaped the reliability of the citations. They pulled every reference into a manual check across major academic databases. This process placed each citation into one of three buckets. Either it existed in the real world, it existed but contained errors, or it was fabricated outright.
The headline numbers make the problem easy to see. Out of 176 total citations, 35 were fabricated. Among the 141 real ones, 64 carried errors. Only 77 came through fully accurate. That means more than half of all citations were unusable in scholarly work. DOI problems were the most common kind of error: wrong links, wrong codes, or completely invalid strings made many citations look correct at first glance yet fail when checked against the actual paper.
The pattern became sharper when the team compared the three disorders. Depression showed the lowest fabrication count, with only 4 fake citations out of 68. Binge eating disorder jumped to 17 fabricated citations out of 60. Body dysmorphic disorder followed closely with 14 fabricated citations out of 48. Accuracy among the real citations also depended on the topic. Depression reached 64 percent accuracy, binge eating disorder reached 60 percent, and body dysmorphic disorder fell to 29 percent. The drop shows how the model struggles once the evidence base gets thin enough.
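For anyone who wants to sanity-check the arithmetic, here is a minimal Python sketch built only from the counts reported above. It reproduces the overall unusable share and the per-disorder fabrication rates; the snippet is this article's illustration, not part of the study's own analysis.

```python
# All counts are taken from the study figures summarized above.
total_citations = 176
fabricated = 35
real_with_errors = 64
fully_accurate = 77

unusable = fabricated + real_with_errors
print(f"Unusable: {unusable}/{total_citations} = {unusable / total_citations:.0%}")        # 56%
print(f"Fully accurate: {fully_accurate}/{total_citations} = {fully_accurate / total_citations:.0%}")  # 44%

# Fabricated citations per disorder: (fabricated, total generated)
per_disorder = {
    "depression": (4, 68),
    "binge eating disorder": (17, 60),
    "body dysmorphic disorder": (14, 48),
}
for name, (fake, n) in per_disorder.items():
    print(f"{name}: {fake}/{n} = {fake / n:.0%} fabricated")
# depression ~6%, binge eating disorder ~28%, body dysmorphic disorder ~29%
```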
Prompt specificity also shaped outcomes, though not in a simple way. Binge eating disorder showed the clearest effect. Its specialized review saw fabrication rise to almost half of the citations. The general overview stayed closer to one out of six. Other disorders showed different patterns. Depression’s general overview delivered better accuracy than its specialized review. Body dysmorphic disorder flipped that pattern and showed better accuracy when the prompt narrowed. These differences suggest the model reacts to the structure of the request and the strength of the underlying literature in different ways.
The study’s authors point out how much the model leans on patterns in public information. When the topic sits on a wide and stable base of research, the model has clearer pathways to follow. When the topic shifts to areas with fewer papers or narrower lines of inquiry, the model relies more on guesswork. The results from body dysmorphic disorder show how quickly accuracy collapses when the system tries to piece together references from scattered or limited material.
These findings matter because more researchers have started using large language models to speed up routine tasks. Survey data shows strong adoption among mental health scientists. Many researchers believe these systems help with drafting, coding, and early idea formation. Efficiency gains look promising until the citations fall apart under verification. That creates problems for anyone who trusts the output without checking every reference. A fabricated citation can mislead a research team, distort the evidence trail, and send other scientists searching for sources that were never written.
The study pushes institutions and journals toward simple safeguards. Every AI-generated citation needs to be verified. Every claim tied to those citations needs human confirmation. Editors can screen suspicious references by checking whether they match known publications. When a citation sits outside any recognized record, it becomes a clear red flag. With these checks in place, journals can block fabricated references before they reach print.
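To make that screening step concrete, here is a minimal sketch of one way an editor might automate a first pass, assuming the public Crossref REST API is used to test whether a DOI points to a registered record. The approach is this article's illustration, not a tool described in the study, and DOIs registered with other agencies such as DataCite would not show up here.

```python
import urllib.error
import urllib.request

def doi_is_registered(doi: str) -> bool:
    """Return True if Crossref has a record for this DOI, False if it answers 404."""
    url = f"https://api.crossref.org/works/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Crossref returns 404 for DOIs it does not recognize.
        return False

# Hypothetical DOI string, for illustration only.
if not doi_is_registered("10.1234/example-doi"):
    print("No Crossref record found: flag this citation for manual review.")
```

A check like this only catches DOIs with no registry record at all; a wrong-but-real DOI attached to the wrong paper still requires comparing the returned metadata with the citation by hand.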
The authors also point to the need for stronger guidance at universities and research centers. Training programs can help researchers learn how to identify hallucinations and validate AI-generated content before placing it in a manuscript. As AI tools become part of normal workflows, these checks will keep the academic record from drifting into error.
The results show that reliability is not static. It depends on how well mapped the research terrain is. Well-studied disorders give the model a broader map. Narrower or less familiar topics cut away those supports. For now, the safest way to use these systems in research is to treat their output as a starting point that always needs careful checking. The experiment makes that reality clear.