
Accepted for/Published in: JMIR Mental Health

Date Submitted: Jul 9, 2025
Date Accepted: Sep 15, 2025

The final, peer-reviewed published version of this preprint can be found here:

Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study

Linardon J, Jarman HK, McClure Z, Anderson C, Liu C, Messer M

JMIR Ment Health 2025;12:e80371

DOI: 10.2196/80371

PMID: 41223407

PMCID: 12658395

Citation Fabrication by Large Language Models in Mental Health Research: Experimental Study of Topic Familiarity and Prompt Specificity

  • Jake Linardon; 
  • Hannah K Jarman; 
  • Zoe McClure; 
  • Cleo Anderson; 
  • Claudia Liu; 
  • Mariel Messer

ABSTRACT

Background:

Mental health researchers are increasingly using large language models (LLMs) to improve efficiency, yet these tools can generate fabricated but plausible-sounding content (“hallucinations”). A notable form of hallucination involves fabricated bibliographic citations that cannot be traced to real publications. Although prior studies have explored citation fabrication across disciplines, it remains unclear whether citation accuracy in LLM output systematically varies across topics within the same field that differ in public visibility, scientific maturity, and specialization.

Objective:

This study examined citation fabrication and accuracy in ChatGPT-4o (Omni) outputs by varying prompts to reflect topic areas within mental health that differ in public awareness and the depth of existing scientific literature.

Methods:

GPT-4o was prompted to generate six literature reviews on three mental disorders that vary in public awareness and scientific maturity (major depressive disorder, binge-eating disorder, and body dysmorphic disorder), at two levels of specificity (general overview vs. efficacy of digital interventions). Prompts were standardized except for the target disorder. All citations were extracted and verified. Citations were classified as fabricated if no matching source could be identified. Fabrication and accuracy rates were analyzed descriptively and compared across disorders and prompt types using chi-square tests.
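The chi-square comparison described above can be sketched in plain Python. The contingency counts below are hypothetical placeholders for illustration only, not the study's actual data; the function itself is a standard Pearson chi-squared statistic for an r × c table.

```python
# Minimal sketch of a chi-squared test of independence, as used in Methods
# to compare fabrication rates across disorders. Counts are HYPOTHETICAL.

def chi_square(table):
    """Return (chi-squared statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: disorders; columns: [fabricated, real] citation counts (hypothetical)
counts = [
    [3, 49],   # e.g., major depressive disorder
    [17, 43],  # e.g., binge-eating disorder
    [15, 49],  # e.g., body dysmorphic disorder
]

stat, df = chi_square(counts)
print(f"chi-squared = {stat:.2f}, df = {df}")
```

The resulting statistic would be compared against the chi-squared distribution with 2 degrees of freedom to obtain the P value; in practice a library routine such as SciPy's `chi2_contingency` would typically be used.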

Results:

Thirty-five (19.9%) of the 176 citations were fabricated. Of the 141 real citations, 64 (45.4%) contained errors, most commonly involving the digital object identifier. Fabrication rates differed significantly by disorder (χ² = 13.65 [df = 2], P = .001), with higher rates for binge-eating disorder (28.3%) and body dysmorphic disorder (29.2%) than for major depressive disorder (5.8%). While review specificity did not significantly affect fabrication overall (χ² = 1.57 [df = 1], P = .209), stratified analyses showed a higher fabrication rate in specialized (45.8%) vs. general (16.7%) reviews for binge-eating disorder. Citation accuracy was lowest for body dysmorphic disorder, particularly in general reviews, and highest for major depressive disorder.

Conclusions:

Citation fabrication and error rates in GPT-4o outputs varied by topic familiarity and prompt specificity, with more accurate citations observed for widely studied, publicly recognized disorders. The present findings highlight the importance of careful prompt design and human oversight when using LLMs for scholarly work.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.