Currently accepted at: Journal of Medical Internet Research

Date Submitted: Dec 1, 2025
Date Accepted: Mar 28, 2026

This paper has been accepted and is currently in production.

It will appear shortly at DOI 10.2196/88766.

This is the final accepted version (not yet copyedited).

Performance of AI tools in citing retracted literature

  • Sebastian Labenbacher
  • Maximilian Niederer
  • Sascha Hammer
  • Matthias Bader
  • Nikolaus Schreiber
  • Helmar Bornemann-Cimenti

ABSTRACT

Background:

Artificial intelligence is increasingly used in scientific research to generate, refine, and summarize literature. Its ability to process large datasets promises greater efficiency in evidence synthesis and review. However, generative AI tools often produce inaccurate results and may cite retracted or unreliable studies without warning, posing risks to research integrity. Whether these systems can reliably detect and exclude retracted publications remains unclear.

Objective:

In this pragmatic trial, nine freely available generative AI tools were tested for their ability to answer questions without citing retracted literature.

Methods:

Each generative AI tool was asked five standardized questions about 15 different retracted articles. The articles were selected from the Retraction Watch database and included both the most cited and the most recently retracted articles. All questions were repeated twice to assess consistency, and answers were rated for accuracy and reliability.
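
To make the rating step concrete, here is a minimal Python sketch of the scoring logic, under stated assumptions: ask is a hypothetical stand-in for one tool's query interface, and the DOI set is an illustrative placeholder for the study's 15 Retraction Watch articles; neither name appears in the paper.

    from typing import Callable

    # Illustrative placeholders: the study drew its 15 retracted articles
    # from the Retraction Watch database; these DOIs are made up.
    RETRACTED_DOIS = {"10.1000/retracted-a", "10.1000/retracted-b"}

    def clean_answer_rate(ask: Callable[[str], set[str]],
                          questions: list[str]) -> float:
        """Fraction of answers that cite no retracted DOI.

        `ask` stands in for one AI tool's interface and returns the set
        of DOIs cited in the tool's answer to a question.
        """
        clean = sum(1 for q in questions if not (ask(q) & RETRACTED_DOIS))
        return clean / len(questions)

    def run_consistency(run1: list[set[str]], run2: list[set[str]]) -> float:
        """Share of questions whose cited-DOI sets match exactly across
        two repeated runs -- a simple consistency measure."""
        return sum(a == b for a, b in zip(run1, run2)) / len(run1)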

Results:

None of the nine AI tools consistently identified or excluded retracted articles. ChatGPT-5 performed best (8/15, 53.3% correct), while SciSpace, ScienceOS, and Consensus produced no fully correct results. Microsoft Copilot achieved the highest topic-overview accuracy (87%), and ChatGPT-4 showed the greatest consistency (97.2%). OpenEvidence performed reliably within the medical literature but reached perfect accuracy in only 2 of 13 (15.4%) cases.

Conclusions:

No free generative AI tool could reliably detect or exclude retracted studies. Even the best systems missed a substantial proportion of retracted articles. Until retraction-aware verification is integrated, independent source checking remains essential to preserving research integrity.

Trial Registration: https://doi.org/10.17605/OSF.IO/B6J2W
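
As a concrete illustration of independent source checking, the following Python sketch queries Crossref's public REST API, whose documented `updates` filter returns editorial notices (corrections, retractions) that target a given DOI. This is not the method used in the study, and Crossref's retraction coverage is incomplete, so results should still be cross-checked against the Retraction Watch database.

    import requests

    def retraction_notices(doi: str) -> list[dict]:
        """Fetch Crossref works that register an editorial update
        (e.g., a retraction notice) against the given DOI."""
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"filter": f"updates:{doi}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["message"]["items"]

    def is_retracted(doi: str) -> bool:
        """True if any notice targeting the DOI is typed as a retraction."""
        for item in retraction_notices(doi):
            for update in item.get("update-to", []):
                if (update.get("type") == "retraction"
                        and update.get("DOI", "").lower() == doi.lower()):
                    return True
        return False

    # Example (hypothetical DOI; substitute one cited by an AI tool):
    # print(is_retracted("10.1000/example"))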


Citation

Please cite as:

Labenbacher S, Niederer M, Hammer S, Bader M, Schreiber N, Bornemann-Cimenti H

Performance of AI tools in citing retracted literature

JMIR Preprints. 01/12/2025:88766

DOI: 10.2196/preprints.88766

URL: https://preprints.jmir.org/preprint/88766


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.