Currently accepted at: Journal of Medical Internet Research

Date Submitted: Dec 1, 2025
Date Accepted: Mar 28, 2026

This paper has been accepted and is currently in production.

It will appear shortly at DOI 10.2196/88766.

This is the final accepted version (not yet copyedited).

Performance of AI tools in citing retracted literature

  • Sebastian Labenbacher
  • Maximilian Niederer
  • Sascha Hammer
  • Matthias Bader
  • Nikolaus Schreiber
  • Helmar Bornemann-Cimenti

ABSTRACT

Background:

Artificial intelligence is increasingly used in scientific research to generate, refine, and summarize literature. Its ability to process large datasets promises greater efficiency in evidence synthesis and review. However, generative AI tools often produce inaccurate results and may cite retracted or unreliable studies without warning, posing risks to research integrity. Whether these systems can reliably detect and exclude retracted publications remains unclear.

Objective:

In this pragmatic trial, nine freely available generative AI tools were tested for their ability to answer questions without citing retracted literature.

Methods:

Each generative AI tool was asked five standardized questions about 15 different retracted articles. The articles were selected from the Retraction Watch database and included both the most cited and the most recently retracted articles. All questions were repeated twice to assess consistency, and answers were rated for accuracy and reliability.
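
To make the rating step concrete, here is a minimal Python sketch of the scoring logic, under stated assumptions: ask is a hypothetical stand-in for one tool's query interface, and the DOI set is an illustrative placeholder for the study's 15 Retraction Watch articles; neither name appears in the paper.

    from typing import Callable

    # Illustrative placeholders: the study drew its 15 retracted articles
    # from the Retraction Watch database; these DOIs are made up.
    RETRACTED_DOIS = {"10.1000/retracted-a", "10.1000/retracted-b"}

    def clean_answer_rate(ask: Callable[[str], set[str]],
                          questions: list[str]) -> float:
        """Fraction of answers that cite no retracted DOI.

        `ask` stands in for one AI tool's interface and returns the set
        of DOIs cited in the tool's answer to a question.
        """
        clean = sum(1 for q in questions if not (ask(q) & RETRACTED_DOIS))
        return clean / len(questions)

    def run_consistency(run1: list[set[str]], run2: list[set[str]]) -> float:
        """Share of questions whose cited-DOI sets match exactly across
        two repeated runs -- a simple consistency measure."""
        return sum(a == b for a, b in zip(run1, run2)) / len(run1)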

Results:

None of the nine AI tools consistently identified or excluded retracted articles. ChatGPT-5 performed best (8/15, 53.3% correct), while SciSpace, ScienceOS, and Consensus produced no fully correct results. Microsoft Copilot achieved the highest topic-overview accuracy (87%), and ChatGPT-4 showed the greatest consistency (97.2%). OpenEvidence performed reliably within the medical literature but reached perfect accuracy in only 2 of 13 (15.4%) cases.

Conclusions:

No free generative AI tool could reliably detect or exclude retracted studies. Even the best systems missed a substantial proportion of retracted articles. Until retraction-aware verification is integrated, independent source checking remains essential to preserving research integrity.

Trial Registration: https://doi.org/10.17605/OSF.IO/B6J2W
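
As a concrete illustration of independent source checking, the following Python sketch queries Crossref's public REST API, whose documented `updates` filter returns editorial notices (corrections, retractions) that target a given DOI. This is not the method used in the study, and Crossref's retraction coverage is incomplete, so results should still be cross-checked against the Retraction Watch database.

    import requests

    def retraction_notices(doi: str) -> list[dict]:
        """Fetch Crossref works that register an editorial update
        (e.g., a retraction notice) against the given DOI."""
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"filter": f"updates:{doi}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["message"]["items"]

    def is_retracted(doi: str) -> bool:
        """True if any notice targeting the DOI is typed as a retraction."""
        for item in retraction_notices(doi):
            for update in item.get("update-to", []):
                if (update.get("type") == "retraction"
                        and update.get("DOI", "").lower() == doi.lower()):
                    return True
        return False

    # Example (hypothetical DOI; substitute one cited by an AI tool):
    # print(is_retracted("10.1000/example"))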


Citation

Please cite as:

Labenbacher S, Niederer M, Hammer S, Bader M, Schreiber N, Bornemann-Cimenti H

Performance of AI tools in citing retracted literature

JMIR Preprints. 01/12/2025:88766

DOI: 10.2196/preprints.88766

URL: https://preprints.jmir.org/preprint/88766


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.