JMIR Preprints #67488: Accuracy of Large Language Models for Literature Screening in Systematic Reviews and meta-analyses: A Diagnostic Study

Accuracy of Large Language Models for Literature Screening in Systematic Reviews and meta-analyses: A Diagnostic Study

Zhang-Yi Dai;
Fu-Qiang Wang;
Cheng Shen;
Yan-Li Ji;
Zhi-Yang Li;
Yun Wang;
Qiang Pu

ABSTRACT

Background:

Systematic reviews and meta-analyses rely on labor-intensive literature screening. While machine learning offers potential automation, its accuracy remains suboptimal. This raises the question of whether emerging large language models (LLMs) can provide a more accurate and efficient approach.

Objective:

To evaluate the sensitivity, specificity, and summary receiver operating characteristic (SROC) curve of LLM-assisted literature screening.

Methods:

We conducted a diagnostic study comparing the accuracy of LLM-assisted versus manual literature screening across six thoracic surgery meta-analyses. Manual screening by two investigators served as the reference standard. LLM-assisted screening was performed using ChatGPT-4o and Claude-3.5 sonnet, with discrepancies resolved by Gemini-1.5 pro. Two open-source, machine learning-based screening tools, ASReview and Abstrackr, were also evaluated. We calculated sensitivity, specificity, and 95% CIs for title/abstract and full-text screening, generating pooled estimates and SROC curves. LLM prompts were revised based on a post hoc error analysis.
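The screening logic and per-review accuracy calculation described above can be illustrated with a minimal Python sketch. It assumes each model returns a simple include/exclude decision per record and that the third model is consulted only when the first two disagree. The function names (`consensus_decision`, `wilson_interval`, `screening_accuracy`) are hypothetical, and the Wilson score interval is an assumption chosen for illustration; the abstract does not state which CI method the authors used, and the pooled estimates and SROC curves would require a meta-analytic model that is not shown here.

```python
from __future__ import annotations

from math import sqrt


def consensus_decision(primary_a: bool, primary_b: bool, tiebreaker: bool) -> bool:
    """Return the include (True) / exclude (False) decision for one record.

    If the two primary screeners agree, their decision stands; otherwise
    the tiebreaker's decision is used (assumed adjudication rule).
    """
    return primary_a if primary_a == primary_b else tiebreaker


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def screening_accuracy(predicted: list[bool], reference: list[bool]) -> dict:
    """Sensitivity and specificity of screening decisions against a reference standard.

    `reference` holds the manual (reference standard) decisions,
    `predicted` the LLM-assisted decisions for the same records.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = sum(p and r for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    fp = sum(p and (not r) for p, r in zip(predicted, reference))
    tn = sum((not p) and (not r) for p, r in zip(predicted, reference))
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }
```

In use, `consensus_decision` would be applied record by record to build the list of LLM-assisted decisions, which `screening_accuracy` then compares with the manual decisions for each meta-analysis.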

Results:

LLM-assisted full-text screening demonstrated high pooled sensitivity (0.87 [95% CI: 0.77–0.99]) and specificity (0.96 [95% CI: 0.91–0.98]), with an AUC of 0.96 (95% CI: 0.94–0.97). Title/abstract screening achieved pooled sensitivity of 0.73 (95% CI: 0.57–0.85) and specificity of 0.99 (95% CI: 0.97–0.99), with an AUC of 0.97 (95% CI: 0.96–0.99). Post hoc revisions improved sensitivity to 0.98 (95% CI: 0.74–1.00) while maintaining high specificity (0.98 [95% CI: 0.94–0.99]). In comparison, the pooled sensitivity and specificity of ASReview tool-assisted screening were 0.58 (95% CI: 0.53–0.64) and 0.97 (95% CI: 0.91–0.99), respectively, with an AUC of 0.66 (95% CI: 0.62–0.70). The pooled sensitivity and specificity of Abstrackr tool-assisted screening were 0.48 (95% CI: 0.35–0.62) and 0.96 (95% CI: 0.88–0.99), respectively, with an AUC of 0.78 (95% CI: 0.74–0.82). A post hoc meta-analysis revealed comparable effect sizes between LLM-assisted and conventional screening.

Conclusions:

LLMs hold significant potential for streamlining literature screening in systematic reviews, reducing workload without sacrificing quality. Importantly, LLMs outperformed traditional machine learning-based tools (ASReview and Abstrackr) in both sensitivity and AUC values, suggesting that LLMs offer a more accurate and efficient approach to literature screening.

Clinical Trial: Not Applicable

Citation

Please cite as:

Dai ZY, Wang FQ, Shen C, Ji YL, Li ZY, Wang Y, Pu Q

Accuracy of Large Language Models for Literature Screening in Thoracic Surgery: Diagnostic Study

J Med Internet Res 2025;27:e67488

DOI: 10.2196/67488

PMID: 40068152

PMCID: 11937709

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 13, 2024

Date Accepted: Feb 20, 2025
