Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 13, 2024
Date Accepted: Feb 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Accuracy of Large Language Models for Literature Screening in Thoracic Surgery: Diagnostic Study

Dai ZY, Wang FQ, Shen C, Ji YL, Li ZY, Wang Y, Pu Q

Accuracy of Large Language Models for Literature Screening in Thoracic Surgery: Diagnostic Study

J Med Internet Res 2025;27:e67488

DOI: 10.2196/67488

PMID: 40068152

PMCID: 11937709

Accuracy of Large Language Models for Literature Screening in Systematic Reviews and Meta-Analyses: A Diagnostic Study

  • Zhang-Yi Dai
  • Fu-Qiang Wang
  • Cheng Shen
  • Yan-Li Ji
  • Zhi-Yang Li
  • Yun Wang
  • Qiang Pu

ABSTRACT

Background:

Systematic reviews and meta-analyses rely on labor-intensive literature screening. While machine learning offers potential automation, its accuracy remains suboptimal. This raises the question of whether emerging large language models (LLMs) can provide a more accurate and efficient approach.

Objective:

To evaluate the sensitivity, specificity, and summary receiver operating characteristic (SROC) curve of LLM-assisted literature screening.

Methods:

We conducted a diagnostic study comparing the accuracy of LLM-assisted versus manual literature screening across six thoracic surgery meta-analyses. Manual screening by two investigators served as the reference standard. LLM-assisted screening was performed using ChatGPT-4o and Claude 3.5 Sonnet, with discrepancies between the two resolved by Gemini 1.5 Pro. Two open-source, machine learning-based screening tools, ASReview and Abstrackr, were also evaluated. We calculated sensitivity, specificity, and 95% CIs for title/abstract and full-text screening, and generated pooled estimates and SROC curves. LLM prompts were revised based on a post hoc error analysis.
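
The screening and accuracy steps described above can be illustrated with a minimal Python sketch. It is not the authors' code. Two points are assumptions rather than details from the abstract: the third model's decision is taken as final whenever the two primary models disagree, and the 95% CIs use the Wilson score interval. All function names are hypothetical, and the pooled estimates and SROC curves are not attempted here.

```python
from math import sqrt


def consensus_decision(first: bool, second: bool, arbiter: bool) -> bool:
    """Include/exclude decision for one record (True = include).

    Assumption: if the two primary screeners agree, their decision stands;
    otherwise the arbiter's decision is used.
    """
    return first if first == second else arbiter


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed CI method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def screening_accuracy(predicted: list[bool], reference: list[bool]) -> dict:
    """Sensitivity and specificity of predicted decisions against the
    reference standard (here, the manual screening decisions)."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = sum(p and r for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    tn = sum((not p) and (not r) for p, r in zip(predicted, reference))
    fp = sum(p and (not r) for p, r in zip(predicted, reference))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both included and excluded records")
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }
```

With made-up decisions for five records, `screening_accuracy([True, True, False, False, False], [True, True, True, False, False])` counts two true positives, one false negative, and two true negatives, so sensitivity is 2/3 and specificity is 1.0.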

Results:

LLM-assisted full-text screening achieved a pooled sensitivity of 0.87 (95% CI: 0.77–0.99) and a pooled specificity of 0.96 (95% CI: 0.91–0.98), with an area under the curve (AUC) of 0.96 (95% CI: 0.94–0.97). Title/abstract screening achieved a pooled sensitivity of 0.73 (95% CI: 0.57–0.85) and a pooled specificity of 0.99 (95% CI: 0.97–0.99), with an AUC of 0.97 (95% CI: 0.96–0.99). Post hoc prompt revisions improved sensitivity to 0.98 (95% CI: 0.74–1.00) while maintaining high specificity (0.98; 95% CI: 0.94–0.99). In comparison, ASReview-assisted screening had a pooled sensitivity of 0.58 (95% CI: 0.53–0.64) and a pooled specificity of 0.97 (95% CI: 0.91–0.99), with an AUC of 0.66 (95% CI: 0.62–0.70). Abstrackr-assisted screening had a pooled sensitivity of 0.48 (95% CI: 0.35–0.62) and a pooled specificity of 0.96 (95% CI: 0.88–0.99), with an AUC of 0.78 (95% CI: 0.74–0.82). A post hoc meta-analysis revealed comparable effect sizes between LLM-assisted and conventional screening.

Conclusions:

LLMs hold significant potential for streamlining literature screening in systematic reviews, reducing workload without sacrificing quality. Importantly, LLMs outperformed traditional machine learning-based tools (ASReview and Abstrackr) in both sensitivity and AUC values, suggesting that LLMs offer a more accurate and efficient approach to literature screening.

Clinical Trial: Not Applicable


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.