
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 14, 2023
Date Accepted: Sep 28, 2023

The final, peer-reviewed published version of this preprint can be found here:

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study


Automated Paper Screening for Clinical Reviews: An Analysis Using Large Language Models

  • Eddie Guo; 
  • Mehul Gupta; 
  • Jiawen Deng; 
  • Ye-Jean Park; 
  • Mike Paget; 
  • Christopher Naugler

ABSTRACT

Background:

The systematic review process, particularly the stage of screening titles and abstracts for relevance, is a labor-intensive task susceptible to inadvertent human errors and subjective biases. Advancements in natural language processing algorithms, especially large language models, offer promising solutions for the semi-automation of this task.

Objective:

To assess the performance of the OpenAI ChatGPT and GPT-4 APIs in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review datasets and to compare their performance against ground truth labeling by two independent human reviewers.

Methods:

We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to call the API with the screening criteria expressed in natural language, applied to a corpus of title and abstract datasets that had each been screened by a minimum of two human reviewers. We compared the performance of our workflow against the human reviewers' decisions across six review papers, screening over 24,000 titles and abstracts.
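The abstract does not reproduce the screening script itself. Below is a minimal sketch of what one such API call might look like, assuming the pre-1.0 openai Python package; the model name, prompt wording, criteria text, and the screen_record helper are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of an LLM-based title/abstract screener.
# Illustrative reconstruction only; not the authors' script.
# Assumes the pre-1.0 openai package and OPENAI_API_KEY in the environment.
import openai

# Screening criteria stated in natural language (placeholder wording).
SCREENING_CRITERIA = (
    "Include randomized trials of intervention X in adult patients; "
    "exclude reviews, editorials, case reports, and animal studies."
)

def screen_record(title: str, abstract: str, model: str = "gpt-4") -> str:
    """Ask the model to label one record as INCLUDE or EXCLUDE."""
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0,  # keep outputs as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You screen papers for a systematic review. "
                        "Answer with exactly one word: INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Criteria:\n{SCREENING_CRITERIA}\n\n"
                        f"Title: {title}\nAbstract: {abstract}"},
        ],
    )
    return response["choices"][0]["message"]["content"].strip()
```

Looping a call like this over every record in a dataset and comparing the labels with the human consensus yields accuracy and sensitivity figures of the kind reported below.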

Results:

Our results show an accuracy of 0.91, a sensitivity of 0.91 for excluded papers, and a sensitivity of 0.76 for included papers. The interrater agreement between the two independent human screeners was kappa=0.46, and the prevalence- and bias-adjusted kappa between our proposed workflow and the consensus-based human decisions was 0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions when asked to explain their reasoning for incorrect classifications.
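The two agreement statistics reported above are standard for two raters with binary include/exclude labels; a short sketch of their definitions follows. The toy labels in the usage comments are illustrative, not study data.

```python
# Standard agreement statistics for two raters with binary (0/1) labels.
def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    p_a1, p_b1 = sum(a) / n, sum(b) / n            # each rater's inclusion rate
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

def pabak(a: list[int], b: list[int]) -> float:
    """Prevalence- and bias-adjusted kappa: 2 * p_o - 1."""
    p_o = sum(x == y for x, y in zip(a, b)) / len(a)
    return 2 * p_o - 1

# Example with toy labels (not study data):
# cohen_kappa([1, 0, 1, 1], [1, 0, 0, 1])  -> 0.5
# pabak([1, 0, 1, 1], [1, 0, 0, 1])        -> 0.5
```

PABAK replaces the empirical chance-agreement term with 0.5, which is why it can be much higher than Cohen's kappa when, as in abstract screening, most records fall into one category (excluded).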

Conclusions:

Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By streamlining the screening workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.


Citation

Please cite as:

Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study

J Med Internet Res 2024;26:e48996

DOI: 10.2196/48996

PMID: 38214966

PMCID: 10818236


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.