Accepted for/Published in: JMIR AI

Date Submitted: Oct 28, 2024
Date Accepted: Apr 14, 2025

The final, peer-reviewed published version of this preprint can be found here:

Mahmoudi H, Chang D, Lee H, Ghaffarzadegan N, Jalali M

Critical Assessment of Large Language Models’ (ChatGPT) Performance in Data Extraction for Systematic Reviews: Exploratory Study

JMIR AI 2025;4:e68097

DOI: 10.2196/68097

PMID: 40934529

PMCID: 12425462

A Critical Assessment of Large Language Models for Systematic Reviews: Utilizing GPT for Complex Data Extraction

  • Hesam Mahmoudi; 
  • Doris Chang; 
  • Hannah Lee; 
  • Navid Ghaffarzadegan; 
  • Mohammad Jalali

ABSTRACT

Background:

Systematic literature reviews are foundational for synthesizing evidence across diverse fields, with particular importance in guiding research and practice in health and biomedical sciences. However, they are labor-intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks, understanding their ability to accurately extract information from academic papers is critical for advancing systematic reviews.

Objective:

While previous research has assessed LLMs’ ability to extract basic information, this study explores their capability, using ChatGPT (GPT-4), to extract both explicitly outlined study characteristics and deeper, more contextual information that requires nuanced evaluation.

Methods:

Screening the full text of a sample of COVID-19 modeling studies, we analyzed three basic measures of study settings (i.e., analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in the models (i.e., mobility, risk perception, and compliance). To extract data on these measures, two researchers independently performed 60 manual codings and compared them with the outputs of 420 ChatGPT queries spanning seven prompt iterations.
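
The abstract does not describe the authors’ prompts or tooling; as a purely hypothetical sketch, assuming the OpenAI Python client, a single extraction query for one measure might look like the following (the prompt wording and measure names are illustrative, not the study’s actual protocol):

    # Hypothetical sketch only: prompt wording and measure names are illustrative.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def extract_measure(full_text: str, measure: str) -> str:
        """Ask GPT-4 to report one study measure from a paper's full text."""
        prompt = (
            f"From the following COVID-19 modeling study, report its {measure}. "
            "Answer concisely, using only information stated in the text.\n\n"
            + full_text
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # e.g., extract_measure(paper_text, "modeling approach")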

Results:

ChatGPT demonstrated 72% overall accuracy in extracting the 60 data elements, performing better on explicitly stated study settings (93%) than on subjective behavioral components (50%). Although ChatGPT’s accuracy improved as prompts were refined, the variation in accuracy across measures highlights its limitations.
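
The headline figures are simple agreement rates between the manual coding and ChatGPT’s extractions; a minimal sketch of the computation (variable names hypothetical):

    # Accuracy = share of extracted elements matching the manual coding.
    # 72% is this ratio over all 60 data elements; 93% and 50% are the same
    # ratio computed within the study-settings and behavioral-components subsets.
    def accuracy(manual: list[str], extracted: list[str]) -> float:
        matches = sum(m == e for m, e in zip(manual, extracted))
        return matches / len(manual)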

Conclusions:

We underscore LLMs’ utility in systematic reviews for extracting basic, explicit data but reveal significant limitations in handling nuanced, subjective criteria, emphasizing the current need for human oversight.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.