Accepted for/Published in: JMIR AI

Date Submitted: Oct 28, 2024
Date Accepted: Apr 14, 2025

The final, peer-reviewed published version of this preprint can be found here:

Mahmoudi H, Chang D, Lee H, Ghaffarzadegan N, Jalali M

Critical Assessment of Large Language Models’ (ChatGPT) Performance in Data Extraction for Systematic Reviews: Exploratory Study

JMIR AI 2025;4:e68097

DOI: 10.2196/68097

PMID: 40934529

PMCID: 12425462

A Critical Assessment of Large Language Models for Systematic Reviews: Utilizing GPT for Complex Data Extraction

  • Hesam Mahmoudi; 
  • Doris Chang; 
  • Hannah Lee; 
  • Navid Ghaffarzadegan; 
  • Mohammad Jalali

ABSTRACT

Background:

Systematic literature reviews are foundational for synthesizing evidence across diverse fields, with particular importance in guiding research and practice in health and biomedical sciences. However, they are labor-intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks, understanding their ability to accurately extract information from academic papers is critical for advancing systematic reviews.

Objective:

While previous research has assessed LLMs’ ability to extract basic information, this study explores their capability, using ChatGPT (GPT-4), to extract both explicitly outlined study characteristics and deeper, more contextual information that requires nuanced evaluation.

Methods:

Screening the full text of a sample of COVID-19 modeling studies, we analyzed three basic measures of study settings (i.e., analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in the models (i.e., mobility, risk perception, and compliance). To extract data on these measures, two researchers independently performed 60 manual codings and compared them with the outputs of 420 ChatGPT queries spanning seven prompt iterations.
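
The abstract does not describe the authors’ prompts or tooling; as a purely hypothetical sketch, assuming the OpenAI Python client, a single extraction query for one measure might look like the following (the prompt wording and measure names are illustrative, not the study’s actual protocol):

    # Hypothetical sketch only: prompt wording and measure names are illustrative.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def extract_measure(full_text: str, measure: str) -> str:
        """Ask GPT-4 to report one study measure from a paper's full text."""
        prompt = (
            f"From the following COVID-19 modeling study, report its {measure}. "
            "Answer concisely, using only information stated in the text.\n\n"
            + full_text
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # e.g., extract_measure(paper_text, "modeling approach")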

Results:

ChatGPT demonstrated 72% overall accuracy in extracting the 60 data elements, performing better on explicitly stated study settings (93%) than on subjective behavioral components (50%). Although ChatGPT’s accuracy improved as prompts were refined, the variation in accuracy across measures highlights its limitations.
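
The headline figures are simple agreement rates between the manual coding and ChatGPT’s extractions; a minimal sketch of the computation (variable names hypothetical):

    # Accuracy = share of extracted elements matching the manual coding.
    # 72% is this ratio over all 60 data elements; 93% and 50% are the same
    # ratio computed within the study-settings and behavioral-components subsets.
    def accuracy(manual: list[str], extracted: list[str]) -> float:
        matches = sum(m == e for m, e in zip(manual, extracted))
        return matches / len(manual)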

Conclusions:

We underscore LLMs’ utility in systematic reviews for extracting basic, explicit data but reveal significant limitations in handling nuanced, subjective criteria, emphasizing the current need for human oversight.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.