A Critical Assessment of Large Language Models for Systematic Reviews: Utilizing GPT for Complex Data Extraction
ABSTRACT
Background:
Systematic literature reviews are foundational for synthesizing evidence across diverse fields, with particular importance in guiding research and practice in health and biomedical sciences. However, they are labor-intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks, understanding their ability to accurately extract information from academic papers is critical for advancing systematic reviews.
Objective:
While previous research has assessed LLMs’ ability to extract basic information, our study aims to explore their capability to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluations, using ChatGPT (GPT-4).
Methods:
Screening the full text of a sample of COVID-19 modeling studies, we analyzed three basic measures of study settings (i.e., analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (i.e., mobility, risk perception, and compliance). To extract data on these measures, two researchers independently produced 60 manual codings and compared them against ChatGPT's responses to 420 queries spanning seven prompt iterations.
Results:
ChatGPT demonstrated 72% overall accuracy in extracting 60 data elements, performing better on explicitly stated study settings (93%) than on subjective behavioral components (50%). While ChatGPT's accuracy improved as prompts were refined, the varying accuracy across measures highlights its limitations.
Conclusions:
We underscore LLMs’ utility in systematic reviews for basic, explicit data extraction but reveal significant limitations in handling nuanced, subjective criteria, emphasizing the current necessity for human oversight.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.