
Currently submitted to: JMIR Bioinformatics and Biotechnology

Date Submitted: Apr 28, 2026
Open Peer Review Period: May 7, 2026 - Jul 2, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Model Assisted Risk of Bias Assessment for Studies in Environmental Health: Implementation Study

  • Gaelen P. Adam

ABSTRACT

Background:

Assessing the risk of bias (RoB) of each included study is a necessary but labor-intensive step in conducting a systematic review (SR). Artificial intelligence (AI) has the potential to reduce the time and effort needed for RoB assessment and to improve consistency between reviewers, but little is known about the accuracy of AI RoB assessments.

Objective:

In this study, we aim to evaluate the performance of a large language model (LLM) in assessing RoB in observational studies of environmental exposures and health outcomes, when the model is given the questions from the RoB tool and the relevant passages from the full-text articles.

Methods:

We evaluated the performance of two LLMs (GPT-5-mini and Google Gemini-3) on 128 observational studies from an SR of per- and polyfluoroalkyl substances (PFAS) and health outcomes conducted by the National Academies of Sciences. The Navigation Guide (NavGuide), a domain-based RoB tool designed for environmental studies, was used. For each article-domain pair, the LLM was provided with the questions and guidance from the Navigation Guide and the human-identified text passages addressing that domain. The LLM returned structured RoB ratings, which were compared to the human-adjudicated ratings to quantify agreement and identify patterns of discrepancy. The protocol was prospectively registered through the Open Science Framework.

Results:

The LLMs demonstrated moderate agreement with human consensus RoB assessments for exact matches (51% to 65%), but this remained lower than the agreement among humans (88% to 91%). Performance improved substantially for partial matches, reflecting agreement in direction but not magnitude (e.g., "low" vs. "probably low"), with percent agreement of 92% and 96% for the two LLMs (98% to 99% for humans). Performance varied across the NavGuide domains, with the worst performance in domain 1 (selection bias) and the best in domain 8 (conflict of interest). The models tended to be more conservative than humans, often assigning higher risk-of-bias ratings (e.g., "probably low" when humans assigned "low").
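The distinction between exact and partial (direction-only) agreement can be sketched as follows. This is an illustrative computation, not the authors' analysis code; the four-level rating scale and the grouping of ratings into "low" and "high" directions are assumptions based on the scale names mentioned in the abstract, and all data shown are invented examples.

```python
# Assumed four-level NavGuide-style scale, collapsed to a direction
# ("low" vs. "high") for partial-match scoring.
DIRECTION = {
    "low": "low",
    "probably low": "low",
    "probably high": "high",
    "high": "high",
}

def agreement(llm_ratings, human_ratings):
    """Return (exact, partial) percent agreement for paired rating lists."""
    pairs = list(zip(llm_ratings, human_ratings))
    # Exact match: identical rating labels.
    exact = sum(a == b for a, b in pairs) / len(pairs)
    # Partial match: same direction, possibly different magnitude.
    partial = sum(DIRECTION[a] == DIRECTION[b] for a, b in pairs) / len(pairs)
    return 100 * exact, 100 * partial

# Invented example: one exact match, one direction-only match, one miss.
llm = ["low", "probably low", "high"]
human = ["low", "low", "probably low"]
exact, partial = agreement(llm, human)
```

Here "probably low" vs. "low" counts toward partial but not exact agreement, which is why the abstract's partial-match percentages are much higher than the exact-match ones.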

Conclusions:

Our results suggest that, although agreement was moderate and inconsistent across domains, LLM-based RoB assessment may be sufficiently accurate for use as a second reviewer with human oversight.

Clinical Trial:

The protocol was prospectively registered through the Open Science Framework: https://osf.io/srtqd/overview?view_only=068261b912fe40308e4f199520d6c16f


 Citation

Please cite as:

Adam GP

Large Language Model Assisted Risk of Bias Assessment for Studies in Environmental Health: Implementation Study

JMIR Preprints. 28/04/2026:99716

DOI: 10.2196/preprints.99716

URL: https://preprints.jmir.org/preprint/99716


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.