JMIR Preprints #87057: The Alberta Risk of Bias Assessment Tool (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

The Alberta Risk of Bias Assessment Tool (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

Carrie Ye;
Joseph Ross Mitchell;
Daniel C. Baumgart;
Zechen Ma;
Angela Lim Fung;
Daniela Garcia Orellana;
Juel Chowdhury;
Abass Abdullah;
Steven Katz;
Jacob L. Jaremko;
Pierre Boulanger;
Claire E.H. Barber;
Gillian Lemermeyer;
Hosna Jabbari;
Lili Mou;
Maryam Mirzaei;
Mary Waithera Beckett Githumbi;
Puneeta Tandon;
Randy Goebel;
Rhys Clark;
Whitney Hung;
Marjan Abbasi;
Farhad Maleki;
Scott Klarenbach;
Mohamed Abdalla

ABSTRACT

Background:

Despite the transformative potential of Large Language Models (LLMs) in healthcare, the rapid development of these tools has outpaced their rigorous evaluation. Existing risk-of-bias tools for medical research are not well-suited for the unique challenges of evaluating LLM Question-Answer (LLM-QA) studies, which creates a critical gap in assessing their safety and effectiveness.

Objective:

To develop the Alberta Risk of Bias Assessment Tool for LLM-QA studies (AQAT:RoB) to systematically evaluate the validity and risk of bias of LLM-QA studies.

Methods:

We conducted a literature review to identify the breadth of medical LLM-QA studies. Based on these studies, a draft AQAT:RoB was created and then refined through a prespecified iterative process of modified Delphi, consensus meeting, and validation. The first Delphi process ran from May 1 to May 20, 2025, and the first consensus meeting was held on May 22. The first round of validation was completed by 4 evaluators, who were not part of the development process, on 16 randomly selected studies. As this first round of validation surpassed our a priori thresholds of ≥80% agreement and a Cohen’s kappa of ≥0.61 between evaluators, no further rounds of development and validation were undertaken.
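The two agreement statistics behind the a priori thresholds can be illustrated with a short Python sketch. This is not the authors' analysis code: it covers only the two-rater case (the abstract does not state how agreement was aggregated across the 4 evaluators), and the function names and the ratings below are hypothetical, chosen only to show the arithmetic.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: sum over categories of the product of marginal proportions.
    categories = set(rater_a) | set(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def meets_thresholds(rater_a, rater_b, min_agreement=0.80, min_kappa=0.61):
    """Check both thresholds named in the Methods (>=80% agreement, kappa >=0.61)."""
    return (percent_agreement(rater_a, rater_b) >= min_agreement
            and cohens_kappa(rater_a, rater_b) >= min_kappa)


# Hypothetical ratings for 10 sub-domain judgements (not data from the study).
rater_1 = ["low", "low", "high", "unclear", "low",
           "high", "low", "high", "unclear", "low"]
rater_2 = ["low", "low", "high", "unclear", "low",
           "high", "low", "low", "unclear", "high"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
passed = meets_thresholds(rater_1, rater_2)
```

Kappa is lower than raw agreement because it subtracts the agreement expected by chance given each rater's own distribution of “low”, “high”, and “unclear” ratings, which is why the Methods set a separate threshold for each statistic.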

Results:

The AQAT:RoB consists of seven high-level domains (Questions, Reference Answers, LLM Answers, Evaluators, Outcomes, Reporting, and Other). These domains are subdivided into 12 sub-domains. Each sub-domain includes at least one “Support for Judgement” and at least one “Type of Bias” and is rated “low”, “high”, or “unclear” for risk of bias. Evaluation by independent assessors showed a percent agreement of 82.8% and a Cohen’s kappa of 0.63 between assessors.

Conclusions:

The AQAT:RoB is a reliable tool for assessing the validity/risk of bias of LLM-QA studies.

Citation

Please cite as:

Ye C, Mitchell JR, Baumgart DC, Ma Z, Fung AL, Orellana DG, Chowdhury J, Abdullah A, Katz S, Jaremko JL, Boulanger P, Barber CE, Lemermeyer G, Jabbari H, Mou L, Mirzaei M, Githumbi MWB, Tandon P, Goebel R, Clark R, Hung W, Abbasi M, Maleki F, Klarenbach S, Abdalla M

The Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

J Med Internet Res 2026;28:e87057

DOI: 10.2196/87057

PMID: 41950508

PMCID: 13061365

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 4, 2025

Date Accepted: Feb 24, 2026

Date Submitted to PubMed: Feb 26, 2026
