Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 4, 2025
Date Accepted: Feb 24, 2026
Date Submitted to PubMed: Feb 26, 2026

The final, peer-reviewed published version of this preprint can be found here:

The Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

Ye C, Mitchell JR, Baumgart DC, Ma Z, Fung AL, Orellana DG, Chowdhury J, Abdullah A, Katz S, Jaremko JL, Boulanger P, Barber CE, Lemermeyer G, Jabbari H, Mou L, Mirzaei M, Githumbi MWB, Tandon P, Goebel R, Clark R, Hung W, Abbasi M, Maleki F, Klarenbach S, Abdalla M

The Alberta Quality Assessment Tool: Risk of Bias (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

J Med Internet Res 2026;28:e87057

DOI: 10.2196/87057

PMID: 41950508

The Alberta Risk of Bias Assessment Tool (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation

  • Carrie Ye
  • Joseph Ross Mitchell
  • Daniel C. Baumgart
  • Zechen Ma
  • Angela Lim Fung
  • Daniela Garcia Orellana
  • Juel Chowdhury
  • Abass Abdullah
  • Steven Katz
  • Jacob L. Jaremko
  • Pierre Boulanger
  • Claire E.H. Barber
  • Gillian Lemermeyer
  • Hosna Jabbari
  • Lili Mou
  • Maryam Mirzaei
  • Mary Waithera Beckett Githumbi
  • Puneeta Tandon
  • Randy Goebel
  • Rhys Clark
  • Whitney Hung
  • Marjan Abbasi
  • Farhad Maleki
  • Scott Klarenbach
  • Mohamed Abdalla

ABSTRACT

Background:

Despite the transformative potential of Large Language Models (LLMs) in healthcare, the rapid development of these tools has outpaced their rigorous evaluation. Existing risk-of-bias tools for medical research are not well-suited for the unique challenges of evaluating LLM Question-Answer (LLM-QA) studies, which creates a critical gap in assessing their safety and effectiveness.

Objective:

To develop the Alberta Risk of Bias Assessment Tool for LLM-QA studies (AQAT:RoB) to systematically evaluate validity and risk of bias of LLM-QA studies.

Methods:

We conducted a literature review to identify the breadth of medical LLM-QA studies. Based on these studies, a draft AQAT:RoB was created for further refinement through a prespecified iterative process of modified Delphi, consensus meeting, and validation. The first Delphi process occurred between May 1 and May 20, 2025, and the first consensus meeting was held on May 22. The first round of validation was completed on 16 randomly selected studies by 4 evaluators who were not part of the development process. As this first round of validation surpassed our a priori threshold of ≥80% agreement and a Cohen’s kappa of ≥0.61 between evaluators, no further rounds of development and validation were undertaken.
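For readers unfamiliar with the two agreement statistics named in the threshold, the following is a minimal sketch of how percent agreement and Cohen's kappa can be computed for a pair of raters. It is an illustration only, not the authors' analysis code: the function names, the example ratings, and the choice to compare exactly two raters are assumptions, and the abstract does not state how ratings from the 4 evaluators were pooled.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    # Expected chance agreement from each rater's marginal frequencies.
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def meets_threshold(rater_a, rater_b, min_agreement=0.80, min_kappa=0.61):
    """Check both cut-offs quoted in the Methods (defaults mirror them)."""
    return (percent_agreement(rater_a, rater_b) >= min_agreement
            and cohens_kappa(rater_a, rater_b) >= min_kappa)


if __name__ == "__main__":
    # Hypothetical ratings on the tool's three-level scale.
    a = ["low", "low", "high", "unclear", "low", "high"]
    b = ["low", "low", "high", "unclear", "high", "high"]
    print(percent_agreement(a, b), cohens_kappa(a, b), meets_threshold(a, b))
```

Kappa discounts the agreement expected by chance from each rater's marginal rating frequencies, which is why a threshold on kappa is stated alongside raw percent agreement.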

Results:

The AQAT:RoB consists of seven high-level domains (Questions, Reference Answers, LLM Answers, Evaluators, Outcomes, Reporting, and Other). These domains are subdivided into 12 sub-domains. Each sub-domain includes at least one “Support for Judgement” and at least one “Type of Bias” and is to be rated “low”, “high”, or “unclear” for risk of bias. Evaluation by independent assessors showed a percent agreement of 82.8% and a Cohen’s kappa of 0.63 between assessors.

Conclusions:

The AQAT:RoB is a reliable tool for assessing the validity/risk of bias of LLM-QA studies.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.