Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 4, 2025
Date Accepted: Feb 24, 2026
Date Submitted to PubMed: Feb 26, 2026
The Alberta Risk of Bias Assessment Tool (AQAT:RoB) for the Evaluation of Medical Large Language Model Question-Answer Studies: Development and Pilot Validation
ABSTRACT
Background:
Despite the transformative potential of Large Language Models (LLMs) in healthcare, the rapid development of these tools has outpaced their rigorous evaluation. Existing risk-of-bias tools for medical research are not well-suited for the unique challenges of evaluating LLM Question-Answer (LLM-QA) studies, which creates a critical gap in assessing their safety and effectiveness.
Objective:
To develop the Alberta Risk of Bias Assessment Tool for LLM-QA studies (AQAT:RoB), a tool for systematically evaluating the validity and risk of bias of such studies.
Methods:
We conducted a literature review to identify the breadth of medical LLM-QA studies. Based on these studies, a draft AQAT:RoB was created for further refinement through a pre-specified iterative process consisting of a modified Delphi process, a consensus meeting, and validation. The first Delphi process occurred between May 1 and May 20, 2025, and the first consensus meeting was held on May 22, 2025. The first round of validation was completed on 16 randomly selected studies by 4 evaluators who were not part of the development process. As this first round of validation surpassed our a priori thresholds of ≥80% agreement and a Cohen’s kappa of ≥0.61 between evaluators, no further rounds of development and validation were undertaken.
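The two agreement statistics named in the a priori threshold can be illustrated with a minimal sketch for a single pair of raters. This is not the authors' analysis code: the abstract does not state how agreement was aggregated across the 4 evaluators, so the function names, the two-rater setup, and the example ratings below are all assumptions made for illustration.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Fraction of items on which two raters gave the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(ratings_a)
    p_o = percent_agreement(ratings_a, ratings_b)  # observed agreement
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = counts_a.keys() | counts_b.keys()
    # Agreement expected by chance, from each rater's marginal frequencies.
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both raters use one identical category;
        # this sketch treats that case as perfect agreement (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def meets_threshold(ratings_a, ratings_b, min_agreement=0.80, min_kappa=0.61):
    """Check both thresholds stated in the Methods (>=80% and kappa >=0.61)."""
    return (percent_agreement(ratings_a, ratings_b) >= min_agreement
            and cohens_kappa(ratings_a, ratings_b) >= min_kappa)


# Hypothetical ratings for 10 sub-domain judgements (not data from the study).
rater_a = ["low", "low", "high", "unclear", "low",
           "high", "low", "low", "high", "low"]
rater_b = ["low", "low", "high", "low", "low",
           "high", "low", "unclear", "high", "low"]

print(percent_agreement(rater_a, rater_b))
print(cohens_kappa(rater_a, rater_b))
print(meets_threshold(rater_a, rater_b))
```

Kappa corrects the raw agreement for the agreement expected by chance, which is why it is reported alongside percent agreement rather than in place of it.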
Results:
The AQAT:RoB consists of seven high-level domains (Questions, Reference Answers, LLM Answers, Evaluators, Outcomes, Reporting, and Other). These domains are divided into 12 sub-domains. Each sub-domain includes at least one “Support for Judgement” and at least one “Type of Bias” and is to be rated “low”, “high”, or “unclear” for risk of bias. Evaluation by independent assessors showed a percent agreement of 82.8% and a Cohen’s kappa of 0.63 between assessors.
Conclusions:
The AQAT:RoB is a reliable tool for assessing the validity/risk of bias of LLM-QA studies.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.