Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 7, 2025
Date Accepted: Oct 14, 2025
Date Submitted to PubMed: Oct 14, 2025
CRITICAL APPRAISAL TOOLS FOR ARTIFICIAL INTELLIGENCE CLINICAL STUDIES: A SCOPING REVIEW
ABSTRACT
Background:
Health research that uses predictive and/or generative AI is growing rapidly. Just as in traditional clinical studies, the way in which AI studies are conducted can introduce systematic errors. Translating this AI evidence into clinical practice and research requires critical appraisal tools for clinical decision makers and researchers.
Objective:
To identify existing tools for the critical appraisal of clinical studies that use artificial intelligence (AI) and to examine the concepts and domains these tools explore.
Methods:
Inclusion criteria followed the PCC framework. Population: artificial intelligence clinical studies. Concept: tools for critical appraisal and associated constructs, such as quality, reporting, validity, risk of bias, and applicability. Context: clinical practice. In addition, bias classification and chatbot assessment studies were included. We searched medical and engineering databases (MEDLINE, EMBASE, CINAHL, PsycINFO, and IEEE). We included clinical primary research with tools for critical appraisal. Classic reviews and systematic reviews were included in the first phase of screening; they were excluded in the second phase, after new tools had been identified by forward snowballing. We excluded nonhuman, computer, and mathematical research, as well as letters, opinion papers, and editorials. We used Rayyan for screening. Data extraction was performed by two observers, and discrepancies were resolved by discussion. The protocol was registered in advance in OSF (https://doi.org/10.17605/OSF.IO/ETYDS). We adhered to the PRISMA extension for scoping reviews (PRISMA-ScR) and to the PRISMA-S extension for reporting literature searches in systematic reviews.
Results:
We retrieved 4393 unique records for screening. After excluding 3803 records, 119 were selected for full-text screening, of which 59 were excluded. After the inclusion of 10 studies identified through other methods, a total of 70 records were finally included. Forty-six of them were reporting guidelines; 15 were tools for critical appraisal, 2 for study quality, and 2 for risk of bias. Nine papers focused on bias classification or mitigation. We found 15 chatbot assessment studies or systematic reviews of chatbot studies (6 and 9, respectively), which form a very heterogeneous group.
Conclusions:
The results depict a landscape of evidence tools in which reporting tools predominate, followed by critical appraisal tools, with few tools for risk of bias. The mismatch between bias concepts in AI and in epidemiology should be considered for critical appraisal, especially regarding fairness and bias mitigation in AI. Finally, chatbot assessment studies are a vast and evolving field in which progress in design, reporting, and critical appraisal is necessary and urgent. Trial Registration: https://doi.org/10.17605/OSF.IO/ETYDS
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.