Accepted for/Published in: Interactive Journal of Medical Research
Date Submitted: Nov 19, 2023
Date Accepted: Jan 26, 2024
Date Submitted to PubMed: Jan 26, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
METRICS: Establishing a Preliminary Checklist to Standardize Design and Reporting of Artificial Intelligence-Based Studies in Healthcare
ABSTRACT
Background:
Adherence to evidence-based practice is indispensable in healthcare. Recently, the utility of artificial intelligence (AI)-based models in healthcare has been evaluated extensively. However, the lack of consensus guidelines for the design and reporting of findings in these studies poses challenges to the interpretation and synthesis of evidence.
Objective:
To propose a preliminary framework forming the basis of comprehensive guidelines to standardize reporting of AI-based studies in healthcare education and practice.
Methods:
A systematic literature review was conducted on Scopus, PubMed, and Google Scholar. Published records with “ChatGPT”, “Bing”, or “Bard” in the title were retrieved. The methodologies employed in the included records were carefully examined to identify common pertinent themes and gaps in reporting. A panel discussion followed to establish a unified and thorough reporting checklist. Two independent raters then tested the finalized checklist on the included records, with Cohen’s κ used to evaluate inter-rater reliability.
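The inter-rater reliability statistic named above, Cohen’s κ, compares the observed agreement between two raters against the agreement expected by chance. A minimal sketch of the computation follows; the ratings shown are hypothetical illustrations, not the study’s data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal category frequencies.
    """
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the marginal frequency of each category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters scoring 10 records on one checklist item.
a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
print(round(cohens_kappa(a, b), 3))  # → 0.524
```

In practice, κ near 1 indicates near-perfect agreement and κ near 0 indicates agreement no better than chance; library implementations (e.g., scikit-learn’s `cohen_kappa_score`) give the same result for the two-rater case.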
Results:
The final dataset that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included nine pertinent themes collectively referred to as “METRICS”: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and inter-rater reliability; (8) Count of queries executed to test the model; (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0±0.58. Inter-rater reliability of the METRICS score was acceptable, with Cohen’s κ ranging from 0.558 to 0.962 (P<.001 for all nine tested items). Per item, the highest average METRICS score was recorded for the “Model” item, followed by the “Specificity of the prompts and language used” item, while the lowest scores were recorded for the “Randomization of selecting the queries” item, classified as sub-optimal, and the “Individual factors in selecting the queries and inter-rater reliability” item, classified as satisfactory.
Conclusions:
The findings highlighted the need for standardized reporting algorithms for AI-based studies in healthcare, given the variability observed in methodologies and reporting. The proposed METRICS checklist could be a helpful preliminary step toward establishing a universally accepted approach to standardize reporting in AI-based studies in healthcare, a swiftly evolving research topic.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.