Evaluating the Accuracy of a New Artificial Intelligence-Based Symptom Checker: A Clinical Vignette Study
ABSTRACT
Background:
Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives. In particular, patients are increasingly using them to understand the underlying causes of their symptoms. It is therefore essential to rigorously investigate and report their performance using standard clinical and scientific approaches.
Objective:
To evaluate the accuracy of a new artificial intelligence (AI)-based symptom checker and compare it against that of several popular symptom checkers and seasoned primary care physicians.
Methods:
We propose a comprehensive 4-stage experimental methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of the symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average of 16.6 years of experience. To measure accuracy, we used 7 standard metrics, including (a) M1, which measures a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of its differential list; (b) F1-score, a trade-off measure between sensitivity and precision; and (c) normalized discounted cumulative gain (NDCG), a measure of a differential list's ranking quality.
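For concreteness, the following is a minimal sketch of how these three headline metrics are conventionally computed; the function names and data structures are illustrative assumptions, not the paper's actual implementation.

```python
import math

def m1(differential, main_diagnosis):
    """M1: 1 if the vignette's main diagnosis tops the differential list, else 0."""
    return 1.0 if differential and differential[0] == main_diagnosis else 0.0

def f1(differential, relevant):
    """F1-score: harmonic mean of precision and recall (sensitivity), computed
    over the returned differential versus the vignette's relevant diagnoses."""
    hits = len(set(differential) & set(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(differential)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def ndcg(differential, relevance):
    """NDCG: discounted cumulative gain of the returned ranking, normalized by
    the gain of the ideal ordering. `relevance` maps diagnosis -> graded score."""
    gains = [relevance.get(dx, 0) for dx in differential]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[: len(differential)]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Under these conventions, each tool's metric is averaged over the 400 vignettes before comparison.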
Results:
The new AI-based symptom checker, namely Avey, significantly outperformed 5 popular symptom checkers, namely, Ada, WebMD, K-Health, Buoy, and Babylon, by averages of 24.5%, 175.5%, 142.8%, 159.6%, and 2968.1% using M1; 8.7%, 88.9%, 66.4%, 88.9%, and 2084% using F1-score; and 21.2%, 93.4%, 113.3%, 136.4%, and 3091.6% using NDCG, respectively. In contrast, the physicians slightly outpaced Avey by an average of 1.2% using F1-score, while Avey exceeded them by averages of 10.2% and 25.1% using M1 and NDCG, respectively.
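To aid interpretation of these figures, here is a minimal sketch of the arithmetic, assuming (as is conventional, though not spelled out in the abstract) that each percentage is the relative difference between mean metric scores:

```python
def relative_gain(score, baseline):
    """Percentage by which `score` exceeds `baseline` (both assumed to be mean
    metric values). For example, a 2968.1% gain corresponds to a mean score
    roughly 30.7 times the baseline's, since 1 + 2968.1/100 = 30.681."""
    return 100.0 * (score - baseline) / baseline
```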
Conclusions:
Avey demonstrated superior performance relative to the 5 considered symptom checkers and compared favorably to a panel of experienced and independent physicians.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.