Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Aug 22, 2021
Date Accepted: Oct 12, 2021
Differential biases and variabilities of deep-learning-based artificial intelligence and human experts in clinical diagnosis: A retrospective cohort and survey study
ABSTRACT
Background:
Deep learning (DL) based artificial intelligence may have diagnostic characteristics different from those of human experts in medical diagnosis. As a data-driven knowledge system, DL is considered more susceptible than clinicians to bias from heterogeneous disease incidence in real-world clinical populations. Conversely, because human experts learn from a limited number of cases, they may exhibit large inter-individual variability. Thus, understanding how the two groups classify the same data differently is an essential step toward the cooperative use of DL in clinical applications.
Objective:
To evaluate and compare how clinical experience differentially affects otoendoscopic image diagnosis by DL models and physicians, as exemplified by the class imbalance problem, and to guide clinicians in utilizing decision support systems.
Methods:
We collected a total of 22,707 digital otoendoscopic images of patients who visited the otorhinolaryngology outpatient clinic at Severance Hospital, Seoul, South Korea, from January 2013 to June 2019. After excluding near-duplicate images, 7,500 otoendoscopic images were selected for labeling. We built a DL-based image classification model to classify a given image into one of six disease categories. Two test sets of 300 images each were constructed: one class-balanced and one imbalanced. Fourteen clinicians (otolaryngologists and non-otolaryngology physicians, including general practitioners) and 13 DL-based models were evaluated. We compared the results of individual physicians and models using accuracy (overall and per-class) and kappa statistics.
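The two comparison metrics named above — per-class accuracy and the kappa statistic (Cohen's kappa, measuring agreement beyond chance between two raters' label sequences) — can be sketched with standard-library Python. The function names and the toy labels are illustrative, not from the study's code.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of samples classified correctly within each true class."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / totals[c] for c in totals}

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    p_obs = sum(t == p for t, p in zip(y_true, y_pred)) / n
    ct, cp = Counter(y_true), Counter(y_pred)
    # Chance agreement: product of each rater's marginal class frequencies.
    p_exp = sum(ct[c] * cp[c] for c in classes) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example with two disease labels:
y_true = ["otitis", "otitis", "normal", "normal"]
y_pred = ["otitis", "otitis", "normal", "otitis"]
print(per_class_accuracy(y_true, y_pred))  # {'otitis': 1.0, 'normal': 0.5}
print(cohens_kappa(y_true, y_pred))        # 0.5
```

A high kappa among the ML models (as reported below) indicates they make highly similar predictions to one another; a lower kappa among physicians reflects greater inter-rater variability.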
Results:
Our ML models had consistently high accuracies (77.14±1.83% on the balanced and 82.03±3.06% on the imbalanced test set), equivalent to otolaryngologists (71.17±3.37% balanced, 72.83±6.41% imbalanced) and far better than non-otolaryngologists (45.63±7.89% balanced, 44.08±15.83% imbalanced). However, the ML models suffered from the class imbalance problem (77.14±1.83% vs 82.03±3.06% on the balanced and imbalanced test sets, respectively). This was mitigated by data augmentation, particularly for low-incidence classes, but per-class accuracies remained low for rare disease classes. Human physicians, despite being less affected by prevalence, showed high inter-physician variability (kappa=0.60±0.07, vs 0.83±0.02 for ML models).
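The data augmentation described above counters class imbalance by increasing the representation of low-incidence classes in training. One common, minimal form is random oversampling: duplicating examples of rare classes (with replacement) until every class matches the size of the most frequent one. The sketch below illustrates only this resampling step; the study's actual augmentation pipeline (e.g., any image transformations) is not specified here, and the function name is hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample_to_balance(samples, labels, seed=0):
    """Randomly duplicate rare-class samples until all classes have
    as many examples as the most frequent class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, l in zip(samples, labels):
        by_class[l].append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        out_samples.extend(items)
        out_labels.extend([label] * len(items))
        # Draw extra copies with replacement for under-represented classes.
        extras = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(extras)
        out_labels.extend([label] * len(extras))
    return out_samples, out_labels

samples = ["img1", "img2", "img3", "img4", "img5"]
labels = ["common", "common", "common", "common", "rare"]
s2, l2 = oversample_to_balance(samples, labels)
print(Counter(l2))  # Counter({'common': 4, 'rare': 4})
```

Note that oversampling alone rebalances class frequencies but adds no new visual information, which is consistent with the finding that per-class accuracy for rare diseases remained low even after augmentation.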
Conclusions:
Even though ML models deliver excellent performance in classifying ear disease, physicians and ML models each have their own strengths. To deliver the best patient care given the shortage of otolaryngologists, our ML model can serve a cooperative role for clinicians of diverse expertise, provided users keep in mind that models may remain biased toward prevalent diseases even after data augmentation.
Keywords: Human-machine cooperation; Convolutional neural network; Deep learning; Class imbalance problem; Otoscopy; Eardrum; Artificial intelligence; Otology; Computer-aided diagnosis
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.