Explainability of Convolutional Neural Networks for Dermatological Diagnosis
ABSTRACT
Background:
Convolutional neural networks (CNNs) are the state-of-the-art AI tool for dermatological diagnosis and have been shown to achieve human-level performance when trained on a representative dataset. Explainability is a key factor in adopting such models in practice and can be examined through the attention maps of the network. However, evaluation of CNN explainability has so far been limited to visual assessment and remains qualitative, subjective, and time-consuming.
Objective:
To provide a framework for the objective, quantitative assessment of the explainability of CNNs for dermatological diagnosis, and to benchmark current models with it.
Methods:
We sourced 566 images available under the Creative Commons licence from two public datasets, DermNetNZ and SD-260, with reference diagnoses of acne, actinic keratosis, psoriasis, seborrhoeic dermatitis, viral warts, or vitiligo. Eight dermatologists with teledermatology expertise annotated each image with a diagnosis and with the diagnosis-supporting characteristics and their localisation. A total of 16 supporting visual characteristics were selected, specifically the basic terms macule, nodule, papule, patch, plaque, pustule and scale, and the additional terms closed comedo, cyst, dermatoglyph disruption, leukotrichia, open comedo, scar, sun damage, telangiectasia and thrombosed capillary. The resulting dataset consisted of 525 images with three rater annotations each. The explainability of two fine-tuned CNN models, ResNet-50 and EfficientNet-B4, was analysed with respect to the reference explanations provided by the dermatologists. Both models were pre-trained on the ImageNet natural image recognition dataset and fine-tuned on 3,214 images of the six target skin conditions from an internal clinical dataset. CNN explanations were obtained as activation maps of the models through gradient-weighted class-activation mapping (Grad-CAM). We computed the fuzzy sensitivity of each characteristic attention map with regard to both the fuzzy gold-standard characteristic attention fusion masks and the fuzzy union of all characteristics.
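To make the evaluation step concrete, the sketch below shows one way a Grad-CAM attention map could be extracted and scored against a fuzzy expert mask. The abstract does not specify the exact formulation, so the hook-based Grad-CAM, the min-based fuzzy sensitivity and specificity, the torchvision ResNet-50 backbone, and the 224x224 shapes are illustrative assumptions rather than the authors' pipeline.

```python
# Minimal sketch of the Grad-CAM + fuzzy-overlap evaluation described above.
# NOTE: illustrative assumptions only, not the authors' exact pipeline.
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.models import resnet50


def grad_cam(model, image, target_class, conv_layer):
    """Grad-CAM map for `target_class`, taken from `conv_layer` activations."""
    activations = []
    handle = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    logits = model(image.unsqueeze(0))            # image: (3, H, W) tensor
    handle.remove()

    acts = activations[0]                                      # (1, C, h, w)
    grads = torch.autograd.grad(logits[0, target_class], acts)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)             # channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))    # weighted channel sum
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    cam = cam.squeeze().detach().numpy()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalise to [0, 1]


def fuzzy_sensitivity(attention, reference):
    """Fuzzy overlap between the attention map and the expert mask (both in [0, 1])."""
    return np.minimum(attention, reference).sum() / (reference.sum() + 1e-8)


def fuzzy_specificity(attention, reference):
    """Fraction of the fuzzy background (1 - reference) that receives no attention."""
    background = 1.0 - reference
    return np.minimum(1.0 - attention, background).sum() / (background.sum() + 1e-8)


if __name__ == "__main__":
    model = resnet50(weights="IMAGENET1K_V1").eval()  # stand-in for the fine-tuned model
    image = torch.rand(3, 224, 224)                   # placeholder skin image
    reference = np.random.rand(224, 224)              # placeholder fuzzy expert mask

    cam = grad_cam(model, image, target_class=0, conv_layer=model.layer4[-1])
    print("fuzzy sensitivity:", fuzzy_sensitivity(cam, reference))
    print("fuzzy specificity:", fuzzy_specificity(cam, reference))
```

The min-based intersection used here is one common way to generalise sensitivity and specificity to soft (fuzzy) masks; a product-based t-norm would be an equally plausible choice.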
Results:
EfficientNet-B4 explainability was on average higher than that of ResNet-50 in terms of sensitivity for 13 of the 16 supporting characteristics, with average sensitivities of 0.24±0.07 for EfficientNet-B4 and 0.16±0.05 for ResNet-50, but lower in terms of specificity, which was 0.82±0.03 for EfficientNet-B4 and 0.90±0.00 for ResNet-50. All measures were within the range of the corresponding inter-rater metrics. The attached figure and table provide an illustration and full details.
Conclusions:
We objectively benchmarked the explainability of dermatological diagnosis models using expert-defined, diagnosis-supporting characteristics.