Assessing Generalisability of Deep Learning Models Trained on Standardised and Non-Standardised Images and Their Performance Against Teledermatologists
ABSTRACT
Background:
Convolutional neural networks (CNNs) are a type of artificial intelligence (AI) that shows promise as a diagnostic aid for skin cancer. However, most are trained on retrospective image datasets of varying quality and inconsistent image capture standardisation.
Objective:
The objective of our study was to use CNN models with the same architecture but different training image sets, and to test variability in their performance when classifying skin cancer images from different populations acquired with different devices. Additionally, we assessed the performance of the models against Danish teledermatologists on images acquired in Denmark.
Methods:
Three CNNs with the same architecture were trained. CNN-NS was trained on 25,331 non-standardised images from the International Skin Imaging Collaboration, captured with a variety of devices. CNN-S was trained on 235,268 standardised images, and CNN-S2 was trained on 25,331 standardised images (matched to CNN-NS for number and classes of training images). Both standardised training sets were provided by Molemap and captured with the same device. In total, 569 images of skin lesions from 495 Danish patients, predominantly of Fitzpatrick skin types II and III, were used to test the performance of the models. Four teledermatologists independently diagnosed and assessed the images of the lesions. Primary outcome measures were sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC).
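The primary outcome measures can be made concrete with a small sketch. The snippet below (toy labels and scores, not the study's data) computes sensitivity and specificity at a fixed operating threshold, and AUROC via its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, y_score):
    """Probability a random positive outscores a random negative
    (ties count 0.5) -- equal to the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = malignant, 0 = benign; scores are model outputs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)  # 0.75, 0.75
area = auroc(y_true, y_score)                         # 0.9375
```

Sweeping the threshold trades sensitivity against specificity, which is how a CNN's operating point can be matched to the mean sensitivity or specificity of the teledermatologists for comparison.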
Results:
In total, 569 images from 495 patients (280 women [57%], 215 men [43%]; mean age 55 years [SD 17]) were used in this study. On these images, CNN-S achieved an AUROC of 0.861 (CI 0.830-0.889; P<.001) and CNN-S2 achieved an AUROC of 0.831 (CI 0.798-0.861; P=.009), with both outperforming CNN-NS, which achieved an AUROC of 0.759 (CI 0.722-0.794; P<.001 and P=.009, respectively) (Figure 1). When the CNNs were matched to the mean sensitivity and specificity of the teledermatologists, the models' resultant sensitivities and specificities were surpassed by the teledermatologists (Table 1). However, for CNN-S the differences were not statistically significant (P=.10 and P=.053). Performance of all CNN models, as well as of the teledermatologists, was influenced by image quality.
Conclusions:
CNNs trained on standardised images showed improved performance, and therefore greater generalisability, in skin cancer classification when applied to an unseen dataset. This is an important consideration for future algorithm development, regulation, and approval. Furthermore, on these unseen test images the teledermatologists clinically outperformed all the CNN models; however, the difference relative to CNN-S was not statistically significant.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.