Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 11, 2023
Date Accepted: Mar 22, 2024
Impact of "Golden Standard" Label Errors on Evaluating Performance of Deep Learning Model in Diabetic Retinopathy Screening: A National Real-world Validation Study
ABSTRACT
Background:
For medical artificial intelligence (AI) training and validation, human expert labels are considered the "golden standard" representing the correct answers or desired outputs for a given dataset. They serve as a reference or benchmark against which a model's predictions are compared.
Objective:
This study aims to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to demonstrate how label errors may affect this assessment in a national DR screening program.
Methods:
Fundus photographs from the Lifeline Express, a nationwide DR screening program, were analyzed to identify the presence of referable DR by both (1) manual grading by English National Health Screening-certified graders and (2) a DL-based DR screening algorithm with validated laboratory performance. To assess the accuracy of the labels, a random sample of images on which the DL algorithm and the labels disagreed was adjudicated by ophthalmologists who were masked to the previous grading results. The label error rates in this sample were then used to correct the numbers of negative and positive cases in the entire dataset, yielding post-correction labels. The DL algorithm's performance was evaluated against both the pre- and post-correction labels.
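The correction step can be sketched as follows. This is a minimal illustration of the arithmetic only; the function name and all counts are hypothetical, and the study's actual adjudication and statistical procedure is more detailed than this.

```python
# Hypothetical sketch: re-estimating sensitivity/specificity after
# adjudication flips labels that were found to be erroneous.
# All numbers and names are illustrative, not the study's actual figures.

def corrected_performance(tp, fp, tn, fn, neg_error_rate, pos_error_rate):
    """Adjust a confusion matrix for label errors among disagreements.

    neg_error_rate: fraction of label-negative, algorithm-positive images
                    that adjudicators judged truly positive (label error).
    pos_error_rate: fraction of label-positive, algorithm-negative images
                    that adjudicators judged truly negative (label error).
    """
    # Apparent false positives whose labels were wrong are true positives.
    flipped_fp = fp * neg_error_rate
    # Apparent false negatives whose labels were wrong are true negatives.
    flipped_fn = fn * pos_error_rate

    tp2, fp2 = tp + flipped_fp, fp - flipped_fp
    tn2, fn2 = tn + flipped_fn, fn - flipped_fn

    sensitivity = tp2 / (tp2 + fn2)
    specificity = tn2 / (tn2 + fp2)
    return sensitivity, specificity
```

Because adjudicated label errors move images out of the algorithm's apparent error cells, even a small overall error rate can noticeably raise the estimated sensitivity and specificity.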
Results:
The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between its reported laboratory performance and its real-world performance on this national dataset, with a sensitivity decrease of 12.9% (92.5% vs. 79.6%, P<.001) and a specificity decrease of 6.9% (98.5% vs. 91.6%, P<.001). In the random sample, 63.6% (560/880) of negative images and 5.2% (140/2,710) of positive images were misclassified in the pre-correction human labels. High myopia was the primary reason for misclassifying R0 images as referable DR, while laser spots were predominantly responsible for misclassified referable cases. The estimated label error rate for the entire dataset was 1.2%. Correcting the labels was estimated to improve the DL algorithm's estimated sensitivity by approximately 12% (P<.001).
Conclusions:
Label errors in human image grading, despite accounting for only a small percentage of images, could significantly affect the performance evaluation of DL algorithms in real-world DR screening.