Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 11, 2023
Date Accepted: Mar 22, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Impact of Ground Truth Errors on Evaluating the Performance of a Deep Learning Model in Diabetic Retinopathy Screening: A National Real-World Validation Study
ABSTRACT
Background:
In deep learning model training and validation, "ground truth" (GT) refers to the accurate labels or values that represent the correct outputs for a given dataset. It serves as the reference or benchmark against which a model's predictions are compared.
Objective:
This study aims to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to demonstrate how GT errors may distort this assessment in a national DR screening program.
Methods:
Fundus photographs from the Lifeline Express, a nationwide DR screening program, were analyzed for the presence of referable DR by both (1) manual grading by English National Health Screening-certified graders (defined as the GT) and (2) a DL-based DR screening algorithm with previously validated laboratory performance. To assess the accuracy of the GT, a random sample of images on which the DL algorithm and the GT disagreed was adjudicated by ophthalmologists masked to the previous grading results. The GT error rates estimated from this sample were then used to correct the numbers of negative and positive cases in the entire dataset, yielding a post-correction GT. The DL algorithm's performance was evaluated against both the pre- and post-correction GT.
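The correction step described above can be sketched in code. The following is a minimal illustration (not the authors' implementation): error rates estimated from the adjudicated disagreement sample are used to reassign mislabeled cases in the confusion matrix, after which sensitivity and specificity are recomputed. All variable names and the example counts are illustrative assumptions, not study data.

```python
def corrected_metrics(tp, fp, tn, fn, neg_error_rate, pos_error_rate):
    """Recompute sensitivity/specificity after correcting GT labels.

    neg_error_rate: fraction of GT-negative disagreement images that
                    adjudication found to be truly positive.
    pos_error_rate: fraction of GT-positive disagreement images that
                    adjudication found to be truly negative.
    """
    # A "false positive" (DL positive, GT negative) whose GT label was
    # wrong is actually a true positive; likewise a "false negative"
    # (DL negative, GT positive) with a wrong GT label is a true negative.
    fp_to_tp = fp * neg_error_rate
    fn_to_tn = fn * pos_error_rate

    tp2, fp2 = tp + fp_to_tp, fp - fp_to_tp
    tn2, fn2 = tn + fn_to_tn, fn - fn_to_tn

    sensitivity = tp2 / (tp2 + fn2)
    specificity = tn2 / (tn2 + fp2)
    return sensitivity, specificity


# Illustrative counts only: correcting even a small share of GT labels
# can noticeably shift the apparent sensitivity of the algorithm.
sens, spec = corrected_metrics(
    tp=100, fp=50, tn=900, fn=20,
    neg_error_rate=0.636, pos_error_rate=0.0517,
)
```

Note that label errors among disagreement images always move cases out of the off-diagonal cells (FP, FN) of the confusion matrix, which is why even a small overall GT error rate can substantially narrow an apparent sensitivity gap.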
Results:
The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between its reported laboratory performance and its real-world performance on this national dataset, with a sensitivity decrease of 12.9% (92.5% vs. 79.6%, p<0.001) and a specificity decrease of 6.9% (98.5% vs. 91.6%, p<0.001). In the extracted sample with label discrepancies, 63.6% (560/880) of negative images and 5.17% (140/2,710) of positive images were misclassified in the pre-correction GT. High myopia was the primary reason R0 images were misclassified as referable, while laser spots were predominantly responsible for misclassified referable cases. The estimated GT error rate for the entire dataset was 1.21%. After GT correction, the DL algorithm's sensitivity gap was reduced to 0.4% (p<0.001).
Conclusions:
GT errors arising from human image grading, despite accounting for only a small percentage of labels, can significantly affect the measured performance of DL algorithms in real-world DR screening.