Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 11, 2023
Date Accepted: Mar 22, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Impact of Ground Truth Errors on Evaluating the Performance of a Deep Learning Model in Diabetic Retinopathy Screening: A National Real-World Validation Study
ABSTRACT
Background:
In deep learning model training and validation, "ground truth" (GT) refers to the accurate labels or values that represent the correct outputs for a given dataset. It serves as the reference or benchmark against which a model's predictions are compared.
Objective:
This study aims to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to demonstrate how GT errors may distort this assessment in a national DR screening program.
Methods:
Fundus photographs from the Lifeline Express, a nationwide DR screening program, were analyzed for the presence of referable DR by both (1) manual grading by English National Health Screening-certified graders (defined as the GT) and (2) a DL-based DR screening algorithm with previously validated laboratory performance. To assess the accuracy of the GT, a random sample of images on which the DL algorithm and the GT disagreed was adjudicated by ophthalmologists masked to the previous grading results. The GT error rates estimated from this sample were then used to correct the numbers of negative and positive cases in the entire dataset, yielding a post-correction GT. The DL algorithm's performance was evaluated against both the pre- and post-correction GT.
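The correction step described above can be sketched in code. The following is a minimal illustration (not the authors' implementation): error rates estimated from the adjudicated disagreement sample are used to reassign mislabeled cases in the confusion matrix, after which sensitivity and specificity are recomputed. All variable names and the example counts are illustrative assumptions, not study data.

```python
def corrected_metrics(tp, fp, tn, fn, neg_error_rate, pos_error_rate):
    """Recompute sensitivity/specificity after correcting GT labels.

    neg_error_rate: fraction of GT-negative disagreement images that
                    adjudication found to be truly positive.
    pos_error_rate: fraction of GT-positive disagreement images that
                    adjudication found to be truly negative.
    """
    # A "false positive" (DL positive, GT negative) whose GT label was
    # wrong is actually a true positive; likewise a "false negative"
    # (DL negative, GT positive) with a wrong GT label is a true negative.
    fp_to_tp = fp * neg_error_rate
    fn_to_tn = fn * pos_error_rate

    tp2, fp2 = tp + fp_to_tp, fp - fp_to_tp
    tn2, fn2 = tn + fn_to_tn, fn - fn_to_tn

    sensitivity = tp2 / (tp2 + fn2)
    specificity = tn2 / (tn2 + fp2)
    return sensitivity, specificity


# Illustrative counts only: correcting even a small share of GT labels
# can noticeably shift the apparent sensitivity of the algorithm.
sens, spec = corrected_metrics(
    tp=100, fp=50, tn=900, fn=20,
    neg_error_rate=0.636, pos_error_rate=0.0517,
)
```

Note that label errors among disagreement images always move cases out of the off-diagonal cells (FP, FN) of the confusion matrix, which is why even a small overall GT error rate can substantially narrow an apparent sensitivity gap.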
Results:
The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between its reported laboratory performance and its real-world performance on this national dataset, with a sensitivity decrease of 12.9% (92.5% vs. 79.6%, p<0.001) and a specificity decrease of 6.9% (98.5% vs. 91.6%, p<0.001). In the extracted sample with label discrepancies, 63.6% (560/880) of negative images and 5.17% (140/2,710) of positive images were misclassified in the pre-correction GT. High myopia was the primary reason R0 images were misclassified as referable, while laser spots were predominantly responsible for misclassified referable cases. The estimated GT error rate for the entire dataset was 1.21%. After GT correction, the DL algorithm's sensitivity gap was reduced to 0.4% (p<0.001).
Conclusions:
GT errors arising from human image grading, despite accounting for only a small percentage of labels, can significantly affect the measured performance of DL algorithms in real-world DR screening.