Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 11, 2023
Date Accepted: Mar 22, 2024

The final, peer-reviewed published version of this preprint can be found here:

Impact of Gold-Standard Label Errors on Evaluating Performance of Deep Learning Models in Diabetic Retinopathy Screening: Nationwide Real-World Validation Study

Wang Y, Li C, Luo L, Yin Q, Zhang J, Peng G, Shi D, Han X, He M

J Med Internet Res 2024;26:e52506

DOI: 10.2196/52506

PMID: 39141915

PMCID: 11358665

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Impact of ground truth errors on evaluating performance of deep learning model in diabetic retinopathy screening: A National Real-world Validation Study

  • Yueye Wang; 
  • Cong Li; 
  • Lixia Luo; 
  • Qiuxia Yin; 
  • Jian Zhang; 
  • Guankai Peng; 
  • Danli Shi; 
  • Xiaotong Han; 
  • Mingguang He

ABSTRACT

Background:

In deep learning model training and validation, "ground truth" (GT) refers to the true and accurate labels or values that represent the correct answers or desired outputs for a given dataset. It serves as a reference or benchmark against which the model's predictions are compared.

Objective:

This study aims to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to demonstrate how ground truth errors may affect this assessment in a national DR screening program.

Methods:

Fundus photographs from the Lifeline Express, a nationwide DR screening program, were analyzed to identify the presence of referable DR by both (1) manual grading by English National Health Screening-certificated graders (defined as the GT) and (2) a DL-based DR screening algorithm with good validated lab performance. To assess the accuracy of the GT, a random sample of images on which the DL algorithm and the GT disagreed was adjudicated by ophthalmologists who were masked to the previous grading results. The GT error rates in this sample were then used to correct the number of negative and positive cases in the entire dataset, yielding the post-correction GT. The DL algorithm's performance was evaluated against both the pre- and post-correction GT.
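The correction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Confusion` class, the `correct_for_gt_errors` function, and every count and error rate in the example are hypothetical. It also assumes that only the disagreement cells (GT positive with DL negative, and GT negative with DL positive) are corrected, since only disagreeing images were adjudicated.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion matrix of DL predictions against a reference standard (GT)."""
    tp: float  # GT positive, DL positive
    fp: float  # GT negative, DL positive
    fn: float  # GT positive, DL negative
    tn: float  # GT negative, DL negative

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


def correct_for_gt_errors(c: Confusion,
                          fn_gt_error_rate: float,
                          fp_gt_error_rate: float) -> Confusion:
    """Re-allocate disagreement cells using GT error rates from an adjudicated sample.

    fn_gt_error_rate: assumed share of (GT positive, DL negative) images whose
        GT label was wrong, so they are truly negative and become true negatives.
    fp_gt_error_rate: assumed share of (GT negative, DL positive) images whose
        GT label was wrong, so they are truly positive and become true positives.
    """
    moved_fn = c.fn * fn_gt_error_rate
    moved_fp = c.fp * fp_gt_error_rate
    return Confusion(
        tp=c.tp + moved_fp,
        fp=c.fp - moved_fp,
        fn=c.fn - moved_fn,
        tn=c.tn + moved_fn,
    )


# Hypothetical counts and rates, chosen only to make the arithmetic easy to follow.
before = Confusion(tp=800, fp=900, fn=200, tn=9100)
after = correct_for_gt_errors(before, fn_gt_error_rate=0.5, fp_gt_error_rate=0.1)

print(f"sensitivity: {before.sensitivity:.3f} -> {after.sensitivity:.3f}")
print(f"specificity: {before.specificity:.3f} -> {after.specificity:.3f}")
```

Because the sample was drawn only from disagreements, this sketch leaves the agreement cells untouched; any GT errors among images where the graders and the algorithm agreed would not be captured by it.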

Results:

The analysis included 736,083 images from 237,824 participants. The DL algorithm exhibited a gap between its reported lab performance and its real-world performance in this national dataset, with a sensitivity decrease of 12.9% (92.5% vs. 79.6%, p<0.001) and a specificity decrease of 6.9% (98.5% vs. 91.6%, p<0.001). In the extracted sample with label discrepancies, 63.6% (560/880) of negative images and 5.17% (140/2,710) of positive images were misclassified in the pre-correction GT. High myopia was the primary reason for misclassifying R0 images as referable, while laser spots were the predominant cause of misclassified referable cases. The estimated GT error rate for the entire dataset was 1.21%. After GT correction, the DL algorithm's performance gap in sensitivity was significantly reduced to 0.4% (p<0.001).
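The percentages reported above can be reproduced from the stated counts with a few lines of arithmetic. The variable names are illustrative. The final figure, an approximate count of mislabeled images, is not reported in the abstract; it is derived here on the assumption that the 1.21% error rate applies to all 736,083 images.

```python
# Counts and percentages taken from the Results paragraph.
total_images = 736_083
gt_error_rate = 0.0121

# GT error rates within the adjudicated sample.
negative_error_pct = round(560 / 880 * 100, 1)    # 63.6
positive_error_pct = round(140 / 2710 * 100, 2)   # 5.17

# Gap between lab and real-world performance before GT correction.
sensitivity_gap = round(92.5 - 79.6, 1)  # 12.9
specificity_gap = round(98.5 - 91.6, 1)  # 6.9

# Derived, not reported: approximate number of mislabeled images,
# assuming the 1.21% rate applies to the whole dataset.
approx_mislabeled = round(total_images * gt_error_rate)

print(negative_error_pct, positive_error_pct)
print(sensitivity_gap, specificity_gap)
print(approx_mislabeled)
```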

Conclusions:

GT errors arising from human image grading, although small in percentage, can significantly affect the measured performance of DL algorithms in real-world DR screening.


