
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 11, 2023
Date Accepted: Mar 22, 2024

The final, peer-reviewed published version of this preprint can be found here:

Impact of Gold-Standard Label Errors on Evaluating Performance of Deep Learning Models in Diabetic Retinopathy Screening: Nationwide Real-World Validation Study

Wang Y, Han X, Li C, Luo L, Yin Q, Zhang J, Peng G, Shi D, He M

Impact of Gold-Standard Label Errors on Evaluating Performance of Deep Learning Models in Diabetic Retinopathy Screening: Nationwide Real-World Validation Study

J Med Internet Res 2024;26:e52506

DOI: 10.2196/52506

PMID: 39141915

PMCID: 11358665

Impact of "Golden Standard" Label Errors on Evaluating Performance of Deep Learning Model in Diabetic Retinopathy Screening: A National Real-world Validation Study

  • Yueye Wang; 
  • Xiaotong Han; 
  • Cong Li; 
  • Lixia Luo; 
  • Qiuxia Yin; 
  • Jian Zhang; 
  • Guankai Peng; 
  • Danli Shi; 
  • Mingguang He

ABSTRACT

Background:

For medical artificial intelligence (AI) training and validation, human expert labels are considered the "gold standard": the correct answers or desired outputs for a given dataset. They serve as a reference or benchmark against which the model's predictions are compared.

Objective:

This study aims to assess the accuracy of a custom deep learning (DL) algorithm in classifying diabetic retinopathy (DR) and to demonstrate how label errors may affect this assessment in a national DR screening program.

Methods:

Fundus photographs from the Lifeline Express, a nationwide DR screening program, were analyzed for the presence of referable DR by (1) manual grading by graders certified under the English National Health Screening program and (2) a DL-based DR screening algorithm with previously validated laboratory performance. To assess label accuracy, a random sample of images on which the DL algorithm and the labels disagreed was adjudicated by ophthalmologists masked to the previous grading results. The label error rates in this sample were then used to correct the numbers of negative and positive cases in the entire dataset, yielding post-correction labels. The DL algorithm's performance was evaluated against both the pre- and post-correction labels.
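The correction step described above amounts to flipping the adjudicated fraction of disagreement cases in the confusion matrix and recomputing the metrics. A minimal sketch, assuming this interpretation; the function name and all counts below are illustrative, not values from the study:

```python
def corrected_metrics(tp, fp, tn, fn, fp_label_error, fn_label_error):
    """Recompute sensitivity/specificity after correcting label errors.

    fp_label_error: fraction of "false positives" (DL positive, label
        negative) whose label was judged wrong on adjudication.
    fn_label_error: fraction of "false negatives" (DL negative, label
        positive) whose label was judged wrong on adjudication.
    """
    # Disagreement cases whose labels flip move across the confusion matrix.
    flipped_fp = fp * fp_label_error  # DL was right: FP -> TP
    flipped_fn = fn * fn_label_error  # DL was right: FN -> TN
    tp, fp = tp + flipped_fp, fp - flipped_fp
    tn, fn = tn + flipped_fn, fn - flipped_fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Illustrative example (hypothetical counts): label correction raises
# both estimated metrics, mirroring the effect reported in the Results.
sens, spec = corrected_metrics(tp=800, fp=200, tn=9000, fn=100,
                               fp_label_error=0.636, fn_label_error=0.052)
```

Because corrected labels can only move disagreement cases in the DL algorithm's favor, both estimated sensitivity and specificity rise after correction.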

Results:

The analysis included 736,083 images from 237,824 participants. The DL algorithm showed a gap between its reported laboratory performance and its real-world performance on this national dataset: sensitivity decreased by 12.9% (92.5% vs. 79.6%, P<.001) and specificity by 6.9% (98.5% vs. 91.6%, P<.001). In the random sample, 63.6% (560/880) of negative images and 5.2% (140/2,710) of positive images were misclassified in the pre-correction human labels. High myopia was the primary reason R0 images were misclassified as referable DR, whereas laser spots were predominantly responsible for misclassified referable cases. The estimated label error rate for the entire dataset was 1.2%. Label correction was estimated to improve the DL algorithm's estimated sensitivity by about 12% (P<.001).
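The sample error rates quoted above follow directly from the reported counts; a trivial recomputation (using only numbers from this abstract):

```python
# Recompute the quoted misclassification rates from the raw sample counts.
neg_error = 560 / 880    # label-negative images found wrong on adjudication
pos_error = 140 / 2710   # label-positive images found wrong on adjudication
print(f"{neg_error:.1%} {pos_error:.1%}")  # 63.6% 5.2%
```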

Conclusions:

Label errors from human image grading, though small in proportion, can significantly affect the performance evaluation of DL algorithms in real-world DR screening.


