Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Feb 21, 2025
Open Peer Review Period: Mar 3, 2025 - Apr 28, 2025
Date Accepted: Jun 17, 2025

The final, peer-reviewed published version of this preprint can be found here:

Label Accuracy in Electronic Health Records and Its Impact on Machine Learning Models for Early Prediction of Gestational Diabetes: 3-Step Retrospective Validation Study

Germaine M, O'Higgins AC, Egan B, Healy G

JMIR Med Inform 2025;13:e72938

DOI: 10.2196/72938

PMID: 40854223

PMCID: 12377786

Gestational Diabetes Diagnoses in Electronic Health Records: A Three-Step Study of Label Accuracy and Its Impact on Machine Learning Models for Early Prediction

  • Mark Germaine
  • Amy C O'Higgins
  • Brendan Egan
  • Graham Healy

ABSTRACT

Background:

Integrating electronic health records (EHRs) into clinical research offers opportunities to advance healthcare delivery and patient outcomes, particularly in the era of machine learning (ML). However, EHR data must be coded accurately so that models learn correct representations of diseases.

Objective:

This study examines the accuracy of gestational diabetes mellitus (GDM) diagnoses in EHRs compared with a clinical team database (CTD) and their impact on ML models.

Methods:

EHRs from 2018-2022 were validated against CTD data to identify true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Logistic regression (LR) models were trained and tested using both EHR and validated labels, after which simulated label noise was introduced to increase FP and FN rates. Model performance was assessed using the area under the receiver operating characteristic curve (ROC-AUC) and average precision (AP).
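The validation step described above can be illustrated with a minimal sketch. It assumes each patient has a binary GDM label in both sources and treats the CTD label as the reference standard. The function name `confusion_counts` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def confusion_counts(ehr_labels, ctd_labels):
    """Count TP/FP/TN/FN for EHR labels, using CTD labels as the reference.

    Both inputs are sequences of 0/1 flags (1 = GDM recorded), aligned by patient.
    """
    if len(ehr_labels) != len(ctd_labels):
        raise ValueError("label sequences must have the same length")

    # Map each (EHR label, CTD label) pair to its confusion-matrix cell.
    names = {(1, 1): "TP", (1, 0): "FP", (0, 0): "TN", (0, 1): "FN"}
    counts = Counter({name: 0 for name in names.values()})
    for ehr, ctd in zip(ehr_labels, ctd_labels):
        counts[names[(ehr, ctd)]] += 1
    return dict(counts)


# Toy data for illustration only (not study data).
ehr = [1, 1, 0, 0, 0, 1, 0, 0]
ctd = [1, 0, 0, 1, 0, 1, 0, 0]
counts = confusion_counts(ehr, ctd)
print(counts)
```

Dividing each count by the number of patients gives proportions of the kind reported in the Results.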

Results:

Among 3,952 patients, 3,388 (85.7%) were correctly identified with GDM in both databases, while 564 cases lacked a GDM label in EHRs and 771 were missing a corresponding CTD label. Overall, 87.5% of cases were TN, 9.0% TP, 2.0% FP, and 1.5% FN. The model trained and tested with validated labels achieved a ROC-AUC of 0.817 and an AP of 0.450, whereas the same model tested using EHR labels achieved 0.814 and 0.395, respectively. Increased label noise during training led to gradual declines in ROC-AUC and AP, while noise in the test set, especially elevated FP rates, resulted in marked performance drops.
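A small worked example may help show why noise in the test labels alters the reported metrics even when the predictions are unchanged. The sketch below is not the study's pipeline: it uses hand-picked toy scores in place of a fitted LR model, and the helpers `roc_auc`, `average_precision`, and `flip_labels` are hypothetical pure-Python implementations written for illustration. It scores the same predictions against validated labels and against labels with one injected false positive.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values taken at the rank of each positive label.

    Simplified: assumes distinct scores (no special handling of ties).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


def flip_labels(labels, indices):
    """Return a copy of `labels` with the entries at `indices` flipped (0 <-> 1)."""
    to_flip = set(indices)
    return [1 - y if i in to_flip else y for i, y in enumerate(labels)]


# Toy predictions and validated labels (illustrative, not study data).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
y_valid = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Inject one false positive: a true negative with a low score is labelled GDM.
y_noisy = flip_labels(y_valid, [8])

for name, labels in [("validated", y_valid), ("noisy", y_noisy)]:
    print(name, round(roc_auc(labels, scores), 3),
          round(average_precision(labels, scores), 3))
```

In this toy case both metrics fall once the false positive is added, although the scores themselves never change. The direction and size of the effect on real data depend on which labels are wrong and how the model ranks those patients.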

Conclusions:

Discrepancies between EHR and CTD diagnoses had limited impact on model training but significantly affected performance evaluation when present in the test set, emphasising the importance of accurate data validation.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.