JMIR Preprints #79307: Methods for Addressing Missingness in Electronic Health Record Data for Clinical Prediction Models: Comparative Evaluation

Methods for Addressing Missingness in Electronic Health Record Data for Clinical Prediction Models: Comparative Evaluation

Jean Digitale;
Deborah Franzon;
Mark J. Pletcher;
Charles E. McCulloch;
Efstathios D. Gennatas

ABSTRACT

Background:

Missing data is a common challenge in electronic health record (EHR)-based prediction modeling. Traditional imputation methods may not suit prediction or machine learning models, and real-world use requires workflows that can be implemented for both model development and real-time prediction.

Objective:

We evaluated methods for handling missing data when using EHR data to build clinical prediction models in pediatric intensive care unit (PICU) patients.

Methods:

Using EHR data containing missing values from an academic medical center PICU, we generated a synthetic complete dataset. From this, we created 300 datasets with missing data under varying mechanisms and proportions of missingness for the outcomes of 1) successful extubation (binary) and 2) blood pressure (continuous). We assessed strategies to address missing data including simple methods (e.g., last observation carried forward [LOCF]), complex methods (e.g., random forest multiple imputation), and native support for missing values in outcome prediction models.
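As an illustration of the simplest strategy named above, the following is a minimal sketch of LOCF for a single time-ordered series of measurements. It is not the authors' implementation: the function name `locf`, the use of `None` to mark missing values, and the choice to leave leading missing values unfilled are all assumptions made for this example.

```python
from typing import List, Optional


def locf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Carry the last observed value forward over missing entries (None).

    Assumption for this sketch: values before the first observation stay
    missing, because there is nothing to carry forward yet.
    """
    filled: List[Optional[float]] = []
    last_seen: Optional[float] = None
    for value in values:
        if value is not None:
            last_seen = value  # a new observation replaces the carried value
        filled.append(last_seen)
    return filled


# Hypothetical heart-rate series for one patient, ordered by time.
heart_rate = [None, 120.0, None, None, 115.0, None]
print(locf(heart_rate))  # [None, 120.0, 120.0, 120.0, 115.0, 115.0]
```

Because each filled value depends only on earlier observations, the same routine can in principle run at model development time and at real-time prediction, which is the workflow constraint raised in the Background.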

Results:

Across 886 patients and 1,220 intubation events, 18.2% of original data were missing. LOCF had the lowest imputation error, followed by random forest imputation (average mean squared error [MSE] improvement over mean imputation: 0.41 [range: 0.30, 0.50] and 0.33 [0.21, 0.43], respectively). LOCF generally outperformed other imputation methods across outcome metrics and models (mean improvement: 1.28% [range: -0.07%, 7.2%]). Imputation methods showed more performance variability for the binary outcome (balanced accuracy coefficient of variation [CV]: 0.042) than the continuous outcome (MSE CV: 0.001).
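The two summary quantities reported above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that "MSE improvement over mean imputation" means the baseline MSE minus the method's MSE, and that the CV is the standard deviation divided by the mean (the population form is used here). The function names and the toy numbers are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def mse(truth: Sequence[float], imputed: Sequence[float]) -> float:
    """Mean squared error between true and imputed values."""
    return mean([(t - i) ** 2 for t, i in zip(truth, imputed)])


def mse_improvement(truth, method_imputed, baseline_imputed) -> float:
    """Assumed definition: baseline MSE minus method MSE (positive is better)."""
    return mse(truth, baseline_imputed) - mse(truth, method_imputed)


def coefficient_of_variation(scores: Sequence[float]) -> float:
    """Assumed definition: population standard deviation divided by the mean."""
    return pstdev(scores) / mean(scores)


# Toy values, not data from the study.
truth = [1.0, 2.0, 3.0, 4.0]
mean_imputed = [2.5, 2.5, 2.5, 2.5]   # every value replaced by the mean
locf_imputed = [1.0, 2.0, 2.0, 4.0]   # one value carried forward incorrectly

print(mse_improvement(truth, locf_imputed, mean_imputed))  # 1.0

# Spread of a performance metric across imputation methods (toy scores).
print(coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]))
```

Under these assumed definitions, a larger improvement means lower imputation error than mean imputation, and a larger CV means that the choice of imputation method matters more for that outcome.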

Conclusions:

Traditional imputation methods for inferential statistics, such as multiple imputation, may not be optimal for prediction models. The amount of missingness influenced performance more than the missingness mechanism. In datasets with frequent measurements, LOCF and native support for missing values in machine learning models offer reasonable performance for handling missingness at minimal computational cost in predictive analyses.

Citation

Please cite as:

Digitale J, Franzon D, Pletcher MJ, McCulloch CE, Gennatas ED

Methods for Addressing Missingness in Electronic Health Record Data for Clinical Prediction Models: Comparative Evaluation

JMIR Med Inform 2025;13:e79307

DOI: 10.2196/79307

PMID: 41237368

PMCID: 12617989

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 18, 2025

Date Accepted: Oct 13, 2025
