Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 18, 2025
Date Accepted: Oct 13, 2025
Methods for Addressing Missingness in Electronic Health Record Data for Clinical Prediction Models: Comparative Evaluation
ABSTRACT
Background:
Missing data is a common challenge in electronic health record (EHR)-based prediction modeling. Traditional imputation methods may not suit prediction or machine learning models, and real-world use requires workflows that can be implemented for both model development and real-time prediction.
Objective:
We evaluated methods for handling missing data when using EHR data to build clinical prediction models in pediatric intensive care unit (PICU) patients.
Methods:
Using EHR data containing missing values from an academic medical center PICU, we generated a synthetic complete dataset. From this, we created 300 datasets with missing data under varying mechanisms and proportions of missingness for the outcomes of 1) successful extubation (binary) and 2) blood pressure (continuous). We assessed strategies to address missing data including simple methods (e.g., last observation carried forward [LOCF]), complex methods (e.g., random forest multiple imputation), and native support for missing values in outcome prediction models.
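The evaluation loop described above (mask a complete dataset, impute, and score the imputed values against the known truth) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' code: the signal is a hypothetical random walk rather than PICU data, missingness is completely at random at a fixed 20%, and the function names are invented. Only LOCF and mean imputation are shown.

```python
import random

def mask_mcar(values, proportion, rng):
    """Set a random subset of entries to None (missing completely at random).
    The first entry stays observed so LOCF always has a value to carry."""
    masked = list(values)
    candidates = list(range(1, len(values)))
    n_missing = round(proportion * len(candidates))
    for i in rng.sample(candidates, n_missing):
        masked[i] = None
    return masked

def impute_locf(values):
    """Last observation carried forward: fill each gap with the most
    recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def impute_mean(values):
    """Fill every gap with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def imputation_mse(truth, masked, imputed):
    """Mean squared error over the masked entries only."""
    errors = [(t - i) ** 2 for t, m, i in zip(truth, masked, imputed) if m is None]
    return sum(errors) / len(errors)

rng = random.Random(0)

# Hypothetical frequently measured signal: a slowly drifting random walk.
truth = [70.0]
for _ in range(199):
    truth.append(truth[-1] + rng.gauss(0, 0.5))

masked = mask_mcar(truth, 0.2, rng)
mse_locf = imputation_mse(truth, masked, impute_locf(masked))
mse_mean = imputation_mse(truth, masked, impute_mean(masked))

# Improvement over mean imputation, taken here as a simple difference
# (an assumption; the paper may define it differently).
improvement = mse_mean - mse_locf
print(f"LOCF MSE: {mse_locf:.3f}, mean-imputation MSE: {mse_mean:.3f}")
```

Scoring only the masked entries keeps the comparison focused on imputation quality, since observed entries are reproduced exactly by both methods.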
Results:
Across 886 patients and 1,220 intubation events, 18.2% of original data were missing. LOCF had the lowest imputation error, followed by random forest imputation (average mean squared error [MSE] improvement over mean imputation: 0.41 [range: 0.30, 0.50] and 0.33 [0.21, 0.43], respectively). LOCF generally outperformed other imputation methods across outcome metrics and models (mean improvement: 1.28% [range: -0.07%, 7.2%]). Imputation methods showed more performance variability for the binary outcome (balanced accuracy coefficient of variation [CV]: 0.042) than the continuous outcome (MSE CV: 0.001).
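For readers unfamiliar with the coefficient of variation used to compare variability across imputation methods, a minimal sketch follows. The metric values are hypothetical placeholders, not results from this study, and the use of the sample standard deviation is an assumption about the calculation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical balanced accuracies, one per imputation method
# (illustrative values only, not taken from the paper).
balanced_accuracy = {
    "locf": 0.74,
    "mean": 0.70,
    "random_forest_mi": 0.72,
    "native_missing_support": 0.73,
}

cv = coefficient_of_variation(list(balanced_accuracy.values()))
print(f"CV across imputation methods: {cv:.3f}")
```

A larger CV means the choice of imputation method matters more for that outcome metric; a CV near zero means the methods perform almost identically.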
Conclusions:
Traditional imputation methods for inferential statistics, such as multiple imputation, may not be optimal for prediction models. The amount of missingness influenced performance more than the missingness mechanism. In datasets with frequent measurements, LOCF and native support for missing values in machine learning models offer reasonable performance for handling missingness at minimal computational cost in predictive analyses.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.