Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 15, 2024
Date Accepted: Feb 8, 2025

The final, peer-reviewed published version of this preprint can be found here:

Imputation and Missing Indicators for Handling Missing Longitudinal Data: Data Simulation Analysis Based on Electronic Health Record Data

Ehrig M, Bullock GS, Leng XI, Pajewski NM, Speiser JL

Imputation and Missing Indicators for Handling Missing Longitudinal Data: Data Simulation Analysis Based on Electronic Health Record Data

JMIR Med Inform 2025;13:e64354

DOI: 10.2196/64354

PMID: 40080075

PMCID: 11924964

Imputation and missing indicators for handling missing longitudinal data: A simulation study based on electronic health record data

  • Molly Ehrig
  • Garrett S Bullock
  • Xiaoyan Iris Leng
  • Nicholas M Pajewski
  • Jaime Lynn Speiser

ABSTRACT

Background:

Missing data are highly prevalent in electronic health records (EHRs) and raise analytical concerns such as heterogeneous sources of bias and loss of statistical power. One simple analytic method for addressing missing or unknown covariate values is to treat missingness for a particular variable as a category unto itself, which we refer to as the missing indicator method. For cross-sectional analyses, recent work suggested that there was minimal benefit to the missing indicator method; however, it is unclear how this approach performs in the setting of longitudinal data, in which correlation among clustered repeated measures may be leveraged for potentially improved model performance.
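As an illustration only, the missing indicator idea can be sketched in a few lines of Python. The function names, the example values, and the choice of fill value are assumptions made for this sketch and are not taken from the study. For a categorical variable, missingness becomes its own category; for a continuous variable, a binary flag is added alongside a filled-in value.

```python
def missing_as_category(values, label="Missing"):
    """Categorical case: treat a missing value (None) as its own category."""
    return [label if v is None else v for v in values]


def add_missing_indicator(values, fill):
    """Continuous case: return (filled values, 0/1 missingness flags).

    `fill` is whatever value the analyst substitutes for a missing entry,
    for example an imputed value (a hypothetical choice in this sketch).
    """
    flags = [1 if v is None else 0 for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags


# Hypothetical repeated measures for one patient across annual visits.
medication_use = ["yes", None, "no", None]
bmi = [27.1, None, 31.4, None]

med_with_category = missing_as_category(medication_use)
bmi_filled, bmi_missing = add_missing_indicator(bmi, fill=29.25)
```

In a prediction model, the flag (here `bmi_missing`) would enter as an additional predictor next to the filled variable, so the model can learn whether the fact of missingness itself carries information.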

Objective:

We aimed to assess the missing indicator method for longitudinal, repeated measures data using a simulation study mimicking real-world EHR data.

Methods:

We conducted a simulation study to evaluate whether the missing indicator method improved model performance and imputation accuracy for longitudinal data, mimicking the development of a clinical prediction model for falls in older adults based on EHR data. We simulated a longitudinal binary outcome using mixed effects logistic regression that emulated a falls assessment at annual follow-up visits. Using multivariate imputation by chained equations, we simulated time-invariant predictors such as sex and medical history, as well as dynamic predictors such as physical function, body mass index, and medication use. We induced missing data in predictors under scenarios with random missingness (missing at random, MAR) and dependent missingness (missing not at random, MNAR). Across simulation replicates, we evaluated aggregate performance using the area under the curve for models with and without missing indicators as predictors, as well as for complete case analysis. We evaluated imputation quality using normalized root mean square error for continuous variables and percent falsely classified for categorical variables.
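The following is a minimal sketch of how missingness mechanisms and imputation-quality measures of this kind could be coded; it is not the authors' implementation. The logistic missingness model, the coefficients, the variable names, and the normalization of the root mean square error by the variance of the true values (one common convention) are all assumptions, since the abstract does not specify them.

```python
import math
import random


def induce_missing(rows, target, driver, intercept, slope, rng):
    """Return copies of rows where row[target] is set to None with
    probability logistic(intercept + slope * row[driver]).

    MAR-style: driver is another, fully observed variable.
    MNAR-style: driver is the target variable itself.
    """
    out = []
    for row in rows:
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * row[driver])))
        new_row = dict(row)  # copy so the complete data stay available
        if rng.random() < p_missing:
            new_row[target] = None
        out.append(new_row)
    return out


def nrmse(true_vals, imputed_vals):
    """Root mean square error normalized by the variance of the true values
    (assumed convention): sqrt(mean((true - imputed)^2) / var(true))."""
    n = len(true_vals)
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    return math.sqrt(mse / var_true)


def pfc(true_labels, imputed_labels):
    """Proportion of imputed categorical entries that differ from the truth."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Hypothetical complete data: age drives MAR missingness, BMI drives MNAR.
rng = random.Random(0)
rows = [{"age": rng.uniform(65, 90), "bmi": rng.gauss(28, 4)} for _ in range(200)]
mar = induce_missing(rows, "bmi", "age", intercept=-8.0, slope=0.1, rng=rng)
mnar = induce_missing(rows, "bmi", "bmi", intercept=-6.0, slope=0.2, rng=rng)
```

Both metrics would be computed only over the entries that were set to missing and then imputed, comparing each imputed value with the known simulated truth.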

Results:

Independent of the mechanism used to simulate missing data (MAR or MNAR), overall model performance via area under the curve was similar regardless of whether missing indicators were included in the model. The root mean square error and percent falsely classified measures were similar for models including missing indicators versus those without missing indicators. Model performance and imputation quality were similar regardless of whether the outcome was related to missingness. Imputation with or without missing indicators had similar mean values of area under the curve compared to complete case analysis, although complete case analysis had the largest range of values.

Conclusions:

The results of this study suggest that the inclusion of missing indicators in longitudinal data modeling neither improves nor worsens overall performance or imputation accuracy. Future research is needed to address whether the inclusion of missing indicators is useful in prediction modeling with longitudinal data in different settings, such as high-dimensional data analysis.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.