Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 12, 2022
Date Accepted: Dec 6, 2022

The final, peer-reviewed published version of this preprint can be found here:

Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach

Chen X, Chen H, Nan S, Kong X, Duan H, Zhu H

Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach

JMIR Med Inform 2023;11:e38590

DOI: 10.2196/38590

PMID: 36662548

PMCID: 9898833

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Dealing with the Missing, Imbalanced and Sparse Features Problems in Emergency Data Using Random Forest, K-means and PCA Respectively

  • Xiaojie Chen
  • Han Chen
  • Shan Nan
  • Xiangtian Kong
  • Huilong Duan
  • Haiyan Zhu

ABSTRACT

Background:

In emergency departments (EDs), timely rescue is critical because patients’ conditions often deteriorate rapidly, and early diagnosis can increase patients’ chances of survival. Machine learning models trained on electronic medical record (EMR) data can support early diagnosis. However, ED data are usually imbalanced and contain missing values and sparse features. These data quality issues make it challenging to build early identification models for diseases in the ED.

Objective:

This study aims to propose a systematic approach for dealing with missing values, class imbalance, and sparse features in ED data.

Methods:

We used a random forest algorithm to interpolate (impute) missing values and the K-means algorithm to under-sample the imbalanced data. To address sparse features, we used principal component analysis (PCA) to reduce dimensionality. Interpolation performance was evaluated with the coefficient of determination (R²) for continuous variables and the kappa coefficient for discrete variables. The area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPRC) were used to estimate model performance. To further evaluate the proposed approach, we carried out a case study using an ED data set extracted from Hainan Hospital of Chinese PLA General Hospital. A logistic regression model for predicting worsening of patient condition was built from the data processed by the proposed approach.
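The steps described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, runs on synthetic data, and the helper names `rf_impute_column` and `kmeans_undersample`, the temporary mean-fill of predictor columns, the "nearest row to each centroid" selection rule, and all parameter values are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, r2_score, roc_auc_score
from sklearn.model_selection import train_test_split


def rf_impute_column(X, col, seed=0):
    """Fill NaNs in one continuous column with a random forest fit on the other columns."""
    X = X.copy()
    missing = np.isnan(X[:, col])
    if not missing.any() or missing.all():
        return X
    others = np.delete(X, col, axis=1)
    # Assumption: predictor columns are mean-filled so they can be used as inputs.
    others = np.where(np.isnan(others), np.nanmean(others, axis=0), others)
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(others[~missing], X[~missing, col])
    X[missing, col] = model.predict(others[missing])
    return X


def kmeans_undersample(X_major, n_keep, seed=0):
    """Keep the majority-class row closest to each of n_keep K-means centroids."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(X_major)
    dists = km.transform(X_major)  # distances, shape (n_rows, n_clusters)
    return X_major[np.unique(dists.argmin(axis=0))]


# Synthetic, imbalanced data (400 majority rows, 40 minority rows, 8 features).
rng = np.random.default_rng(0)
X_true = np.vstack([rng.normal(0.0, 1.0, (400, 8)), rng.normal(1.0, 1.0, (40, 8))])
X_true[:, 0] = X_true[:, 1] + 0.5 * X_true[:, 2] + rng.normal(0.0, 0.3, len(X_true))
y = np.r_[np.zeros(400, dtype=int), np.ones(40, dtype=int)]

# Hide about 10% of column 0, impute it, and score the imputation with R^2.
hidden = rng.random(len(X_true)) < 0.1
X_obs = X_true.copy()
X_obs[hidden, 0] = np.nan
X_filled = rf_impute_column(X_obs, col=0)
r2 = r2_score(X_true[hidden, 0], X_filled[hidden, 0])

# Simplification: imputation happens before the split; a stricter evaluation
# would fit the imputer on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_filled, y, test_size=0.3, stratify=y, random_state=0
)

# Under-sample the majority class in the training set only.
minor_tr = X_tr[y_tr == 1]
major_tr = kmeans_undersample(X_tr[y_tr == 0], n_keep=len(minor_tr))
X_bal = np.vstack([major_tr, minor_tr])
y_bal = np.r_[np.zeros(len(major_tr), dtype=int), np.ones(len(minor_tr), dtype=int)]

# Reduce dimensionality with PCA, then fit logistic regression.
pca = PCA(n_components=4).fit(X_bal)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(X_bal), y_bal)
scores = clf.predict_proba(pca.transform(X_te))[:, 1]

auc = roc_auc_score(y_te, scores)
auprc = average_precision_score(y_te, scores)  # average precision, a common AUPRC estimate
print(f"imputation R^2={r2:.3f}  AUC={auc:.3f}  AUPRC={auprc:.3f}")
```

For a discrete variable, the same idea would presumably swap in `RandomForestClassifier` and score the hidden entries with `sklearn.metrics.cohen_kappa_score` instead of R².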

Results:

Data were collected from 1085 patients with a rescue record and 17959 patients without one, a markedly imbalanced distribution. A total of 275, 402, and 891 variables were extracted from laboratory tests, medications, and diagnoses, respectively. After data preprocessing, the median R² of random forest interpolation for continuous variables was 0.623 (IQR 0.647), and the median kappa coefficient for discrete variable interpolation was 0.444 (IQR 0.285). The logistic regression model built from the initial diagnostic data performed poorly and showed variable separation, reflected in abnormally high odds ratios (ORs) for the cardiac arrest and respiratory arrest variables (27857.4 and 9341.6, respectively) and abnormal confidence intervals. With the processed data, the model reached a recall of 0.77, an F1 score of 0.74, and an AUC of 0.64.
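To put the reported counts and metrics in context, the short calculation below derives the class ratio, the share of patients with a rescue record, and the precision implied by the stated recall and F1 score. It is only arithmetic on the rounded figures given above, and it assumes that recall and F1 refer to the same class and decision threshold, which the abstract does not state. The variable names are illustrative.

```python
# Counts reported in the Results.
n_rescue = 1085
n_no_rescue = 17959

# Class ratio (majority : minority) and share of the minority class.
ratio = n_no_rescue / n_rescue
prevalence = n_rescue / (n_rescue + n_no_rescue)

# F1 = 2 * P * R / (P + R)  =>  P = F1 * R / (2 * R - F1)
recall, f1 = 0.77, 0.74
implied_precision = f1 * recall / (2 * recall - f1)

print(f"majority:minority ratio  ~ {ratio:.2f}:1")
print(f"rescue-record prevalence ~ {prevalence:.3f}")
print(f"implied precision        ~ {implied_precision:.2f}")
```

Because the inputs are rounded to two decimals, the implied precision is approximate.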

Conclusions:

We proposed a machine learning approach for dealing with data quality issues in emergency data, namely missing data, data imbalance, and sparse features, to improve data availability. A preliminary case study indicates that data processed with the proposed method can be used to build prediction models for emergency patients.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.