JMIR Preprints #30824: SQMI-R: Self-training with quantile errors for multivariate missing data imputation for regression problems in electronic medical records

SQMI-R: Self-training with quantile errors for multivariate missing data imputation for regression problems in electronic medical records

Hansle Gwon;
Imjin Ahn;
Yunha Kim;
Hee Jun Kang;
Hyeram Seo;
Ha Na Cho;
Heejung Choi;
Tae Joon Jun;
Young-Hak Kim

ABSTRACT

Background:

When machine learning is applied to real-world data, missing values are the first problem encountered. Methods for imputing missing values include statistical methods such as mean imputation, expectation-maximization (EM), and multiple imputation by chained equations (MICE), as well as machine learning methods such as the multilayer perceptron (MLP), k-nearest neighbors (KNN), and decision trees (DT).

Objective:

The objective of this study was to impute numeric medical data, such as physical data and laboratory data. We aimed to impute data effectively in the medical field, where training data are scarce, by using a progressive method called self-training.

Methods:

In this paper, we propose a self-training method that gradually increases the amount of available data. Models trained on the complete data predict the missing values in the incomplete data. Incomplete records whose missing values are judged to be validly predicted are then incorporated into the complete data. Treating a predicted value as if it were the actual value is called pseudo-labeling. This process is repeated until a stopping condition is satisfied. The most important part of the process is evaluating the accuracy of the pseudo-labels, which can be done by observing the effect of the pseudo-labeled data on the performance of the model.
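The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the base model (a one-feature k-nearest-neighbor regressor), the acceptance rule (keep a pseudo-label only if it does not worsen the error on held-out complete rows), the stopping condition, and all function names (`knn_predict`, `validation_mse`, `self_train_impute`) are hypothetical choices made for the example. In particular, it does not reproduce the quantile-error criterion named in the title.

```python
from statistics import mean


def knn_predict(train, x, k=3):
    """Predict y at x as the mean target of the k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return mean(y for _, y in nearest)


def validation_mse(train, val, k=3):
    """Mean squared error of the KNN model on held-out complete rows."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in val)


def self_train_impute(complete, missing_x, val, k=3, max_rounds=10):
    """Impute missing targets by self-training (illustrative sketch).

    complete  : list of (x, y) rows with an observed target
    missing_x : list of x values whose target is missing
    val       : held-out (x, y) rows used only to judge pseudo-labels
    Returns {index into missing_x: imputed value}.
    """
    train = list(complete)
    pending = dict(enumerate(missing_x))
    imputed = {}
    for _ in range(max_rounds):
        if not pending:
            break
        baseline = validation_mse(train, val, k)
        accepted = {}
        for i, x in pending.items():
            y_hat = knn_predict(train, x, k)  # candidate pseudo-label
            # Assumed acceptance rule: keep the pseudo-label only if adding
            # it to the training data does not worsen the held-out error.
            if validation_mse(train + [(x, y_hat)], val, k) <= baseline:
                accepted[i] = (x, y_hat)
        if not accepted:
            break  # assumed stopping condition: no row qualified this round
        for i, (x, y_hat) in accepted.items():
            train.append((x, y_hat))  # pseudo-labeled row joins the complete data
            imputed[i] = y_hat
            del pending[i]
    # Rows never accepted are still imputed, using the final training set.
    for i, x in pending.items():
        imputed[i] = knn_predict(train, x, k)
    return imputed


complete = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
val = [(0.5, 0.5), (2.5, 2.5)]
imputed = self_train_impute(complete, missing_x=[1.5, 10.0], val=val)
print(imputed)
```

A nonparametric base model is used here because refitting ordinary least squares on points that already lie on the fitted line would leave the model unchanged, which would make the acceptance test trivial.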

Results:

Self-training achieved a lower mean squared error (MSE) and a higher Pearson correlation coefficient than conventional statistical and machine learning techniques in various situations. The results were statistically significant and, as intended, the method was more effective when many values were missing.
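For reference, the two evaluation metrics named above can be computed as in the short sketch below. It uses the standard definitions; the function names `mean_squared_error` and `pearson_r` are illustrative and are not taken from the paper.

```python
from math import sqrt
from statistics import mean


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and imputed values."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and imputed values.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    ma, mp = mean(actual), mean(predicted)
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)
```

A lower MSE means the imputed values are closer to the actual values on average, and a Pearson coefficient closer to 1 means the imputed values track the actual values more closely in a linear sense.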

Conclusions:

Self-training showed significant results when the predicted values were compared with the actual values, but it still needs to be verified in an actual machine learning system. Self-training also has the potential to improve further depending on the pseudo-label evaluation method, which will be the main subject of our future research.

Citation

Please cite as:

Gwon H, Ahn I, Kim Y, Kang HJ, Seo H, Cho HN, Choi H, Jun TJ, Kim YH

Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study

JMIR Public Health Surveill 2021;7(10):e30824

DOI: 10.2196/30824

PMID: 34643539

PMCID: 8552097

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: May 30, 2021

Date Accepted: Aug 2, 2021
