Accepted for/Published in: JMIR Public Health and Surveillance
Date Submitted: May 30, 2021
Date Accepted: Aug 2, 2021
SQMI-R: Self-training with quantile errors for multivariate missing data imputation for regression problems in electronic medical records
ABSTRACT
Background:
When machine learning is applied to real-world data, missing values are typically the first problem encountered. Methods for imputing missing values include statistical methods such as mean imputation, Expectation-Maximization (EM), and Multiple Imputation by Chained Equations (MICE), and machine learning methods such as Multi-Layer Perceptron (MLP), K-Nearest Neighbor (KNN), and Decision Tree (DT).
Objective:
The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aim to effectively impute data in the medical field, where training data are scarce, using a progressive method called self-training.
Methods:
In this paper, we propose a self-training method that gradually increases the amount of available data. Models trained on the complete data predict the missing values of the incomplete data. Incomplete records whose missing values are judged to be validly predicted are then incorporated into the complete data. Using a predicted value as if it were the actual value is called pseudo-labeling. This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of the pseudo-labels. They can be evaluated by observing the effect of the pseudo-labeled data on the performance of the model.
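The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a small K-Nearest-Neighbor regressor stands in for the imputation model, a pseudo-label is accepted when adding it does not increase the error on a held-out set, and the loop stops when no candidate is accepted or after `max_rounds` rounds. The names `knn_predict`, `holdout_mse`, `self_train`, and `make_row`, the acceptance rule, and the stopping rule are all hypothetical, and the quantile-error criterion named in the title is not reproduced here.

```python
import random


def knn_predict(train, x, k=3):
    """Mean target of the k rows in `train` whose features are closest to x."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)


def holdout_mse(train, holdout, k=3):
    """Mean squared error on held-out rows for a model built from `train`."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in holdout]
    return sum(errors) / len(errors)


def self_train(complete, incomplete, holdout, k=3, max_rounds=5):
    """Grow the labeled set with pseudo-labels that do not hurt holdout error.

    complete:   list of (features, target) rows with observed targets
    incomplete: list of feature tuples whose target is missing
    holdout:    list of (features, target) rows used only for evaluation
    Returns (labeled rows, features that were never accepted).
    """
    labeled = list(complete)
    pending = list(incomplete)
    for _ in range(max_rounds):
        if not pending:
            break
        baseline = holdout_mse(labeled, holdout, k)
        accepted, rejected = [], []
        for x in pending:
            # Pseudo-label: treat the predicted value as if it were observed.
            candidate = (x, knn_predict(labeled, x, k))
            # Assumed acceptance rule: keep the candidate only if adding it
            # alone does not increase the holdout error.
            if holdout_mse(labeled + [candidate], holdout, k) <= baseline:
                accepted.append(candidate)
            else:
                rejected.append(x)
        if not accepted:
            break  # assumed stopping condition: no progress in this round
        labeled.extend(accepted)
        pending = rejected
    return labeled, pending


def make_row():
    """Synthetic row for demonstration only: target = 2*x0 + x1 + noise."""
    x = (random.random(), random.random())
    return x, 2 * x[0] + x[1] + random.gauss(0, 0.05)


random.seed(0)
complete = [make_row() for _ in range(30)]
holdout = [make_row() for _ in range(15)]
incomplete = [make_row()[0] for _ in range(20)]

labeled, pending = self_train(complete, incomplete, holdout)
print(len(labeled) - len(complete), "pseudo-labeled;", len(pending), "unresolved")
```

In this sketch each candidate is evaluated on its own against the current labeled set, so the joint effect of all pseudo-labels accepted in one round is not checked; a stricter variant could re-evaluate the holdout error after each acceptance.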
Results:
Self-training showed a lower Mean Squared Error (MSE) and a higher Pearson correlation coefficient than conventional statistical and machine learning techniques in various situations. The results were statistically significant, and as we intended, self-training was more effective when there were many missing values.
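For reference, the two evaluation metrics named above can be computed as in the following sketch, which uses the standard definitions of MSE and the Pearson correlation coefficient. The function names `mean_squared_error` and `pearson_r` and the sample values are illustrative only and are not taken from the study.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and imputed values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(actual, predicted):
    """Pearson correlation: covariance divided by the product of spreads."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Illustrative values only: true measurements versus imputed values.
actual = [5.1, 6.3, 4.8, 7.0]
imputed = [5.0, 6.5, 4.9, 6.8]
print(mean_squared_error(actual, imputed), pearson_r(actual, imputed))
```

A lower MSE means the imputed values are closer to the actual values on average, while a Pearson coefficient nearer to 1 means the imputed values track the actual values more closely in a linear sense.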
Conclusions:
Self-training showed significant results when the predicted values were compared with the actual values, but it still needs to be verified in an actual machine learning system. Self-training also has the potential to improve further depending on the pseudo-label evaluation method, which will be the main subject of our future research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.