Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: May 30, 2021
Date Accepted: Aug 2, 2021

The final, peer-reviewed published version of this preprint can be found here:

Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study

Gwon H, Ahn I, Kim Y, Kang HJ, Seo H, Cho HN, Choi H, Jun TJ, Kim YH

JMIR Public Health Surveill 2021;7(10):e30824

DOI: 10.2196/30824

PMID: 34643539

PMCID: 8552097

SQMI-R: Self-training with quantile errors for multivariate missing data imputation for regression problems in electronic medical records

  • Hansle Gwon
  • Imjin Ahn
  • Yunha Kim
  • Hee Jun Kang
  • Hyeram Seo
  • Ha Na Cho
  • Heejung Choi
  • Tae Joon Jun
  • Young-Hak Kim

ABSTRACT

Background:

In real-world applications of machine learning, missing values are often the first problem encountered. Methods for imputing missing values include statistical methods such as mean imputation, expectation-maximization (EM), and multiple imputation by chained equations (MICE), and machine learning methods such as the multilayer perceptron (MLP), k-nearest neighbors (KNN), and decision trees (DT).

Objective:

The objective of this study was to impute numeric medical data, such as physical data and laboratory data. We aimed to impute data effectively in the medical field, where training data are scarce, using a progressive method called self-training.

Methods:

We propose a self-training method that gradually increases the amount of available data. Models trained on the complete data predict the missing values in the incomplete data. Incomplete records whose missing values are judged to be validly predicted are then incorporated into the complete data; using a predicted value as if it were the actual value is called pseudo-labeling. This process is repeated until a stopping condition is satisfied. The most important part of the process is evaluating the accuracy of the pseudo-labels, which can be done by observing the effect of the pseudo-labeled data on the performance of the model.
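The self-training loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the k-nearest-neighbor regressor, the single target column, the held-out validation split, the batch size, and the acceptance rule (keep a batch of pseudo-labels only if validation error does not increase) are all assumptions chosen to keep the example short, and the names `knn_predict` and `self_train_impute` are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query row as the mean target of its k nearest training rows."""
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    k = min(k, len(X_train))
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def self_train_impute(X, y, k=3, batch_size=5, max_rounds=50, seed=0):
    """Fill NaN entries of y using features X (assumed complete) by self-training.

    Assumed acceptance rule: a batch of pseudo-labels joins the training data
    only if the error on held-out observed rows does not get worse.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = rng.permutation(np.flatnonzero(~np.isnan(y)))
    n_val = max(1, len(observed) // 5)
    val, train = observed[:n_val], observed[n_val:]
    pending = np.flatnonzero(np.isnan(y))

    X_tr, y_tr = X[train], y[train]
    imputed = y.copy()
    best_err = mse(y[val], knn_predict(X_tr, y_tr, X[val], k))

    for _ in range(max_rounds):
        if len(pending) == 0:
            break
        batch, rest = pending[:batch_size], pending[batch_size:]
        # Pseudo-label the batch with the current model.
        pseudo = knn_predict(X_tr, y_tr, X[batch], k)
        X_new = np.vstack([X_tr, X[batch]])
        y_new = np.concatenate([y_tr, pseudo])
        # Evaluate the pseudo-labels by their effect on validation error.
        err = mse(y[val], knn_predict(X_new, y_new, X[val], k))
        if err <= best_err:
            X_tr, y_tr, best_err = X_new, y_new, err
            imputed[batch] = pseudo
            pending = rest
        else:
            # Rejected batches go to the back of the queue and are retried later.
            pending = np.concatenate([rest, batch])

    # Rows never accepted still receive a prediction from the final model.
    if len(pending):
        imputed[pending] = knn_predict(X_tr, y_tr, X[pending], k)
    return imputed
```

Calling `self_train_impute(X, y)` with `np.nan` marking the missing entries of `y` returns a copy in which every gap is filled and the observed entries are unchanged. A nonparametric model is used here because, with ordinary least squares, adding points that lie exactly on the fitted surface would not change the fit, so the loop would have nothing to evaluate.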

Results:

Self-training showed a lower mean squared error (MSE) and a higher Pearson correlation coefficient than conventional statistical and machine learning techniques in various situations. The results were statistically significant and, as intended, self-training was more effective when there were many missing values.
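For readers unfamiliar with the two metrics, the following sketch shows how imputed values might be scored against held-back true values. The function names `mse` and `pearson_r` are illustrative, not taken from the paper, and the sketch does not reproduce the study's evaluation protocol.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared gap between true and imputed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def pearson_r(y_true, y_pred):
    """Pearson correlation: covariance of the centered vectors over the product of their norms."""
    a = np.asarray(y_true, dtype=float)
    b = np.asarray(y_pred, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))
```

Lower MSE means the imputed values are closer to the true ones on average, while a Pearson coefficient closer to 1 means the imputed values follow the same linear trend as the true ones; the two can disagree, which is one reason to report both.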

Conclusions:

Self-training showed significant results when the predicted values were compared with the actual values, but it still needs to be verified in an actual machine learning system. Self-training also has the potential to improve further depending on the pseudo-label evaluation method, which will be the main subject of our future research.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.