Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 4, 2023
Date Accepted: May 12, 2023
Improving the EHR-based Clinical Prediction Model Under Label Deficiency: A Network-based Generative Adversarial Semisupervised Approach
ABSTRACT
Background:
Observational biomedical studies facilitate a new strategy for large-scale electronic health record (EHR) utilization to support precision medicine. However, data label inaccessibility is an increasingly important issue in clinical prediction, despite the use of data synthesis and semisupervised learning. Little research has aimed to uncover the underlying graphical structure of EHRs.
Objective:
A network-based generative adversarial semisupervised method is proposed. The objective is to train clinical prediction models on label-deficient EHRs to achieve comparable learning performance to supervised methods.
Methods:
Three public datasets and one colorectal cancer dataset gathered from the Second Affiliated Hospital of Zhejiang University are selected as benchmarks. The proposed models are trained on 5% to 25% labeled data and evaluated on classification metrics against conventional semisupervised and supervised methods. The data quality, model security, and memory scalability are also evaluated.
Results:
The proposed method for semisupervised classification outperforms related semisupervised methods under the same setup, with average AUCs of 0.945, 0.673, 0.611, and 0.588 across the four datasets, followed by graph-based semisupervised learning (0.450, 0.454, 0.425, and 0.5676) and label propagation (0.475, 0.344, 0.440, and 0.477). The average classification AUCs with 10% labeled data are 0.929, 0.719, 0.652, and 0.650, comparable to those of the supervised learning methods logistic regression (0.601, 0.670, 0.731, and 0.710), support vector machines (0.733, 0.720, 0.720, and 0.721), and random forests (0.982, 0.750, 0.758, and 0.740). The concerns regarding the secondary use of data and data security are alleviated by realistic data synthesis and robust privacy preservation.
Conclusions:
Training clinical prediction models on label-deficient EHRs is indispensable in data-driven research. The proposed method has great potential to exploit the intrinsic structure of EHRs and achieve comparable learning performance to supervised methods.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.