JMIR Preprints #47859: Divide-and-Conquer: Generation and Validation of Synthetic Tabular Data based on Generative Adversarial Networks in Healthcare

Divide-and-Conquer: Generation and Validation of Synthetic Tabular Data based on Generative Adversarial Networks in Healthcare

Ha Ye Jin Kang;
Erdenebileg Batbaatar;
Dong-Woo Choi;
Kui Son Choi;
Min Sam Ko;
Kwang Sun Ryu

ABSTRACT

Background:

Synthetic data generation (SDG) using generative adversarial networks (GANs) is used in healthcare, but preserving logical relationships among columns in synthetic tabular data (STD) remains challenging. Previous studies have explored filtering methods for SDG, which can lead to the loss of important information.

Objective:

This study proposes the divide-and-conquer (DC) method for generating STD based on the GAN algorithm while preserving data with logical relationships.

Methods:

The DC-based SDG strategy comprises three steps. In the first step, we apply two partitioning methods: a "class-specific" criterion, which distinguishes the survival and death groups, and Cramér's V, which identifies the most highly correlated columns in the original data. In the second step, the entire dataset is divided into subsets, which are used as input for CTGAN and CopulaGAN to generate synthetic data. In the third (conquer) step, the generated synthetic data are consolidated into a single dataset. For validation, we compared DC-based SDG with conditional sampling (CS)-based SDG through the performance of machine learning models. CS is the conditioning method used in CTGAN and CopulaGAN; it relies on rejection sampling, in which rows are sampled repeatedly until the desired condition is met. Performance was also compared on balanced and unbalanced synthetic datasets.
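The three steps and the CS baseline can be illustrated with a minimal sketch. It is not the authors' implementation. The toy dataset, its column names (status, smoker, pack_years), and all function names are hypothetical. toy_generator is a stand-in for a fitted CTGAN or CopulaGAN model and merely resamples each column independently. The choice to partition on the class label plus one column of the most associated pair is an assumption, since the abstract does not state how the two partitioning methods are combined.

```python
import itertools
import math
import random
from collections import Counter, defaultdict


def cramers_v(xs, ys):
    """Cramér's V (no bias correction) between two categorical columns."""
    n = len(xs)
    row_tot, col_tot = Counter(xs), Counter(ys)
    k = min(len(row_tot), len(col_tot))
    if n == 0 or k < 2:
        return 0.0  # treat a constant column as having no association
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, r_count in row_tot.items():
        for c, c_count in col_tot.items():
            expected = r_count * c_count / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def most_associated_pair(rows, columns):
    """Return the column pair with the highest Cramér's V."""
    def score(pair):
        a, b = pair
        return cramers_v([r[a] for r in rows], [r[b] for r in rows])
    return max(itertools.combinations(columns, 2), key=score)


def divide(rows, keys):
    """Divide step: group rows that share the same values for `keys`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[tuple(row[k] for k in keys)].append(row)
    return subsets


def toy_generator(subset, n, rng):
    """Hypothetical stand-in for a GAN: resamples each column independently."""
    columns = list(subset[0].keys())
    return [{c: rng.choice(subset)[c] for c in columns} for _ in range(n)]


def dc_generate(rows, keys, n_per_subset, rng):
    """Generate per subset, then concatenate the outputs (conquer step)."""
    synthetic = []
    for subset in divide(rows, keys).values():
        synthetic.extend(toy_generator(subset, n_per_subset, rng))
    return synthetic


def cs_generate(rows, condition, n, rng, max_tries=10_000):
    """CS baseline: rejection-sample one global generator until n rows pass."""
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n:
            break
        row = toy_generator(rows, 1, rng)[0]
        if condition(row):
            accepted.append(row)
    return accepted


# Toy data with a logical rule: non-smokers always have pack_years == "0".
rows = [
    {"status": "survival", "smoker": "no", "pack_years": "0"},
    {"status": "survival", "smoker": "no", "pack_years": "0"},
    {"status": "survival", "smoker": "yes", "pack_years": "1-20"},
    {"status": "death", "smoker": "yes", "pack_years": "20+"},
    {"status": "death", "smoker": "yes", "pack_years": "1-20"},
    {"status": "death", "smoker": "no", "pack_years": "0"},
]
rng = random.Random(0)
pair = most_associated_pair(rows, ["status", "smoker", "pack_years"])
keys = ("status", pair[0])  # assumed combination of the two partitionings
synthetic = dc_generate(rows, keys, n_per_subset=5, rng=rng)
deaths = cs_generate(rows, lambda r: r["status"] == "death", n=5, rng=rng)
```

In this sketch the partition columns are constant within each subset, so the rule linking smoker and pack_years holds in the DC output by construction, and requesting the same number of rows per subset yields class-balanced data. The CS baseline draws from a single generator over all rows and only filters on the requested condition, so with this toy generator nothing prevents its rows from violating the rule.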

Results:

The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and two benchmark datasets (breast and diabetes). We compared the synthetic data generated by CTGAN and CopulaGAN to determine which yielded the highest performance. For all three datasets, synthetic data generated by the proposed DC method yielded higher performance than CS-based synthetic data across all four classifiers. Using synthetic datasets from CTGAN and CopulaGAN with CS or DC, the best model performance based on the area under the curve was as follows:

Decision tree (DT): KALC-R, 74.87±0.77 (DC) vs 63.87±2.02 (CS); breast, 73.31±1.11 (DC) vs 67.96±2.15 (CS); diabetes, 61.57±0.09 (DC) vs 60.08±0.17 (CS).

Random forest (RF): KALC-R, 85.61±0.29 (DC) vs 79.01±1.20 (CS); breast, 78.05±1.59 (DC) vs 73.48±4.73 (CS); diabetes, 59.98±0.24 (DC) vs 58.55±0.17 (CS).

Extreme gradient-boosting (XGBoost): KALC-R, 85.20±0.82 (DC) vs 76.42±0.93 (CS); breast, 77.86±2.27 (DC) vs 68.32±2.37 (CS); diabetes, 60.18±0.20 (DC) vs 58.98±0.29 (CS).

Light gradient-boosting machine (LGBM): KALC-R, 85.14±0.77 (DC) vs 77.62±1.85 (CS); breast, 78.16±1.52 (DC) vs 70.02±2.17 (CS); diabetes, 61.75±0.13 (DC) vs 61.12±0.23 (CS).

We also generated unbalanced and balanced synthetic data for each of the three datasets and compared their performance using the DT, RF, XGBoost, and LGBM models; balanced synthetic data performed better.

Conclusions:

This study is the first attempt to generate and validate STD based on a DC approach, and it shows improved performance. It also demonstrates the need for balanced synthetic data generation.

Citation

Please cite as:

Kang HYJ, Batbaatar E, Choi DW, Choi KS, Ko MS, Ryu KS

Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy

JMIR Med Inform 2023;11:e47859

DOI: 10.2196/47859

PMID: 37999942

PMCID: 10709788

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 3, 2023

Date Accepted: Oct 28, 2023
