Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 3, 2023
Date Accepted: Oct 28, 2023

The final, peer-reviewed published version of this preprint can be found here:

Ryu KS, Kang HYJ, Batbaatar E, Choi DW, Choi KS, Ko MS

Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy

JMIR Med Inform 2023;11:e47859

DOI: 10.2196/47859

PMID: 37999942

PMCID: 10709788

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Divide-and-Conquer: Generation and Validation of Synthetic Tabular Data based on Generative Adversarial Networks in Healthcare

  • Kwang Sun Ryu; 
  • Ha Ye Jin Kang; 
  • Erdenebileg Batbaatar; 
  • Dong-Woo Choi; 
  • Kui Son Choi; 
  • Min Sam Ko

ABSTRACT

Background:

Synthetic data generation (SDG) based on generative adversarial networks (GANs) has garnered significant attention in healthcare across a variety of tasks. However, little research has addressed SDG that preserves logical relationships and yields synthetic tabular data (STD) suitable for model learning in healthcare. Several researchers have studied SDG based on filtering methods; however, record selection with such methods depends only on predefined condition columns, which may exclude meaningful information.

Objective:

The purpose of this study is to propose a divide-and-conquer (DC) approach based on GAN algorithms and use it to generate STD that preserves logical relationships for model learning.

Methods:

The DC-based SDG strategy comprises four primary components. First, we define the division criteria for training. The first criterion is “class-specific”, i.e., it depends on the class, namely the survival or death group. The second criterion uses the “Cramer’s V” correlation measure, which identifies the highest correlation between columns in the original data (OD). Second, the entire dataset is divided into several subsets according to these criteria. Third, CTGAN and CopulaGAN are trained on the two divided data subsets to generate synthetic data. Finally, the generated synthetic data are combined into a single dataset in the conquer step. For validation, the prediction performances of decision tree (DT), random forest (RF), extreme gradient-boosting (XGB), and light gradient-boosting machine (LGBM) models are compared between the proposed approach and the conditional sampling (CS) approach of CTGAN and CopulaGAN. Prediction performances are also compared between balanced and imbalanced synthetic datasets.
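As a rough illustration of the divide and conquer steps described above, the following minimal Python sketch computes Cramér's V between categorical columns, splits records by class, and concatenates per-subset samples. Everything in it is an assumption made for illustration: the toy records, the column names, the function names, and in particular `placeholder_generator`, which merely resamples rows and stands in for a trained CTGAN or CopulaGAN model. The abstract does not specify how the Cramér's V criterion is turned into a split, so the sketch only identifies the most strongly associated column pair and divides by class.

```python
import itertools
import math
import random
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V (without bias correction) for two categorical sequences."""
    n = len(xs)
    row, col, joint = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0  # undefined when a column is constant; treated as 0 here
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


def most_associated_pair(records, columns):
    """Return the column pair with the highest Cramér's V."""
    def score(pair):
        a, b = pair
        return cramers_v([r[a] for r in records], [r[b] for r in records])
    return max(itertools.combinations(columns, 2), key=score)


def divide(records, key):
    """Divide step: group records by the value of one column."""
    subsets = {}
    for rec in records:
        subsets.setdefault(rec[key], []).append(rec)
    return subsets


def placeholder_generator(subset, n_samples, rng):
    """Hypothetical stand-in for a GAN trained on `subset`: resamples rows."""
    return [dict(rng.choice(subset)) for _ in range(n_samples)]


def divide_and_conquer(records, key, n_per_subset, seed=0):
    """Generate n_per_subset rows per subset, then combine (conquer step)."""
    rng = random.Random(seed)
    synthetic = []
    for subset in divide(records, key).values():
        synthetic.extend(placeholder_generator(subset, n_per_subset, rng))
    return synthetic


# Toy records, invented for this sketch (not data from the study).
records = [
    {"stage": "I", "treatment": "surgery", "outcome": "survival"},
    {"stage": "I", "treatment": "surgery", "outcome": "survival"},
    {"stage": "I", "treatment": "surgery", "outcome": "death"},
    {"stage": "IV", "treatment": "chemo", "outcome": "death"},
    {"stage": "IV", "treatment": "chemo", "outcome": "death"},
    {"stage": "IV", "treatment": "chemo", "outcome": "survival"},
]

if __name__ == "__main__":
    print(most_associated_pair(records, ["stage", "treatment", "outcome"]))
    synthetic = divide_and_conquer(records, "outcome", n_per_subset=5)
    print(Counter(r["outcome"] for r in synthetic))
```

Because each subset is sampled separately, requesting the same number of rows per class gives a class-balanced synthetic set, and a stage/treatment combination that never occurs within a subset cannot appear in that subset's output. With a real generative model the second property would depend on how well the model learns each subset, so it should be checked rather than assumed.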

Results:

The experimental results reveal that the proposed approach exhibits more accurate prediction performance with respect to the OD than synthetic data generated using existing methods. According to the classification methods, DC-based synthetic data are of higher quality than synthetic data produced via CS; DT: CTGAN (DC: 74.5 ± 1.2 vs CS: 60.0 ± 1.3), CopulaGAN (DC: 74.9 ± 0.8 vs CS: 70.5 ± 0.8), and OD (66.1 ± 1.3); RF: CTGAN (DC: 85.6 ± 0.3 vs CS: 79.0 ± 1.2), CopulaGAN (DC: 83.9 ± 0.4 vs CS: 78.2 ± 1.7), and OD (84.8 ± 0.2); XGB: CTGAN (DC: 85.2 ± 0.8 vs CS: 74.7 ± 1.6), CopulaGAN (DC: 83.6 ± 0.7 vs CS: 76.4 ± 0.9), and OD (83.1 ± 0.4); LGBM: CTGAN (DC: 85.2 ± 0.6 vs CS: 77.8 ± 1.5), CopulaGAN (DC: 83.7 ± 0.5 vs CS: 77.6 ± 1.9), and OD (84.0 ± 0.0). Moreover, models trained with balanced STDs outperform those trained with imbalanced STDs.

Conclusions:

Besides being the first attempt to generate and validate STDs based on a DC approach while preserving logical relationships, this study demonstrates that the proposed method exhibits improved performance. The necessity for balanced synthetic data generation is also demonstrated.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.