Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 3, 2023
Date Accepted: Oct 28, 2023

The final, peer-reviewed published version of this preprint can be found here:

Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy

Kang HYJ, Batbaatar E, Choi DW, Choi KS, Ko MS, Ryu KS

JMIR Med Inform 2023;11:e47859

DOI: 10.2196/47859

PMID: 37999942

PMCID: 10709788

Divide-and-Conquer: Generation and Validation of Synthetic Tabular Data based on Generative Adversarial Networks in Healthcare

  • Ha Ye Jin Kang; 
  • Erdenebileg Batbaatar; 
  • Dong-Woo Choi; 
  • Kui Son Choi; 
  • Min Sam Ko; 
  • Kwang Sun Ryu

ABSTRACT

Background:

Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but preserving logical relationships in the data when generating synthetic tabular data (STD) remains challenging. Previous studies have explored filtering methods for SDG, which can lead to the loss of important information.

Objective:

This study proposes a divide-and-conquer (DC) method for generating STD with GAN algorithms while preserving logical relationships in the data.

Methods:

The DC-based SDG strategy comprises three steps. In the first step, we use two partitioning methods: a "class-specific" criterion, which distinguishes between the survival and death groups, and Cramér's V, which identifies the most highly correlated columns in the original data. In the second step (divide), the entire dataset is divided into subsets, which are then used as input for CTGAN and CopulaGAN to generate synthetic data. In the third step (conquer), the generated synthetic data are consolidated into a single dataset. For validation, we compared DC-based SDG with conditional sampling (CS)-based SDG using the performance of machine learning models. CS is the method used in CTGAN and CopulaGAN; it relies on rejection sampling, in which rows are sampled repeatedly until the desired condition is met. Performance was also compared between balanced and unbalanced synthetic datasets.
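
The three steps above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. `toy_generator` is a hypothetical stand-in that merely resamples rows where the study fits CTGAN or CopulaGAN, the column names (`status`, `stage`) and the six rows are invented for illustration, and the use of Cramér's V to judge a candidate partition column is one reading of the description above. `cs_generate` shows rejection sampling in the same simplified setting for contrast.

```python
import math
import random
from collections import Counter, defaultdict

def cramers_v(xs, ys):
    """Cramér's V between two categorical columns (no bias correction)."""
    n = len(xs)
    joint, row, col = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

def divide(rows, key):
    """Divide step: one subset per distinct value of the partition column."""
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[key]].append(r)
    return dict(subsets)

def toy_generator(subset, n, rng):
    """ASSUMPTION: placeholder for a generator fitted on `subset`.
    It bootstraps existing rows; it is NOT a GAN."""
    return [dict(rng.choice(subset)) for _ in range(n)]

def dc_generate(rows, key, n_per_subset, rng):
    """Generate per subset, then concatenate (conquer step)."""
    synthetic = []
    for subset in divide(rows, key).values():
        synthetic.extend(toy_generator(subset, n_per_subset, rng))
    return synthetic

def cs_generate(rows, key, value, n, rng, max_tries=10_000):
    """Rejection sampling: draw from a generator fitted on ALL rows and
    keep only draws that satisfy the condition."""
    kept, tries = [], 0
    while len(kept) < n and tries < max_tries:
        candidate = toy_generator(rows, 1, rng)[0]
        if candidate[key] == value:
            kept.append(candidate)
        tries += 1
    return kept

# Hypothetical toy data: stage fully determines status in these rows.
rows = [
    {"status": "survival", "stage": "I"},
    {"status": "survival", "stage": "I"},
    {"status": "survival", "stage": "II"},
    {"status": "death", "stage": "IV"},
    {"status": "death", "stage": "IV"},
    {"status": "death", "stage": "III"},
]

v = cramers_v([r["status"] for r in rows], [r["stage"] for r in rows])
rng = random.Random(0)
dc_rows = dc_generate(rows, "status", 5, rng)          # 5 rows per class
cs_rows = cs_generate(rows, "status", "death", 5, rng)  # 5 accepted draws
```

Requesting the same number of rows from every subset yields a class-balanced synthetic dataset, which is one simple way to obtain the balanced setting the study evaluates. In this toy the stand-in generator cannot break the stage-to-status relationship; with a real generative model, that is the property the DC partitioning is meant to protect.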

Results:

The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and two benchmark datasets (breast and diabetes). We compared the synthetic data generated by CTGAN and CopulaGAN to see which resulted in the highest performance. For all three diseases, the synthetic data generated with the proposed DC method yielded higher performance than the CS-based synthetic data across the four classifiers. Using synthetic datasets from CTGAN and CopulaGAN with CS or DC, we compared the best model performance based on the area under the curve (AUC). Decision tree (DT): KALC-R, 74.87±0.77 (DC) vs 63.87±2.02 (CS); breast, 73.31±1.11 (DC) vs 67.96±2.15 (CS); diabetes, 61.57±0.09 (DC) vs 60.08±0.17 (CS). Random forest (RF): KALC-R, 85.61±0.29 (DC) vs 79.01±1.20 (CS); breast, 78.05±1.59 (DC) vs 73.48±4.73 (CS); diabetes, 59.98±0.24 (DC) vs 58.55±0.17 (CS). Extreme gradient boosting (XGBoost): KALC-R, 85.20±0.82 (DC) vs 76.42±0.93 (CS); breast, 77.86±2.27 (DC) vs 68.32±2.37 (CS); diabetes, 60.18±0.20 (DC) vs 58.98±0.29 (CS). Light gradient-boosting machine (LGBM): KALC-R, 85.14±0.77 (DC) vs 77.62±1.85 (CS); breast, 78.16±1.52 (DC) vs 70.02±2.17 (CS); diabetes, 61.75±0.13 (DC) vs 61.12±0.23 (CS). We also generated unbalanced and balanced synthetic data for each of the three datasets and compared their performance using the DT, RF, XGBoost, and LGBM models; the balanced synthetic data performed better.
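
For readers unfamiliar with how figures of the form "74.87±0.77" are typically produced, the following Python sketch computes an AUC from labels and scores and summarizes several runs. It rests on two assumptions that the abstract does not state: that the reported values are AUC multiplied by 100, and that "±" denotes the sample standard deviation over repeated runs. The function names and the numbers in the usage lines are hypothetical and are not results from the study.

```python
from statistics import mean, stdev

def auc(labels, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive example receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def summarize(aucs):
    """Format repeated-run AUCs as 'mean±SD' on a 0-100 scale (ASSUMED format)."""
    return f"{100 * mean(aucs):.2f}±{100 * stdev(aucs):.2f}"

# Toy usage with invented values.
single_run = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 3 of 4 pairs ranked correctly
report = summarize([0.70, 0.80])
```

In an evaluation of the kind described above, each run's scores would come from a classifier trained on synthetic data, so the summary reflects variation across runs.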

Conclusions:

This study is the first attempt to generate and validate STD based on a DC approach, and the approach shows improved performance. The necessity of balanced synthetic data generation is also demonstrated.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.