Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 3, 2023
Date Accepted: May 8, 2024
Date Submitted to PubMed: May 23, 2024

The final, peer-reviewed published version of this preprint can be found here:

Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study

Akiya I, Ishihara T, Yamamoto K

JMIR Med Inform 2024;12:e55118

DOI: 10.2196/55118

PMID: 38889082

PMCID: 11196245

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study

  • Ippei Akiya
  • Takuma Ishihara
  • Keiichi Yamamoto

ABSTRACT

Introduction: Synthetic patient data (SPD) generation for survival analysis in oncology trials holds significant potential for accelerating clinical development. Various machine learning methods, including classification and regression trees (CART), random forest (RF), Bayesian network (BN), and the conditional tabular generative adversarial network (CTGAN), have been employed for this purpose, but how well they reproduce actual patient survival data remains under investigation.

Method: Using multiple clinical trial datasets, survival SPD was generated and evaluated using mean survival time (MST), hazard ratio distance (HRD), and visual analysis of Kaplan‒Meier (KM) plots. Each method's ability to mimic the statistical profile of real patient data was compared.
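As a rough illustration of the kind of comparison described in the Method, the sketch below estimates a Kaplan-Meier curve and a mean survival time restricted to a fixed horizon, then compares actual and synthetic data. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not define how MST was computed, so the restricted-mean definition, the function names `kaplan_meier` and `restricted_mean_survival`, the horizon, and all data values are hypothetical.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return the KM curve as (event_time, survival_probability) steps.

    events[i] is 1 for an observed event and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def restricted_mean_survival(curve: List[Tuple[float, float]], horizon: float) -> float:
    """Area under the KM step function from time 0 to `horizon`."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    area += prev_s * (horizon - prev_t)
    return area


# Made-up toy data (months, event flag); not taken from any trial.
actual_times, actual_events = [5, 8, 12, 12, 20, 25], [1, 1, 0, 1, 1, 0]
synth_times, synth_events = [4, 9, 11, 14, 19, 26], [1, 1, 1, 0, 1, 0]

horizon = 24.0  # hypothetical truncation time
mst_actual = restricted_mean_survival(kaplan_meier(actual_times, actual_events), horizon)
mst_synth = restricted_mean_survival(kaplan_meier(synth_times, synth_events), horizon)
print(f"actual={mst_actual:.2f} synthetic={mst_synth:.2f} gap={abs(mst_actual - mst_synth):.2f}")
```

A small gap between the two values suggests that the synthetic data reproduce the average survival of the actual data up to the chosen horizon; it says nothing about the shape of the curve, which is why a visual check of the KM plots is listed alongside it.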

Results: CART consistently demonstrated promising results across the evaluation metrics, outperforming RF, BN, and CTGAN. Although RF is known for its high generalization performance, CART produced SPD that more closely resembled the actual data, which underlines the importance of similarity in SPD generation.

Conclusion: CART likely showed better similarity than RF because the ensemble learning of RF prevents overfitting, whereas CART tends to overfit when generating SPD. In SPD generation, the focus should be on statistical properties close to those of the actual data, not on a well-generalized prediction model. Neither BN nor CTGAN accurately reflected the statistical profile of the actual data, primarily because the datasets were small. For generating survival SPD from small datasets, such as clinical trial data, CART is considered the most effective method. Future work should improve CART-based generation by incorporating feature engineering and other methods.
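To make the overfitting argument concrete, the sketch below shows one common way a regression tree can be used for synthesis: grow a tree on a covariate, then, for each synthetic record, draw a donor (survival time, event flag) pair from the actual records in the matching leaf. This is a simplified, from-scratch illustration under stated assumptions and not the procedure used in the study: it uses a single numeric covariate, and the names `build_tree`, `find_leaf`, `synthesize`, and `min_leaf`, as well as the toy data, are hypothetical.

```python
import random
from typing import List, Tuple

Row = Tuple[float, float, int]  # (covariate, survival_time, event_flag)


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows: List[Row], min_leaf: int = 5) -> dict:
    """Grow a regression tree on the covariate; leaves keep their rows as donors."""
    rows = sorted(rows)
    best = None
    # Try every split that leaves at least `min_leaf` rows on each side.
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between identical covariate values
        cost = sse([r[1] for r in rows[:i]]) + sse([r[1] for r in rows[i:]])
        if best is None or cost < best[0]:
            best = (cost, i)
    # Stop when no valid split exists or splitting does not reduce the error.
    if best is None or best[0] >= sse([r[1] for r in rows]):
        return {"leaf": rows}
    i = best[1]
    return {
        "threshold": (rows[i - 1][0] + rows[i][0]) / 2,
        "left": build_tree(rows[:i], min_leaf),
        "right": build_tree(rows[i:], min_leaf),
    }


def find_leaf(tree: dict, x: float) -> List[Row]:
    """Walk down the tree and return the donor rows of the matching leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


def synthesize(rows: List[Row], n: int, min_leaf: int = 5, seed: int = 0) -> List[Row]:
    """Generate n synthetic rows by sampling donors from tree leaves."""
    rng = random.Random(seed)
    tree = build_tree(rows, min_leaf)
    covariates = [r[0] for r in rows]
    synthetic = []
    for _ in range(n):
        x = rng.choice(covariates)  # resample the covariate from the actual data
        _, time, event = rng.choice(find_leaf(tree, x))
        synthetic.append((x, time, event))
    return synthetic


# Made-up toy data: (age, survival time in months, event flag).
actual = [(float(a), 10.0 + a % 3, 1) for a in range(40, 50)]
actual += [(float(a), 30.0 + a % 3, a % 2) for a in range(60, 70)]
print(synthesize(actual, 3, min_leaf=5, seed=1))
```

In this sketch, a smaller `min_leaf` gives each leaf fewer donors, so the synthetic values track the actual records more closely. An ensemble that averages over many trees would smooth this out, which is consistent with the explanation given in the Conclusion for why a single tree can resemble the actual data more than RF.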




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.