JMIR Preprints #55118: A Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study

A Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study

Ippei Akiya;
Takuma Ishihara;
Keiichi Yamamoto

ABSTRACT

Background:

Synthetic patient data (SPD) generation for survival analysis in oncology trials holds significant potential for accelerating clinical development. Various machine learning methods, including classification and regression trees (CART), random forest (RF), Bayesian network (BN), and conditional tabular generative adversarial network (CTGAN), have been used for this purpose, but how well they reproduce actual patient survival data remains under investigation.

Objective:

This study aimed to determine the most suitable SPD generation method for oncology trials, focusing on progression-free survival (PFS) and overall survival (OS), the primary evaluation endpoints in oncology trials. To this end, we conducted a comparative simulation of 4 generation methods (CART, RF, BN, and CTGAN) and evaluated the performance of each.

Methods:

Using multiple clinical trial datasets, we generated 1000 synthetic datasets with each method for each clinical trial dataset and evaluated them on 3 criteria: (1) median survival time (MST) of PFS and OS; (2) hazard ratio distance (HRD), which indicates the similarity between the actual survival function and a synthetic survival function; and (3) visual analysis of Kaplan‒Meier (KM) plots. Each method's ability to mimic the statistical properties of real patient data was thus evaluated from multiple angles.
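The MST criterion can be illustrated with a minimal sketch. It assumes that MST is read off the KM curve as the first time at which estimated survival drops to 0.5 or below, and that the confidence interval bounds for the actual data are supplied as inputs. The function names (`kaplan_meier`, `median_survival_time`, `percent_within`) are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Optional, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct event time.

    events uses 1 for an observed event and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Group all subjects sharing the same time point.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_removed += 1
            i += 1
        if n_events > 0:
            surv *= 1.0 - n_events / n_at_risk
            curve.append((t, surv))
        # Events and censored subjects both leave the risk set after time t.
        n_at_risk -= n_removed
    return curve


def median_survival_time(times: Sequence[float], events: Sequence[int]) -> Optional[float]:
    """First time at which the KM estimate is <= 0.5; None if never reached."""
    for t, s in kaplan_meier(times, events):
        if s <= 0.5:
            return t
    return None


def percent_within(msts: Sequence[Optional[float]], lower: float, upper: float) -> float:
    """Percentage of synthetic MSTs inside [lower, upper]; unreached medians count as outside."""
    inside = [m for m in msts if m is not None and lower <= m <= upper]
    return 100.0 * len(inside) / len(msts)
```

In use, `median_survival_time` would be applied to each of the 1000 synthetic datasets, and `percent_within` would then be called with the interval bounds estimated from the actual data. How those bounds are estimated is outside this sketch.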

Results:

In most simulation cases, CART yielded high percentages of synthetic-data MSTs (MSTSs) falling within the 95% confidence interval (CI) of the actual-data MST (MSTA). These percentages ranged from 88.8% to 98.0% for PFS and from 60.8% to 96.1% for OS. In the evaluation of HRD, the values for CART were concentrated at approximately 0.9. For the other methods, by contrast, no consistent trend was observed for either PFS or OS. CART showed better similarity than RF because CART tends to overfit the data, whereas RF, an ensemble learning method, prevents overfitting. In SPD generation, the goal is statistical properties close to those of the actual data, not a well-generalized prediction model. Neither BN nor CTGAN accurately reflected the statistical properties of the actual data, because these methods are not well suited to small datasets.

Conclusions:

For generating SPD for survival data from small datasets, such as clinical trial data, CART proved to be the most effective method compared with RF, BN, and CTGAN. CART-based generation methods could be improved further in future work by incorporating feature engineering and other techniques.

Citation

Please cite as:

Akiya I, Ishihara T, Yamamoto K

Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study

JMIR Med Inform 2024;12:e55118

DOI: 10.2196/55118

PMID: 38889082

PMCID: 11196245

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 3, 2023

Date Accepted: May 8, 2024

Date Submitted to PubMed: May 23, 2024
