Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Mar 20, 2025
Date Accepted: Sep 4, 2025

The final, peer-reviewed published version of this preprint can be found here:

Synthetic Tabular Data Generation Under Horizontal Federated Learning Environments in Acute Myeloid Leukemia: Case-Based Simulation Study

Isasa I, Catalina M, Epelde G, Aginako N, Beristain A

Synthetic Tabular Data Generation Under Horizontal Federated Learning Environments in Acute Myeloid Leukemia: Case-Based Simulation Study

JMIR Med Inform 2025;13:e74116

DOI: 10.2196/74116

PMID: 41021276

PMCID: 12519032

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Synthetic tabular data generation in Federated Learning environments: A practical use case for Acute Myeloid Leukemia

  • Imanol Isasa
  • Mikel Catalina
  • Gorka Epelde
  • Naiara Aginako
  • Andoni Beristain

ABSTRACT

Background:

Data scarcity and dispersion pose significant obstacles in biomedical research, particularly for rare diseases. Synthetic Data Generation (SDG) has emerged as a promising way to mitigate data scarcity. Concurrently, Federated Learning (FL) is a machine learning paradigm in which multiple nodes collaborate to build a centralized model that distills knowledge from the data held at each node, without that data being shared. This research explores the combination of SDG and FL in the context of Acute Myeloid Leukemia, a rare hematological disorder, evaluating their combined impact and the quality of the generated artificial datasets.
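As a rough illustration of how a centralized model can be built without sharing raw data, the sketch below averages model parameters reported by each node, weighted by node dataset size. This is the common federated averaging scheme and is only an assumption here: the abstract does not state which aggregation strategy the study used, and the function name `federated_average` is hypothetical.

```python
def federated_average(node_params, node_sizes):
    """Size-weighted average of per-node parameter vectors.

    Assumption: this mirrors the generic federated averaging idea,
    not necessarily the aggregation used in the study.
    node_params: list of equal-length lists of floats, one per node.
    node_sizes: number of local training rows at each node.
    """
    if len(node_params) != len(node_sizes) or not node_params:
        raise ValueError("need one size per node and at least one node")
    total = sum(node_sizes)
    n_params = len(node_params[0])
    # Only parameters travel to the server; the local rows never leave a node.
    return [
        sum(params[j] * size for params, size in zip(node_params, node_sizes)) / total
        for j in range(n_params)
    ]


# Two hypothetical nodes: the larger node pulls the average toward its values.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In this toy call the second node holds three times as many rows, so the aggregated parameters sit closer to its local values.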

Objective:

To evaluate the privacy- and fidelity-related impact of federating an SDG model under different data distribution scenarios and with different numbers of nodes, compared with a centralized baseline SDG model.

Methods:

A state-of-the-art Generative Adversarial Network architecture was trained under four scenarios: (1) a non-federated baseline with all the data available; (2) a federated scenario where the data was evenly distributed among the nodes; (3) a federated scenario where the data was unevenly and randomly distributed (imbalanced data); and (4) a federated scenario with non-independent and identically distributed (non-IID) data. For each federated scenario, a fixed set of node counts (3, 5, 7, and 10) was considered to assess its impact, and the generated data was evaluated in terms of the fidelity-privacy trade-off.
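The three federated data distributions can be pictured with a minimal partitioning sketch. It is illustrative only: the abstract does not describe how the authors split their data, so the function names, the label-sorting approach to non-IID data, and the toy rows below are all assumptions.

```python
import random


def split_even(rows, n_nodes, seed=0):
    """Shuffle rows and deal them out so node sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_nodes] for i in range(n_nodes)]


def split_imbalanced(rows, n_nodes, seed=0):
    """Shuffle rows and cut them at random points, giving uneven node sizes."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # n_nodes - 1 distinct cut points strictly inside the list keep every node non-empty.
    cuts = sorted(rng.sample(range(1, len(shuffled)), n_nodes - 1))
    bounds = [0] + cuts + [len(shuffled)]
    return [shuffled[a:b] for a, b in zip(bounds, bounds[1:])]


def split_non_iid(rows, n_nodes, key):
    """Sort rows by a label so each node sees a skewed slice of the label distribution."""
    ordered = sorted(rows, key=key)
    size, extra = divmod(len(ordered), n_nodes)
    parts, start = [], 0
    for i in range(n_nodes):
        end = start + size + (1 if i < extra else 0)
        parts.append(ordered[start:end])
        start = end
    return parts


# Toy stand-in for a tabular dataset (hypothetical columns).
rows = [{"id": i, "label": i % 2} for i in range(100)]
partitions = {
    n: {
        "even": split_even(rows, n),
        "imbalanced": split_imbalanced(rows, n),
        "non_iid": split_non_iid(rows, n, key=lambda r: r["label"]),
    }
    for n in (3, 5, 7, 10)  # node counts named in the Methods
}
```

Sorting by label is only one simple way to induce non-IID partitions; other skews (for example by feature or by site) would fit the same description.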

Results:

The computed fidelity metrics exhibited statistically significant deteriorations (P < 0.001) ranging from 0.21% to 21.23% due to the federation process. When comparing federated experiments trained with different numbers of nodes, no strong trends were observed, even though specific comparisons yielded significant differences. Privacy metrics were largely maintained, with maximum improvements of 55.17% and maximum deteriorations of 26.23%, although these changes were not statistically significant.

Conclusions:

Within the scope of this use case, federating an SDG algorithm results in a loss of data fidelity compared to the non-federated baseline while maintaining privacy levels. However, this deterioration does not significantly increase as the number of nodes used to train the models grows, even though significant differences were found in specific comparisons. How the data was distributed across nodes was not significant for most experiments or metrics, as similar trends were found in all scenarios.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.