
Currently submitted to: JMIR AI

Date Submitted: Jan 26, 2026

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Generating Synthetic Intensive Care Unit Patient Records Using Adversarially Filtered Large Language Models: Development and Evaluation Study

  • Amine Chouaki; 
  • Bertrand Delezoide; 
  • Negin Heidarifard; 
  • Joelle Malaab; 
  • Nathalie Texier; 
  • Stéphane Schuck

ABSTRACT

Background:

Access to real-world electronic health records (EHRs) is limited by privacy regulations, governance constraints, and data heterogeneity, which together hinder the scalability of AI-driven clinical research. Synthetic health data have therefore emerged as a potential remedy, yet existing generative approaches often fail to ensure both statistical fidelity and clinical plausibility.

Objective:

This study aimed to develop and evaluate a novel framework for generating synthetic intensive care unit (ICU) patient records that balance statistical realism with medical consistency. We focus on distributional fidelity and clinical coherence as prerequisites for safe cohort augmentation, enabling privacy-preserving applications in healthcare AI.

Methods:

We designed a two-stage generation pipeline combining a prompt-based large language model (LLM) with a post hoc adversarial filtering mechanism. The LLM generated patient profiles in structured text format, which were parsed into tabular data before being filtered by a one-class discriminator (XGBoost) trained exclusively on real patient records. Records were retained based on a learned realism score. The framework was evaluated on the MIMIC-III dataset (n ≈ 40,000 patients, 20 selected clinical variables). Performance was compared with CTGAN and baseline LLM generation using (1) distributional similarity metrics [Kolmogorov–Smirnov (KS) statistic, total variation distance (TVD)], (2) inter-variable correlation preservation (Frobenius norm, mean correlation matrix distance), (3) rule-based medical consistency checks, and (4) downstream predictive tasks with Random Forest and LightGBM.
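To illustrate the first evaluation axis, the two distributional metrics named above can be computed directly from per-variable samples. The sketch below is a minimal pure-Python version, not the paper's implementation; the toy `real`/`synth` values and the histogram binning for TVD are our assumptions for demonstration.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def tvd(a, b, bins=10):
    """Total variation distance between shared-bin histograms
    of two samples (an empirical estimate)."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        h = [0] * bins
        for x in xs:
            h[min(int((x - lo) / width), bins - 1)] += 1
        return [c / len(xs) for c in h]
    return 0.5 * sum(abs(p - q) for p, q in zip(hist(a), hist(b)))

# Toy heart-rate-like samples (illustrative only, not MIMIC-III data)
real = [72, 75, 78, 80, 83, 85, 88, 90]
synth = [70, 74, 79, 81, 84, 86, 89, 95]
print(f"KS = {ks_statistic(real, synth):.3f}, TVD = {tvd(real, synth):.3f}")
```

Lower values of both metrics indicate closer agreement between the real and synthetic marginal distributions; in practice one would compute them per clinical variable across the full cohort.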

Results:

The adversarially filtered LLM demonstrated improved distributional alignment and preservation of inter-variable structure compared with CTGAN and baseline LLM generation, while eliminating medically implausible records identified in GAN-based outputs. In downstream predictive tasks, models trained on synthetic data generated by the proposed framework achieved performance comparable to models trained on real data when evaluated on an independent real test set.
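The elimination of medically implausible records relies on the rule-based consistency checks named in the Methods. A minimal sketch of such a filter is shown below; the specific variables, thresholds, and rules are our assumptions for illustration, since the paper's actual rule set over the 20 MIMIC-III variables is not given in the abstract.

```python
# Hypothetical plausibility rules for a synthetic ICU record, expressed as
# (name, predicate) pairs. Real rule sets would cover all modeled variables.
RULES = [
    ("heart_rate_range", lambda r: 20 <= r["heart_rate"] <= 250),
    ("age_range", lambda r: 0 <= r["age"] <= 120),
    ("systolic_above_diastolic", lambda r: r["systolic_bp"] > r["diastolic_bp"]),
]

def is_consistent(record):
    """Return True only if the record passes every consistency rule."""
    return all(check(record) for _, check in RULES)

# Two candidate synthetic records: the second violates the heart-rate
# and blood-pressure rules and would be discarded.
candidates = [
    {"heart_rate": 82, "age": 67, "systolic_bp": 130, "diastolic_bp": 80},
    {"heart_rate": 300, "age": 45, "systolic_bp": 90, "diastolic_bp": 120},
]
kept = [r for r in candidates if is_consistent(r)]
print(len(kept))  # prints 1
```

In a full pipeline, checks like these would run after the adversarial realism filter, so that retained records are both statistically realistic and clinically coherent.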

Conclusions:

Combining prompt-based LLM generation with post hoc adversarial filtering produced synthetic ICU records that preserve both marginal distributions and inter-variable structure while removing medically implausible outputs. Because models trained on these synthetic data performed comparably to models trained on real data, the framework appears to be a viable basis for privacy-preserving cohort augmentation in healthcare AI.


 Citation

Please cite as:

Chouaki A, Delezoide B, Heidarifard N, Malaab J, Texier N, Schuck S

Generating Synthetic Intensive Care Unit Patient Records Using Adversarially Filtered Large Language Models: Development and Evaluation Study

JMIR Preprints. 26/01/2026:92164

DOI: 10.2196/preprints.92164

URL: https://preprints.jmir.org/preprint/92164


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.