Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 29, 2025
Date Accepted: May 14, 2026
An evaluation of pre-trained generative models for augmenting small health data
ABSTRACT
Background:
Synthetic Data Generation (SDG) has emerged as a promising solution to address data scarcity in healthcare, where privacy concerns, regulatory barriers, and the high cost of data acquisition limit access to real patient datasets. Machine learning models in this domain often operate in low-data regimes, with training set sizes frequently ranging from only 100 to 650 records—conditions that hinder model generalization and increase risks of overfitting and bias. SDG addresses these challenges by producing artificial samples that mimic real-world patient data, enabling robust and privacy-preserving model development.
Objective:
This study comprehensively assessed SDG-augmented training across a wide array of models, both pre-trained and non-pre-trained, for outcome prediction in 13 healthcare datasets. For small datasets of 50 and 350 records, we answer three key questions: (1) Do pre-trained SDG models generate more effective augmentations than non-pre-trained counterparts for small datasets? (2) Is augmentation beneficial for both pre-trained and non-pre-trained classifiers for small datasets? (3) Among three state-of-the-art classification models, which offers the best predictive performance on small datasets?
Methods:
The three classifiers considered were light gradient boosted trees, large language models (LLMs) adapted to tabular data, and TabPFN, a recent transformer-based method that has become the state of the art for tabular data classification. Each classifier was augmented using different SDG methods: the current state of the art (Bayesian networks, CTGAN, TVAE, and sequential trees) and LLMs for tabular data generation.
Results:
Augmented TabPFN demonstrated superior performance, yielding significantly higher AUC and ICI scores compared with other classifiers. Post-hoc analysis revealed that, for the dataset sizes examined, SDG and LLM models exhibited overfitting tendencies. Notably, simple dataset augmentation through sampling with replacement achieved comparable improvements in the performance of TabPFN, which relies on its in-context learning capabilities. This finding suggests that, for such small datasets, the benefits of augmentation for TabPFN stem primarily from increased dataset size.
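As an illustration of what augmentation through sampling with replacement can look like, the following is a minimal sketch using only the Python standard library. The function name `augment_by_resampling`, the `(features, label)` record layout, and the choice to keep the original records and append the resampled ones are assumptions made for this example; the abstract does not specify the authors' exact procedure or target sizes.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical record layout for illustration: (feature values, class label).
Row = Tuple[Sequence[float], int]


def augment_by_resampling(rows: List[Row], target_size: int, seed: int = 0) -> List[Row]:
    """Grow a small dataset to target_size by sampling with replacement.

    Assumption: the original rows are kept and the extra rows are drawn
    uniformly at random (with replacement) from those same originals.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    if target_size < len(rows):
        raise ValueError("target_size must be at least len(rows)")
    rng = random.Random(seed)  # seeded for reproducibility
    extra = rng.choices(rows, k=target_size - len(rows))  # draws with replacement
    return list(rows) + extra


if __name__ == "__main__":
    small = [((0.1, 1.0), 0), ((0.4, 0.2), 1), ((0.9, 0.7), 1)]
    augmented = augment_by_resampling(small, target_size=10)
    print(len(augmented))  # number of records after augmentation
```

The augmented list would then be passed to the classifier in place of the original training set. No new information is created by this step, which is consistent with the interpretation that any gain comes from the larger training set size.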
Conclusions:
Given its superior performance and minimal computational overhead, we recommend augmenting TabPFN through sampling with replacement as the optimal approach for small-data classification tasks. This method not only achieves the highest classification performance among tested approaches but also offers significant computational advantages over more complex augmentation techniques such as SDG models, making it particularly suitable for resource-constrained applications.
Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.