Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Oct 6, 2025
Date Accepted: Apr 16, 2026
Enhancing Early Prediction of Gestational Diabetes Mellitus through Data Augmentation and Feature Guidance: Model Development and Validation Study
ABSTRACT
Background:
Early prediction of gestational diabetes mellitus (GDM) is critical for improving maternal health outcomes. However, predictive models are often challenged by limited early-pregnancy samples, severe class imbalance in datasets, and complex interrelationships among clinical features.
Objective:
This study aimed to develop and evaluate a unified dual-dimensional enhancement framework that integrates data augmentation and feature engineering to improve early GDM prediction performance by addressing data imbalance and leveraging medical prior knowledge.
Methods:
We proposed a framework combining generative adversarial network (GAN)-based data augmentation with large language model (LLM)-inspired feature engineering. GAN sampling was used to generate clinically plausible synthetic minority-class samples to mitigate data imbalance. The LLM was guided to organize features into domains (e.g., basic demographics, metabolic syndrome, core liver biomarkers) and to generate higher-order composite features, thereby integrating medical prior knowledge. Machine learning models were subsequently developed, and interpretability analyses were performed using SHapley Additive exPlanations (SHAP) to identify key predictors.
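The feature-engineering step described above can be illustrated with a minimal sketch. It assumes a patient record stored as a plain dictionary; the feature names, the assignment of features to domains, and the helper add_composite_features are hypothetical and are not taken from the study. The single composite shown, (fasting blood glucose + triglycerides) × pre-pregnancy BMI, is the one named in the Results of this abstract; the numeric values are made up.

```python
from typing import Dict, List

# Hypothetical domain grouping. The domain names come from the abstract;
# the feature names and their assignment are illustrative assumptions.
FEATURE_DOMAINS: Dict[str, List[str]] = {
    "basic_demographics": ["age", "pre_pregnancy_bmi"],
    "metabolic_syndrome": ["fasting_blood_glucose", "triglycerides"],
    "core_liver_biomarkers": ["alt", "ast"],
}


def add_composite_features(record: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `record` with one higher-order composite feature added.

    Composite: (fasting blood glucose + triglycerides) x pre-pregnancy BMI.
    Units are assumed to be consistent across records.
    """
    enriched = dict(record)  # copy so the input record is left unchanged
    enriched["fbg_tg_x_bmi"] = (
        record["fasting_blood_glucose"] + record["triglycerides"]
    ) * record["pre_pregnancy_bmi"]
    return enriched


# Made-up example values, for illustration only.
patient = {
    "age": 30.0,
    "pre_pregnancy_bmi": 24.0,
    "fasting_blood_glucose": 5.0,
    "triglycerides": 1.5,
    "alt": 18.0,
    "ast": 20.0,
}
enriched = add_composite_features(patient)
print(enriched["fbg_tg_x_bmi"])  # (5.0 + 1.5) * 24.0 = 156.0
```

In this sketch the domain dictionary only documents the grouping. A fuller pipeline might iterate over domains to propose candidate composites, but how the study did this is not specified in the abstract.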
Results:
The Random Forest model enhanced by tabular variational autoencoder (TVAE)-based feature augmentation demonstrated the best performance. On the test dataset, it achieved a recall of 0.7559, an accuracy of 0.8444, and an area under the receiver operating characteristic curve (AUROC) of 0.8873. SHAP analysis identified the following five features as the most influential predictors: fasting blood glucose, the composite feature (fasting blood glucose + triglycerides) × pre-pregnancy BMI, activated partial thromboplastin time, leukocyte count, and neutrophil count.
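For readers less familiar with the three reported metrics, the sketch below computes recall, accuracy, and AUROC from their definitions in plain Python. The labels, scores, and the 0.5 decision threshold are made-up toy values chosen for easy hand-checking; they do not reproduce the study's data or its reported results.

```python
from typing import Sequence


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """TP / (TP + FN): the share of true positive cases the model flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of all cases whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = GDM, 0 = no GDM (values are illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

print(recall(y_true, y_pred))    # 3 of 4 positives flagged -> 0.75
print(accuracy(y_true, y_pred))  # 6 of 8 labels correct -> 0.75
print(auroc(y_true, y_score))    # 13 of 16 pairs ranked correctly -> 0.8125
```

Recall and accuracy depend on the chosen threshold, whereas AUROC is computed from the scores alone, which is why a model can be compared on AUROC before any operating point is fixed.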
Conclusions:
The proposed dual-dimensional enhancement framework effectively alleviates data limitations and captures complex feature interactions in early GDM prediction. This strategy not only improves model performance, particularly in recall, but also provides interpretable biological evidence to support rapid clinical screening, stratified management, and early intervention in pregnancy.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.