Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Oct 6, 2025
Date Accepted: Apr 16, 2026

The final, peer-reviewed published version of this preprint can be found here:

Enhancing Early Prediction of Gestational Diabetes Mellitus Through Data Augmentation and Feature Guidance: Model Development and Validation Study

Chen X, Jiang Z, Su D, Chen X, Chen A, Zhang Z, Wang H

JMIR Med Inform 2026;14:e85335

DOI: 10.2196/85335

PMID: 42184375

Enhancing Early Prediction of Gestational Diabetes Mellitus through Data Augmentation and Feature Guidance: Model Development and Validation Study

  • Xiekun Chen
  • Zhifa Jiang
  • Dong Su
  • Xiaoping Chen
  • Aiping Chen
  • Zhen Zhang
  • Huabin Wang

ABSTRACT

Background:

Early prediction of gestational diabetes mellitus (GDM) is critical for improving maternal health outcomes. However, predictive models are often challenged by limited early-pregnancy samples, severe class imbalance in datasets, and complex interrelationships among clinical features.

Objective:

This study aimed to develop and evaluate a unified dual-dimensional enhancement framework that integrates data augmentation and feature engineering to improve early GDM prediction performance by addressing data imbalance and leveraging medical prior knowledge.

Methods:

We proposed a framework combining Generative Adversarial Network (GAN)-based data augmentation with Large Language Model (LLM)-inspired feature engineering. GAN sampling was used to generate clinically plausible synthetic minority-class samples to mitigate data imbalance. The LLM was guided to organize features into domains (e.g., basic demographics, metabolic syndrome, core liver biomarkers) and to generate higher-order composite features that integrate medical prior knowledge. Machine learning models were subsequently developed, and interpretability analyses were performed using SHapley Additive exPlanations (SHAP) to identify key predictors.
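As an illustration of what a higher-order composite feature can look like, the sketch below computes the one composite named in the Results, (fasting blood glucose + triglycerides) × pre-pregnancy BMI. It is a minimal sketch, not the authors' implementation: the field names `fbg`, `tg`, and `pre_bmi`, the output name `fbg_tg_x_bmi`, and the sample values are all hypothetical, and the abstract does not state units or any scaling applied before the features are combined.

```python
from typing import Dict, List


def add_composite_feature(records: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return copies of the records with one added composite feature.

    Field names are hypothetical; the abstract does not name the columns.
    """
    enriched = []
    for row in records:
        new_row = dict(row)  # copy so the input records are left unchanged
        # (fasting blood glucose + triglycerides) x pre-pregnancy BMI
        new_row["fbg_tg_x_bmi"] = (row["fbg"] + row["tg"]) * row["pre_bmi"]
        enriched.append(new_row)
    return enriched


# Made-up values for illustration only; these are not study data.
sample = [
    {"fbg": 4.5, "tg": 1.5, "pre_bmi": 22.0},
    {"fbg": 5.0, "tg": 2.0, "pre_bmi": 30.0},
]
enriched = add_composite_feature(sample)
for row in enriched:
    print(row["fbg_tg_x_bmi"])
```

The product term lets a single feature carry an interaction between glycemic, lipid, and body-mass information, which tree-based models would otherwise have to approximate through several successive splits.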

Results:

The Random Forest model enhanced by TVAE-based feature augmentation demonstrated the best performance. On the test dataset, it achieved a recall of 0.7559, an accuracy of 0.8444, and an area under the receiver operating characteristic curve (AUROC) of 0.8873. SHAP analysis identified the following five features as the most influential predictors: fasting blood glucose, the composite feature (fasting blood glucose + triglycerides) × pre-pregnancy BMI, activated partial thromboplastin time, leukocyte count, and neutrophil count.
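For readers less familiar with the three reported metrics, the following sketch shows how recall, accuracy, and AUROC are defined for a binary classifier. It uses a made-up toy example, not the study data, and the function names and the 0.5 decision threshold are assumptions for illustration. AUROC is computed here from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence, Tuple


def recall_and_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Recall = TP / (TP + FN); accuracy = (TP + TN) / N."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), (tp + tn) / len(y_true)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels (1 = GDM, 0 = no GDM) and predicted scores; illustrative only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

recall, accuracy = recall_and_accuracy(y_true, y_pred)
auc = auroc(y_true, scores)
print(recall, accuracy, auc)
```

Recall and accuracy depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds. With imbalanced classes, accuracy can stay high even when many positive cases are missed, which is why recall matters for screening.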

Conclusions:

The proposed dual-dimensional enhancement framework effectively alleviates data limitations and captures complex feature interactions in early GDM prediction. This strategy not only improves model performance, particularly in recall, but also provides interpretable biological evidence to support rapid clinical screening, stratified management, and early intervention in pregnancy.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.