Accepted for/Published in: JMIR Bioinformatics and Biotechnology

Date Submitted: Dec 28, 2024
Date Accepted: Jun 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: Comparative Study

BIOCORE Research Group MB, Almadhoun MB, Burhanuddin M

Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: Comparative Study

JMIR Bioinform Biotech 2025;6:e70621

DOI: 10.2196/70621

PMID: 41342190

PMCID: 12314567

Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: A Comparative Study

  • Mahmoud BA BIOCORE Research Group
  • Mahmoud BA Almadhoun
  • MA Burhanuddin

ABSTRACT

Background:

Prediabetes is an intermediate stage between normal glucose metabolism and diabetes and is associated with an increased risk of complications such as cardiovascular disease and kidney failure.

Objective:

Recognizing individuals with prediabetes early is crucial so that timely interventions can slow or prevent the development of diabetes.

Methods:

This study evaluates multiple machine learning models: Random Forest, XGBoost, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). To improve performance and interpretability, key clinical features, including body mass index (BMI), age, LDL-C, and HDL-C, were selected using Lasso regression. To optimize model accuracy and reduce overfitting, hyperparameters were tuned with RandomizedSearchCV for XGBoost and Random Forest and with GridSearchCV for SVM and KNN. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied to support reliable classification.
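As an illustration of the oversampling step named above, the following is a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is not the authors' code. The function name `smote`, the parameters `n_new`, `k`, and `seed`, and the toy (BMI, age) values are all assumptions made for illustration; in practice a library implementation would normally be used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative sketch only: `minority` is a list of numeric feature vectors
    and must contain at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: columns are (BMI, age); values are illustrative only.
minority = [[31.0, 52.0], [29.5, 48.0], [33.2, 60.0], [30.1, 55.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies within the range already covered by the minority class. Oversampling of this kind is usually applied to the training folds only, so that synthetic points do not leak into the data used for evaluation.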

Results:

Among the models tested, Random Forest achieved a cross-validated ROC-AUC score of 0.9117, highlighting its robustness in generalizing across datasets. XGBoost followed closely, providing balanced accuracy in distinguishing between normal and prediabetes cases. Although SVM and KNN performed adequately as baseline models, they exhibited limited sensitivity.
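For readers unfamiliar with the metric, the sketch below shows one common way a cross-validated ROC-AUC can be computed: the AUC of each fold is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and the per-fold values are averaged. The abstract does not say how the reported score was aggregated, so the averaging step is an assumption, and the fold labels and scores are made-up toy values, not the study's data.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute ROC-AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical held-out folds: (true labels, predicted prediabetes scores).
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
]
fold_aucs = [roc_auc(y, s) for y, s in folds]
mean_auc = sum(fold_aucs) / len(fold_aucs)
print(fold_aucs, mean_auc)  # [0.75, 1.0] 0.875
```

The pairwise formulation is quadratic in the number of samples, which is fine for a small illustration; rank-based implementations give the same value more efficiently.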

Conclusions:

This research demonstrates that optimized machine learning models, especially Random Forest and XGBoost, are effective tools for assessing early prediabetes risk. Future directions include validating these models in diverse clinical settings and integrating additional biomarkers to improve prediction accuracy, offering a promising avenue for early intervention and personalized treatment strategies in preventive healthcare.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.