JMIR Preprints #70621: Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: A Comparative Study

Current Preprint Settings

(as selected by the authors)

1. When the manuscript is submitted, allow peer review from:

(a) Anybody (open community peer review)
(b) Editor-selected reviewers (closed peer review)

2. When the manuscript is submitted, display the preprint PDF to:

(a) Anybody, anytime
(b) Logged-in users only
(c) Anybody, anytime (title and abstract only)
(d) No one

3. When the manuscript is accepted, display the accepted manuscript PDF to:

(a) Anybody, anytime
(b) Logged-in users only
(c) Anybody, anytime (title and abstract only)
(d) No one

Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: A Comparative Study

Mahmoud BA BIOCORE Research Group;
Mahmoud BA Almadhoun;
MA Burhanuddin

ABSTRACT

Background:

Prediabetes is an intermediate stage between normal glucose metabolism and diabetes and is associated with an increased risk of complications such as cardiovascular disease and kidney failure.

Objective:

Identifying individuals with prediabetes early is crucial so that timely intervention strategies can slow or prevent the development of diabetes.

Methods:

This study evaluated multiple machine learning models: Random Forest, XGBoost, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN). To improve performance and interpretability, key clinical features, including body mass index (BMI), age, LDL-C, and HDL-C, were selected using Lasso regression. To optimize model accuracy and reduce overfitting, hyperparameters were tuned with RandomizedSearchCV for XGBoost and Random Forest and with GridSearchCV for SVM and KNN. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied to support reliable classification.
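The oversampling step named in the Methods can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its k nearest minority neighbors. The abstract does not say which implementation was used, so the function name `smote`, its parameters, and the toy [BMI, age] values below are hypothetical and chosen only for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature vectors (minority class only).
    This is an illustrative sketch, not the study's actual implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class rows: [BMI, age]. Not data from the study.
prediabetes = [[31.0, 52.0], [29.5, 47.0], [33.2, 60.0], [30.1, 55.0]]
new_points = smote(prediabetes, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it always stays within the range already spanned by the minority class. In practice, oversampling is usually applied to the training folds only, so that synthetic samples do not leak into the data used for evaluation.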

Results:

Among the models tested, Random Forest achieved a cross-validated ROC-AUC score of 0.9117, highlighting its robustness in generalizing across datasets. XGBoost followed closely, providing balanced accuracy in distinguishing between normal and prediabetes cases. SVM and KNN performed adequately as baseline models but exhibited limitations in sensitivity.
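For readers unfamiliar with the metric, the following sketch shows one way a cross-validated ROC-AUC can be computed: within each fold, the AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half, and the per-fold values are then averaged. The labels and scores are hypothetical toy values, not the study's data, and the abstract does not state how the folds were constructed or aggregated.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical held-out folds: (true labels, predicted prediabetes scores).
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.3]),
]
fold_aucs = [roc_auc(y, s) for y, s in folds]
mean_auc = sum(fold_aucs) / len(fold_aucs)
print(fold_aucs, mean_auc)  # [0.75, 1.0] 0.875
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.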

Conclusions:

This research demonstrates that optimized machine learning models, especially Random Forest and XGBoost, are effective tools for assessing early prediabetes risk. Future directions include validating these models in diverse clinical settings and integrating additional biomarkers to improve prediction accuracy, offering a promising avenue for early intervention and personalized treatment strategies in preventive healthcare.

Citation

Please cite as:

BIOCORE Research Group MB, Almadhoun MB, Burhanuddin M

Optimizing Feature Selection and Machine Learning Algorithms for Early Detection of Prediabetes Risk: Comparative Study

JMIR Bioinform Biotech 2025;6:e70621

DOI: 10.2196/70621

PMID: 41342190

PMCID: 12314567

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Bioinformatics and Biotechnology

Date Submitted: Dec 28, 2024

Date Accepted: Jun 20, 2025
