JMIR Preprints #77276: Predictive Modeling for Type 2 Diabetes Risk Using Machine Learning: A Focus on Kuwait’s Growing Health Challenge

Predictive Modeling for Type 2 Diabetes Risk Using Machine Learning: A Focus on Kuwait’s Growing Health Challenge

Manayer Althwaithi;
Khaled Mohamad Almustafa;
Manayer Althwaithi

ABSTRACT

Background:

This study investigates the use of machine learning (ML) models to predict type 2 diabetes risk in Kuwait, where diabetes prevalence is rising rapidly. Five classifiers—Random Forest, Logistic Regression, SGD Classifier, Support Vector Machine, and AdaBoost—were evaluated using accuracy, precision, recall, and F1-score. Random Forest achieved the highest accuracy (77.06%) and F1-score (0.67), followed closely by Logistic Regression and SGD. Feature importance analysis identified glucose levels, BMI, and age as the most influential predictors. Sensitivity analysis highlighted the impact of hyperparameter tuning, especially in SGD and SVM models. Based on performance and interpretability, Random Forest is recommended as the primary predictive tool, while Logistic Regression may support risk screening in primary care, and SGD is well-suited for real-time applications. Emphasizing key clinical features and tuning model parameters can significantly enhance predictive accuracy. This work supports the integration of ML into diabetes screening programs in Kuwait, offering a scalable, data-driven approach to early detection and healthcare planning.

Objective:

The paper aims to develop and evaluate machine learning models for the prediction of Type 2 Diabetes (T2D) risk, specifically tailored to the Kuwaiti population. It focuses on integrating clinical, demographic, genetic, and lifestyle factors into various classifiers—including Random Forest, Logistic Regression, SGD Classifier, SVM, and AdaBoost—to determine the most effective predictive model. The overarching goal is to support early detection and informed healthcare decision-making in Kuwait by identifying high-risk individuals and enabling data-driven, region-specific preventive strategies.

Methods:

Data Collection: Data was sourced from public health datasets (e.g., Ministry of Health Kuwait, WHO, IDF) and clinical records from institutions such as the Dasman Diabetes Institute. Data included demographic, lifestyle, and clinical variables (e.g., glucose, BMI, age, blood pressure).

Data Preprocessing: Cleaning: missing values were handled via imputation or deletion. Normalization: continuous features were scaled for consistency. Anonymization: patient identifiers were removed. Formatting: data was encoded to be ML-ready.

Feature Selection: Key predictors were selected based on domain knowledge and literature: Glucose, BMI, Age, Blood Pressure, Insulin, Diabetes Pedigree Function, Pregnancies, etc. Feature importance was later analyzed per model.

Model Selection and Implementation: Five machine learning classifiers were developed and tested: Random Forest, Logistic Regression, SGD Classifier, Support Vector Machine (SVM), and AdaBoost. Models were trained using preprocessed data and optimized through hyperparameter tuning.

Evaluation Metrics: Models were assessed using Accuracy, F1-Score, Precision, Recall, ROC-AUC, MAE (Mean Absolute Error), RMSE (Root Mean Square Error), and the Confusion Matrix.

Sensitivity Analysis: Conducted for all classifiers to evaluate performance under different hyperparameter settings. Example: number of estimators and max depth in Random Forest, regularization strength in Logistic Regression, learning rate in SGD.

Validation and Generalization: Results were validated through comparative performance. Plans for validation on external Kuwaiti clinical datasets were outlined for future work.
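The workflow described in the Methods (scaling, training five classifiers, and scoring them) can be illustrated with a minimal sketch. It is not the authors' code. It assumes scikit-learn, uses a synthetic dataset from `make_classification` as a stand-in for the study data, and borrows the hyperparameter values reported in the Results. The sample size, train/test split, and random seeds are hypothetical choices, and mapping the reported α to scikit-learn's `alpha` argument is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the study data (assumption: 8 numeric predictors
# and a binary outcome); the real dataset is not reproduced here.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Hyperparameters follow the values named in the Results where available;
# treating the reported alpha as scikit-learn's `alpha` is an assumption.
models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=10, random_state=42
    ),
    "Logistic Regression": LogisticRegression(C=0.1, max_iter=1000),
    "SGD Classifier": SGDClassifier(alpha=0.001, random_state=42),
    "SVM": SVC(random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    # Scale features inside the pipeline so the scaler is fit on training data only.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

Because the data here is synthetic, the printed scores say nothing about the study's findings; the sketch only shows how such a comparison could be wired together.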

Results:

Best Performing Model: Random Forest. Accuracy: 77.06%. F1-score: 0.67 (initial); later experiments showed up to 0.84. Precision: 72.60%–85%. Recall: 63.10%–83%. ROC-AUC: 0.84. Random Forest consistently outperformed other models across most metrics and was the most robust and balanced model for diabetes risk prediction.

Other Model Performances:

Logistic Regression: Accuracy: ~76.6%, F1-score: up to 0.77, ROC-AUC: 0.82. Highly interpretable, good for clinical use.

SGD Classifier: Accuracy: ~76.6%, F1-score: up to 0.80, but highly sensitive to learning rate.

Support Vector Machine (SVM): Accuracy: ~75.8%, F1-score: 0.78, ROC-AUC: 0.82.

AdaBoost: Accuracy: ~74.9%–83%, F1-score: up to 0.82, ROC-AUC: 0.79. Moderate performance; sensitive to tuning and outliers.

Feature Importance (Consistent Across Models): The top 3 predictors were Glucose, BMI, and Age. Others included Diabetes Pedigree Function, Blood Pressure, Insulin, and Pregnancies.

Sensitivity Analysis: Random Forest's best configuration was 100 estimators with max depth = 10. For Logistic Regression, C = 0.1 yielded the best F1-score. SGD required precise tuning (best α = 0.001). AdaBoost showed minimal improvement beyond 50 estimators.

Confusion Matrix & ROC-AUC Highlights: Random Forest had the best balance between true positives and true negatives. SGD had the weakest recall, missing a large number of actual diabetes cases. Naive Bayes had high recall (65%) but lower precision than Random Forest.

Final Recommendation: Random Forest as the primary model for Kuwaiti healthcare applications; Logistic Regression as a secondary, interpretable tool for clinics; SGD Classifier for real-time/telemedicine applications.
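As a quick consistency check on the Random Forest figures above, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below is illustrative only: pairing the lower ends of the reported ranges together (and the upper ends together) is an assumption, since the abstract does not say which precision value belongs to which recall value, and the function name is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the F1-score, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Reported Random Forest ranges: precision 72.60%-85%, recall 63.10%-83%.
# Assumption: the lower bounds belong to one run and the upper bounds to another.
f1_low = f1_from_precision_recall(0.7260, 0.6310)
f1_high = f1_from_precision_recall(0.85, 0.83)

print(f"F1 from lower bounds: {f1_low:.3f}")
print(f"F1 from upper bounds: {f1_high:.3f}")
```

Under that pairing the two values come out at roughly 0.675 and 0.840, which is in line with the reported F1-scores of 0.67 (initial) and up to 0.84.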

Conclusions:

This study demonstrates that machine learning models, particularly Random Forest, are highly effective in predicting Type 2 Diabetes (T2D) risk, especially within the Kuwaiti context where lifestyle and genetic factors significantly contribute to diabetes prevalence. Among the classifiers tested, Random Forest consistently achieved the highest accuracy (77.06%) and F1-score (0.67–0.84), proving to be the most robust and reliable model. It effectively balances precision and recall, making it well-suited for early detection and healthcare interventions. Logistic Regression, while slightly less accurate, offers high interpretability and is ideal for clinical settings where transparency is essential. The SGD Classifier showed potential for use in real-time systems, though it requires careful tuning. SVM and AdaBoost, while competent, did not outperform Random Forest or Logistic Regression and are better suited for specific or more complex tasks. Glucose levels, BMI, and Age were identified as the most influential predictors across all models. These findings reinforce the importance of focusing on these features in screening programs. The study highlights the value of integrating ML into Kuwait’s healthcare strategy to enable early intervention, personalized care, and data-driven decision-making. Future work will involve external validation and clinical integration to ensure model generalizability and practical deployment.

Clinical Trial: NA

Citation

Please cite as:

Althwaithi M, Almustafa KM, Althwaithi M

Predictive Modeling for Type 2 Diabetes Risk Using Machine Learning: A Focus on Kuwait’s Growing Health Challenge

JMIR Preprints. 10/05/2025:77276

DOI: 10.2196/preprints.77276

URL: https://preprints.jmir.org/preprint/77276

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Previously submitted to: JMIR Bioinformatics and Biotechnology (no longer under consideration since Mar 23, 2026)

Date Submitted: May 10, 2025
