Previously submitted to: Interactive Journal of Medical Research (no longer under consideration since Sep 06, 2023)
Date Submitted: Apr 24, 2023
Open Peer Review Period: Apr 18, 2023 - May 2, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Prediction of Diabetes Using Machine Learning and Data Mining Algorithm
ABSTRACT
Background:
Diabetes is a major health problem with a high prevalence in both children and adults. Machine learning has emerged as a developing, reliable, and supportive technology in health care, and data mining, the process of selecting, exploring, and modeling large amounts of data, is a useful technique for analyzing interventions, diseases, and conditions within the health system.
Objective:
This study aimed to predict fasting blood sugar status using machine learning and data mining.
Methods:
The data came from a diabetes screening program in Tehran, in which 3376 participants over 30 years old at 16 comprehensive health service centers were screened to assess the prevalence of diabetes and its related risk factors. Because the dataset was imbalanced with respect to the output variable, random sampling and the synthetic minority oversampling technique (SMOTE) were used to balance it. Four machine learning algorithms (CatBoost, random forest, XGBoost, and logistic regression) were used to model the dataset, and the Shapley technique was used to select the most important features. The models were evaluated using accuracy, sensitivity, specificity, F1-score, and area under the curve (AUC).
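The oversampling step can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic record is placed at a random point on the line segment between a minority-class record and one of its nearest minority-class neighbors. This is not the authors' pipeline; the function name `smote`, its parameters, and the toy values (age, body mass index) are assumptions made purely for illustration, and a real analysis would normally rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Made-up minority-class records: [age in years, body mass index].
minority = [[52.0, 31.0], [61.0, 29.5], [47.0, 33.2], [58.0, 30.1]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic record is an interpolation of two real minority records, it always falls within the range of values already observed in that class. Oversampling of this kind is usually applied to the training split only, so that the evaluation data remain untouched.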
Results:
The Shapley technique identified age, waist-to-hip ratio, body mass index, and systolic blood pressure as the most important features for predicting fasting blood sugar status. Among the models, the CatBoost algorithm performed best, with an accuracy of 65.98%, a sensitivity of 71.32%, a specificity of 64.54%, and an AUC of 0.74.
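For readers less familiar with these criteria, the sketch below shows how accuracy, sensitivity, specificity, F1-score, and AUC can be computed from true labels and predicted scores. The labels, scores, the 0.5 decision threshold, and the function names are all hypothetical and unrelated to the study data. AUC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, so it is a value between 0 and 1 rather than a percentage.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, specificity, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true positives detected
        "specificity": tn / (tn + fp),  # share of true negatives detected
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two kinds of figures are reported on different scales.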
Conclusions:
In this study, a predictive model was developed using gradient-boosted decision tree algorithms to identify the most important risk factors related to diabetes. In order of importance, these were age, waist-to-hip ratio, body mass index, and systolic blood pressure. This model can be used in planning for diabetes management.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.