Currently submitted to: JMIR Medical Informatics
Date Submitted: Mar 22, 2026
Open Peer Review Period: Apr 2, 2026 - May 28, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Development of a Personal Health Management Service Using Clinical Data Warehouse Data: An Algorithm for Chronic Disease Prediction
ABSTRACT
Background:
The increasing burden of chronic diseases such as hypertension and diabetes necessitates a shift from reactive to proactive preventive care. This transition is now feasible through the convergence of large-scale health data, machine learning (ML), and patient-centered policies, such as South Korea’s MyData initiative.
Objective:
The objective of this study was to develop and validate ML models using routine health-screening data to predict the onset of hypertension and diabetes, thereby providing an evidence-based foundation for personalized, data-driven prevention.
Methods:
We constructed a cohort using data from the Clinical Data Warehouse (CDW) of Seoul St. Mary’s Hospital. Two distinct datasets were analyzed: 21,589 individuals for essential hypertension prediction and 22,255 individuals for type 2 diabetes mellitus prediction. Five ML models were trained to classify disease onset. The final models were selected based on a comprehensive evaluation of the area under the receiver operating characteristic curve (AUROC) and the F1 score. Finally, variable importance in the selected models was assessed using SHapley Additive exPlanations (SHAP) values.
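As an illustration of the selection step described above, the following is a minimal sketch of how AUROC and F1 could be computed and used to compare candidate models. It is not the authors' code: the labels, the predicted scores, the model names, the 0.5 decision threshold, and the tie-breaking rule (AUROC first, then F1) are all assumptions made for this example, since the abstract does not specify how the two metrics were combined.

```python
from typing import Dict, Sequence, Tuple


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """F1 = 2TP / (2TP + FP + FN) after thresholding the scores."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 1 = disease onset, 0 = no onset.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical predicted probabilities from two candidate models.
candidates: Dict[str, Sequence[float]] = {
    "model_a": [0.1, 0.2, 0.6, 0.3, 0.7, 0.8, 0.4, 0.9],
    "model_b": [0.3, 0.6, 0.7, 0.4, 0.5, 0.8, 0.2, 0.6],
}

# Score every candidate on both metrics.
results: Dict[str, Tuple[float, float]] = {
    name: (auroc(y_true, s), f1_score(y_true, s))
    for name, s in candidates.items()
}

# Assumed selection rule: highest AUROC, with F1 as the tie-breaker.
best = max(results, key=lambda name: results[name])
print(best, results[best])
```

In practice the metrics would be computed on held-out data, and the threshold used for F1 would itself be a modeling choice.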
Results:
Among the models tested, logistic regression was selected as the final model for predicting both essential hypertension and type 2 diabetes mellitus. The selected models demonstrated high predictive performance, with an AUROC of 0.842 for hypertension and 0.954 for diabetes. SHAP analysis identified age as the most influential predictor of hypertension and HbA1c as the most influential predictor of diabetes.
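To make the SHAP ranking concrete: for a linear model such as logistic regression, and under the assumption that features are independent, the SHAP value of a feature in log-odds space is its coefficient multiplied by the feature's deviation from the background mean. The sketch below applies that closed form to toy data. The coefficients, the records, and the `bmi` feature are hypothetical and chosen only for illustration; this is neither the authors' fitted model nor the `shap` library.

```python
from statistics import fmean
from typing import Dict, List

Row = Dict[str, float]


def linear_shap(weights: Row, x: Row, background_means: Row) -> Row:
    """Per-feature SHAP values (log-odds scale) for a linear model,
    assuming independent features: phi_i = w_i * (x_i - mean_i)."""
    return {name: weights[name] * (x[name] - background_means[name])
            for name in weights}


def mean_abs_shap(weights: Row, rows: List[Row]) -> Row:
    """Global importance: mean absolute SHAP value per feature."""
    means = {name: fmean(r[name] for r in rows) for name in weights}
    totals = {name: 0.0 for name in weights}
    for r in rows:
        for name, phi in linear_shap(weights, r, means).items():
            totals[name] += abs(phi)
    return {name: total / len(rows) for name, total in totals.items()}


# Hypothetical logistic-regression coefficients (log-odds per unit).
weights = {"age": 0.05, "bmi": 0.10}
# Hypothetical screening records.
rows = [
    {"age": 30.0, "bmi": 22.0},
    {"age": 50.0, "bmi": 24.0},
    {"age": 70.0, "bmi": 26.0},
]

importance = mean_abs_shap(weights, rows)
# Rank features from most to least influential.
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking, importance)
```

A feature ranks highly here when its coefficient and its spread in the data are both large, which is why a ranking by mean absolute SHAP can differ from a ranking by raw coefficient size.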
Conclusions:
We successfully developed prediction models for hypertension and diabetes that are applicable within MyData services. These models have the potential to empower individuals in data-driven self-management and to enhance personalized disease prevention.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.