Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 23, 2025
Date Accepted: Apr 13, 2026
Machine learning prediction model for dyslipidemia and its association with atherothrombotic events in three independent cohorts from South Korea, Japan, and the UK: an algorithm development and validation study
ABSTRACT
Background:
Dyslipidemia is a complex, multifactorial condition that warrants investigation with advanced analytical approaches such as machine learning (ML). Despite its clinical significance, no artificial intelligence (AI) studies of dyslipidemia to date have been validated using multinational datasets.
Objective:
This study aimed to develop an ML model to predict the 5-year incidence of dyslipidemia using routinely collected health examination data. To assess generalizability, the model was externally validated in populations from South Korea, Japan, and the UK. Furthermore, the clinical relevance of the model-derived risk was evaluated by examining its association with atherothrombotic events (acute myocardial infarction and cerebral infarction) and all-cause mortality.
Methods:
This study used three independent, large-scale, population-based cohorts. The discovery cohort from South Korea (NHIS-NSC cohort; n=1,062,018) was used for model training and internal validation, while two cohorts from Japan (validation A [JMDC cohort]; n=21,517,570) and the UK (validation B [UK Biobank]; n=502,367) were used for external validation. We evaluated various ML-based models using 23 features extracted from regular health screening data to predict the new onset of dyslipidemia within 5 years. Shapley Additive Explanation (SHAP) values were calculated to assess feature importance. To assess the robustness of the proposed ML model, we evaluated the risk of atherothrombotic events (acute myocardial infarction or cerebral infarction) and mortality according to tertiles of the model probability (T1, T2, and T3), using a Cox proportional hazards model.
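As an illustration of the tertile stratification described above, the following minimal sketch assigns each participant to T1, T2, or T3 by the rank of the predicted probability. It is not the study's code: the function name `assign_tertiles` and the example probabilities are hypothetical, and the handling of ties and group boundaries is an assumption, since the abstract does not specify it.

```python
def assign_tertiles(probs):
    """Label each predicted probability as T1, T2, or T3 by rank.

    Assumption: groups are formed by rank, so they are as equal in
    size as possible; the abstract does not state the exact cut rule.
    """
    n = len(probs)
    # Indices of participants, ordered from lowest to highest probability.
    order = sorted(range(n), key=lambda i: probs[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        # Ranks in the lowest third map to T1, the middle third to T2,
        # and the highest third to T3.
        labels[i] = "T" + str(min(3 * rank // n, 2) + 1)
    return labels


# Hypothetical model probabilities for six participants (illustrative only).
example_probs = [0.15, 0.45, 0.75, 0.20, 0.70, 0.30]
example_tertiles = assign_tertiles(example_probs)
```

In an analysis of this kind, the resulting tertile label would typically enter the Cox model as a categorical covariate, with one tertile serving as the reference group.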
Results:
In the discovery cohort, a soft-voting ensemble of LightGBM and CatBoost achieved an area under the receiver operating characteristic curve (AUROC) of 78.4%, a precision of 56.9%, and an area under the precision-recall curve of 47.1%. This model showed consistent performance in validation cohort A (AUROC, 74.4%) and validation cohort B (AUROC, 68.8%). SHAP value analysis identified smoking, alcohol consumption, and physical activity as the most important features for predicting dyslipidemia. Finally, a higher model probability (T3 versus reference) was associated with an increased risk of acute myocardial infarction (adjusted hazard ratio, 1.76 [95% CI, 1.21–2.55]), cerebral infarction (1.22 [1.03–1.46]), and mortality (1.30 [1.01–1.69]).
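To make the two central quantities in these results concrete, the sketch below shows soft voting (averaging the predicted probabilities of the base models) and the AUROC computed as the fraction of case/non-case pairs that the score ranks correctly. It uses plain Python rather than LightGBM or CatBoost. The function names, the equal weighting of the two models, and the example numbers are assumptions for illustration and are not taken from the study.

```python
from statistics import mean


def soft_vote(prob_lists):
    """Average the positive-class probabilities of several models.

    Assumption: equal weights per model; the abstract does not state
    whether the ensemble was weighted.
    """
    return [mean(ps) for ps in zip(*prob_lists)]


def auroc(labels, scores):
    """AUROC as the probability that a random case outscores a random
    non-case, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one non-case")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and base-model probabilities (illustrative only).
labels = [0, 0, 1, 1, 0, 1]
model_a_probs = [0.10, 0.40, 0.35, 0.80, 0.30, 0.60]
model_b_probs = [0.20, 0.30, 0.55, 0.70, 0.66, 0.40]

ensemble_probs = soft_vote([model_a_probs, model_b_probs])
ensemble_auroc = auroc(labels, ensemble_probs)
```

The pairwise computation is quadratic in cohort size and is shown only for clarity; at the scale of the cohorts reported here, a rank-based implementation from an established library would be the practical choice.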
Conclusions:
This multinational study developed and validated an ML-based model that uses routine health checkup data to predict the 5-year risk of new-onset dyslipidemia. The model-derived risk was also associated with atherothrombotic events and mortality.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.