Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 23, 2025
Date Accepted: Apr 13, 2026

The final, peer-reviewed published version of this preprint can be found here:

Kim TH, Kim S, Kim Y, Lee H, Hwang SH, Yang SY, Smith L, Hajek A, Woo S, Yon DK

Machine Learning Prediction Model for Dyslipidemia and Its Association With Atherothrombotic Events in 3 Independent Cohorts From South Korea, Japan, and the United Kingdom: Algorithm Development and Validation Study

JMIR Med Inform 2026;14:e81130

DOI: 10.2196/81130

PMID: 42155057

PMCID: 13186430

Machine learning prediction model for dyslipidemia and its association with atherothrombotic events in three independent cohorts from South Korea, Japan, and the UK: an algorithm development and validation study

  • Tae Hyeon Kim
  • Soeun Kim
  • Yerim Kim
  • Hayeon Lee
  • Seung Ha Hwang
  • So Young Yang
  • Lee Smith
  • André Hajek
  • Selin Woo
  • Dong Keon Yon

ABSTRACT

Background:

Dyslipidemia is a complex, multifactorial condition that warrants investigation through advanced analytical approaches such as machine learning (ML). Despite its clinical significance, no artificial intelligence (AI) studies of dyslipidemia to date have been validated using multinational datasets.

Objective:

This study aimed to develop a machine learning model to predict the 5-year incidence of dyslipidemia using routinely collected health examination data. To ensure generalizability, the model was externally validated in populations from South Korea, Japan, and the UK. Furthermore, the clinical relevance of the model-derived risk was evaluated by examining its association with atherosclerotic outcomes, including acute myocardial infarction, cerebral infarction, and all-cause mortality.

Methods:

This study used three independent, large-scale, population-based cohorts. The discovery cohort from South Korea (NHIS-NSC cohort; n=1,062,018) was used for model training and internal validation, while two validation cohorts from Japan (validation A [JMDC cohort]; n=21,517,570) and the UK (validation B [UK Biobank]; n=502,367) were used for external validation. We evaluated various ML-based models using 23 features extracted from regular health screening data to predict new-onset dyslipidemia within five years. SHapley Additive exPlanations (SHAP) values were calculated to assess feature importance. To assess the robustness of the proposed ML model, we evaluated the risk of atherothrombotic events (acute myocardial infarction or cerebral infarction) and mortality according to tertiles of the model probability (T1, T2, and T3), using a Cox proportional hazards model.
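The tertile grouping of model probabilities described above can be illustrated with a minimal sketch. The abstract does not state how the tertile cut points were derived, so the rank-based split below is an assumption, and the function name `assign_tertiles` and the sample probabilities are hypothetical.

```python
from typing import List


def assign_tertiles(probs: List[float]) -> List[str]:
    """Label each predicted probability T1, T2, or T3 by rank-based tertile.

    Assumption: tertiles are formed by ranking the probabilities and
    splitting the ranks into three groups of near-equal size.
    """
    n = len(probs)
    # Indices ordered from lowest to highest predicted probability.
    order = sorted(range(n), key=lambda i: probs[i])
    labels = [""] * n
    for rank, idx in enumerate(order):
        # rank * 3 // n maps ranks 0..n-1 onto groups 0, 1, 2.
        labels[idx] = f"T{rank * 3 // n + 1}"
    return labels


# Hypothetical predicted 5-year dyslipidemia probabilities for six people.
probs = [0.05, 0.40, 0.12, 0.81, 0.33, 0.67]
groups = assign_tertiles(probs)
```

In a survival analysis, labels like these would typically enter the Cox model as a categorical covariate, with one tertile serving as the reference group.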

Results:

In the discovery cohort, a soft-voting ensemble of LightGBM and CatBoost achieved an area under the receiver operating characteristic curve (AUROC) of 78.4%, a precision of 56.9%, and an area under the precision-recall curve of 47.1%. This model showed consistent performance in validation cohort A (AUROC, 74.4%) and validation cohort B (AUROC, 68.8%). SHAP value analysis identified smoking, alcohol consumption, and physical activity as the most important features for predicting dyslipidemia. Finally, a higher model probability (T3 versus reference) was associated with an increased risk of acute myocardial infarction (adjusted hazard ratio, 1.76 [95% CI, 1.21–2.55]), cerebral infarction (1.22 [1.03–1.46]), and mortality (1.30 [1.01–1.69]).
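As a rough illustration of the two ideas behind the headline metric, the sketch below shows soft voting (averaging the predicted probabilities of two models) and AUROC computed as the fraction of positive-negative pairs that are ranked correctly. It is not the authors' pipeline: equal weights are an assumption, the two probability lists stand in for LightGBM and CatBoost outputs, and all data and function names are hypothetical.

```python
from typing import List


def soft_vote(prob_a: List[float], prob_b: List[float]) -> List[float]:
    """Average two models' predicted probabilities (equal weights assumed)."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


def auroc(y_true: List[int], scores: List[float]) -> float:
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    A tied pair counts as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = developed dyslipidemia) and model outputs.
y = [0, 0, 1, 0, 1, 1]
model_a = [0.10, 0.60, 0.40, 0.30, 0.80, 0.90]
model_b = [0.20, 0.20, 0.70, 0.50, 0.60, 0.70]

ensemble = soft_vote(model_a, model_b)
auc_a = auroc(y, model_a)
auc_ensemble = auroc(y, ensemble)
```

In this toy data, the first model misranks one positive-negative pair, while the averaged probabilities rank every pair correctly. The pairwise loop is quadratic in sample size, so a rank-based implementation would be needed at the cohort sizes reported here.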

Conclusions:

This multinational study developed and validated an ML-based model that uses routine health checkup data to predict the five-year risk of new-onset dyslipidemia. The model-derived risk was also associated with mortality and atherothrombotic events.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.