Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 29, 2019
Date Accepted: Mar 28, 2020

The final, peer-reviewed published version of this preprint can be found here:

Accurate Prediction of Coronary Heart Disease for Patients With Hypertension From Electronic Health Records With Big Data and Machine-Learning Methods: Model Development and Performance Evaluation

Du Z, Yang Y, Zheng J, Li Q, Lin D, Li Y, Fan J, Cheng W, Chen XH, Cai Y

JMIR Med Inform 2020;8(7):e17257

DOI: 10.2196/17257

PMID: 32628616

PMCID: 7381262

Accurate Prediction of coronary heart disease for hypertensive patients from electronic health records: the power of big data and machine learning methods

  • Zhenzhen Du
  • Yujie Yang
  • Jing Zheng
  • Qi Li
  • Denan Lin
  • Ye Li
  • Jianping Fan
  • Wen Cheng
  • Xie-Hui Chen
  • Yunpeng Cai

ABSTRACT

Background:

Predicting cardiovascular disease risk from health records has long attracted broad research interest, but prediction accuracy has remained unsatisfactory despite extensive effort. This raises the question of whether insufficient data, the statistical and machine learning methods used, or intrinsic noise have hindered the performance of previous approaches, and how these limitations can be alleviated.

Objective:

Using data from a large population of patients with hypertension in Shenzhen, we aimed to establish a high-precision coronary heart disease (CHD) prediction model through big data and machine learning methods. We also performed a comparison study with traditional approaches to determine the key factors that affect model precision.

Methods:

A large cohort of 42,676 registered patients with hypertension, including 20,156 with CHD onset, was investigated using their electronic health records (EHRs) from 1-3 years prior to CHD onset (for positive cases) or from a disease-free follow-up period of more than 3 years (for negative cases). The population was divided evenly into independent training and test datasets. Various machine learning methods were applied to the training set to build high-accuracy prediction models, and the results were compared with traditional statistical methods and well-known risk scales. Comparison studies were also performed to assess the effects of training sample size, factor sets, and modeling approaches on prediction performance.
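The evaluation protocol described above (an even split into independent training and test sets, with models scored on the held-out half) can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the function names `split_evenly` and `roc_auc`, the random seed, and the toy inputs are assumptions, and the metric shown is the area under the ROC curve that the Results report.

```python
import random


def split_evenly(records, seed=0):
    """Shuffle the records and split them into two halves (training, test).

    The seed is a hypothetical choice made only for reproducibility.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels are 0/1, scores are predicted risks; tied scores share
    their average rank, so a tie counts as half a correct ordering.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(pairs[k][1] for k in range(i, j))
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    train, test = split_evenly(range(10))
    print(len(train), len(test))
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

In practice the split would be applied to patient records and `roc_auc` to the test-set labels and the model's predicted risks; a library routine could replace the hand-written metric.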

Results:

An ensemble method, XGBoost, achieved high accuracy, with an area under the receiver operating characteristic curve (AUC) of 0.943, in predicting 3-year CHD onset on the independent test dataset. Comparison studies showed that nonlinear models (k-nearest neighbors [kNN], AUC 0.908; random forests, AUC 0.938) outperformed linear models (logistic regression, AUC 0.865) on the same datasets, and machine learning methods significantly surpassed traditional risk scales or fixed models (such as the Framingham CVD risk models). Further analyses showed that using time-dependent features obtained from multiple records, including both statistical variables and changing-trend variables, improved performance compared with using static features only. Subpopulation studies showed that feature design had a greater impact on model accuracy than population size. Marginal effect analysis showed that both traditional and EHR factors exhibited highly nonlinear characteristics with respect to the risk scores.
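To make the idea of time-dependent features concrete, the sketch below derives a few statistical variables and one changing-trend variable from repeated measurements of a single quantity. The abstract does not say which statistics or trend definitions were used, so the choices here (mean, population standard deviation, minimum, maximum, last value, and a least-squares slope), the function names, and the blood pressure example are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def trend_slope(times, values):
    """Least-squares slope of values against times.

    Returns 0.0 when all measurement times coincide, since no trend
    can be estimated in that case.
    """
    t_bar, v_bar = mean(times), mean(values)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        return 0.0
    numer = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    return numer / denom


def time_dependent_features(times, values):
    """Summarize one repeatedly measured quantity as a feature dictionary.

    times and values are parallel, chronologically ordered sequences.
    """
    return {
        "mean": mean(values),                  # statistical variables
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        "last": values[-1],
        "slope": trend_slope(times, values),   # changing-trend variable
    }


if __name__ == "__main__":
    # Hypothetical systolic blood pressure readings at years 0, 1, and 2.
    features = time_dependent_features([0, 1, 2], [130, 140, 150])
    print(features)
```

A static-feature model would see only a single value such as `last`, whereas the dictionary above also exposes variability and direction of change across visits.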

Conclusions:

We demonstrated that accurate risk prediction of coronary heart disease from electronic health records is possible given a sufficiently large population of training instances. Sophisticated machine learning methods played an important role in tackling the heterogeneity and nonlinear nature of disease prediction. Moreover, EHRs accumulated over multiple time points provided additional features that are valuable for risk prediction. Our study supports the importance of accumulating EHR big data for accurate disease prediction.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.