Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Applications and methods to develop artificial intelligence-based population-specific risk models for predicting first and recurrent cardio/cerebrovascular events: PowerAI-CVD Showcase
ABSTRACT
Background:
Our team was the first in Hong Kong to develop machine learning-enhanced risk models for predicting first and recurrent cardiovascular events in predominantly Chinese subjects, using territory-wide data from our region. Initially, more than 500 candidate risk variables were considered, drawn from demographics (age, sex, source of admissions, ethnicity, number of hospitalisations prior to the index date), physiological status (systolic blood pressure [SBP], diastolic blood pressure [DBP], mean blood pressure [MBP], and the variability of SBP, DBP and MBP), disease diagnoses from 18 systems/organs, laboratory test results (complete blood count, liver and renal function, lipids, glycemic tests), and medications (23 categories). The PowerAI-CVD model is a simpler model with 19 variables; it requires less computational power yet retains high discriminative power, with a c-statistic of 0.89.
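To make the reported c-statistic concrete, the following is a minimal sketch of how a concordance statistic can be computed for a binary outcome: it is the fraction of (event, non-event) pairs in which the event case received the higher predicted risk, with ties counted as half. The function name `c_statistic` and the scores below are made up for illustration and are not taken from PowerAI-CVD or ODAT. The abstract does not say which variant was used; for censored time-to-event data a survival-specific version such as Harrell's C would normally apply.

```python
from itertools import product


def c_statistic(risk_scores, outcomes):
    """Concordance for binary outcomes (1 = event, 0 = no event).

    Returns the share of (event, non-event) pairs in which the event
    case has the higher predicted risk; tied pairs count as 0.5.
    """
    events = [r for r, y in zip(risk_scores, outcomes) if y == 1]
    non_events = [r for r, y in zip(risk_scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for event_risk, non_event_risk in product(events, non_events):
        if event_risk > non_event_risk:
            concordant += 1.0
        elif event_risk == non_event_risk:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Made-up predicted risks and outcomes, purely illustrative (not study data).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
c_value = c_statistic(scores, labels)
print(f"c-statistic: {c_value:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of events from non-events.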
Objective:
This project gave rise to a series of graphical user interface (GUI)-based applications and tools for longitudinal analysis of routinely collected electronic health records from Hong Kong, which we termed the Open-source Disease Analyzer Toolkit (ODAT).
Methods:
ODAT was developed in Python. It is publicly available at https://odat.info/ and is released under the GNU GPLv3 licence on GitHub (https://github.com/ODAT-Project), making it free and open-source for research or commercial use.
Results:
ODAT contains three chapters. Chapter 1: data cleaning, processing and dataset creation. Chapter 2: automated data analysis and risk modelling using traditional Cox regression and machine learning methods (XGBoost, Gradient Boosting, Multilayer Perceptron, Random Forest, Naïve Bayes, Decision Tree, k-Nearest Neighbor, AdaBoost, and SVM-Sigmoid). Using the top-performing machine learning model (XGBoost) as a showcase, nonlinear terms can be fed into traditional Cox regression models to enhance risk prediction. Chapter 3: graphical displays of risk estimates over 1-, 3-, 5-, 10- and 20-year periods, and interactive platforms that illustrate how the risk estimates change when treatment options are selected or deselected.
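As a rough illustration of how risk over several horizons can be derived from a Cox-type model, and how toggling a treatment changes it, the sketch below uses the standard relation risk(t) = 1 - S0(t)^exp(linear predictor). Everything in it is an assumption made for illustration: the baseline survival values, the coefficient names and values, and the helper functions are hypothetical and are not taken from PowerAI-CVD or the ODAT code base.

```python
import math

# Hypothetical baseline survival S0(t) at each horizon in years (illustrative only).
BASELINE_SURVIVAL = {1: 0.99, 3: 0.97, 5: 0.95, 10: 0.90, 20: 0.80}

# Hypothetical log-hazard-ratio coefficients for centred covariates (illustrative only).
COEFFICIENTS = {"age_per_10y": 0.40, "sbp_per_10mmHg": 0.10, "on_treatment": -0.25}


def linear_predictor(covariates):
    """Sum of coefficient * covariate value over the supplied covariates."""
    return sum(COEFFICIENTS[name] * value for name, value in covariates.items())


def absolute_risk(covariates, years):
    """Cox-model absolute risk at a horizon: 1 - S0(t) ** exp(linear predictor)."""
    return 1.0 - BASELINE_SURVIVAL[years] ** math.exp(linear_predictor(covariates))


def risk_table(covariates):
    """Absolute risk at every available horizon, ordered by year."""
    return {t: absolute_risk(covariates, t) for t in sorted(BASELINE_SURVIVAL)}


# A made-up patient: 15 years above the reference age, SBP 20 mmHg above reference.
patient = {"age_per_10y": 1.5, "sbp_per_10mmHg": 2.0, "on_treatment": 0}
untreated = risk_table(patient)
treated = risk_table({**patient, "on_treatment": 1})  # "select" the treatment option

for t in untreated:
    print(f"{t:>2}-year risk: {untreated[t]:.3f} untreated, {treated[t]:.3f} treated")
```

Because the hypothetical treatment coefficient is negative, selecting the treatment lowers the estimated risk at every horizon; an interactive display would recompute the table each time an option is toggled.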
Conclusions:
Our tools enable epidemiologists, public health practitioners and researchers to develop risk models through user-friendly GUIs, from database building through variable selection to model building.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.