Accepted for/Published in: JMIR Public Health and Surveillance
Date Submitted: Apr 12, 2021
Date Accepted: Jul 31, 2021
Uncovering clinical risk factors and prediction of severe COVID-19: A machine learning approach based on UK Biobank data
ABSTRACT
Background:
COVID-19 is a major public health concern. Given the extent of the pandemic, it is urgent to identify risk factors associated with disease severity. More accurate prediction of those at risk of developing severe infections is of high clinical importance.
Objective:
Based on the UK Biobank (UKBB), we aimed to build machine learning (ML) models to predict the risk of developing severe or fatal infections and to uncover the major risk factors involved.
Methods:
We first restricted the analysis to infected subjects (N=7846), then performed analysis at a population level, considering those with no known infection as controls (N controls=465,728). Hospitalization was used as a proxy for severity. A total of 97 clinical variables (collected prior to the COVID-19 outbreak) covering demographic variables, comorbidities, blood measurements (e.g., hematological, liver, renal function, and metabolic parameters), anthropometric measures, and other risk factors (e.g., smoking and drinking) were included as predictors. We also constructed a simplified (‘lite’) prediction model using 27 covariates that can be more easily obtained (demographic and comorbidity data). XGBoost (gradient-boosted trees) was used for prediction, and predictive performance was assessed by cross-validation. Variable importance was quantified by Shapley values and accuracy gain. Shapley dependence and interaction plots were used to evaluate the pattern of relationship between risk factors and outcomes.
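To illustrate what a Shapley value measures, the following is a minimal sketch that computes exact Shapley values for a single prediction by enumerating all feature coalitions. It is not the study's pipeline: the abstract does not state which Shapley algorithm or software was used, and tree-model implementations typically rely on faster approximations. The function `toy_risk`, its coefficients, the three features, and the baseline values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    This is a simplifying assumption, not the value function of any
    particular library.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k that excludes feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical risk score with an age-by-diabetes interaction."""
    age, waist_cm, diabetes = z
    return 0.02 * age + 0.01 * waist_cm + 0.3 * diabetes + 0.005 * age * diabetes


x = [70, 100, 1]        # hypothetical individual
baseline = [55, 90, 0]  # hypothetical reference individual
phi = shapley_values(toy_risk, x, baseline)
for name, contribution in zip(["age", "waist_cm", "diabetes"], phi):
    print(f"{name}: {contribution:+.4f}")
```

The contributions sum to the difference between the individual's score and the baseline score, which is the property that makes Shapley values usable as an additive ranking of risk factors. The interaction term is split evenly between age and diabetes, which is the kind of effect an interaction plot is meant to expose.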
Results:
A total of 2386 severe and 477 fatal cases were identified. For analysis among infected individuals (N=7846), our prediction model achieved AUCs of 0.723 (95% CI: 0.711-0.736) and 0.814 (CI: 0.791-0.838) for severe and fatal infections, respectively. The top five contributing factors for severity were age, number of drugs taken (cnt_tx), cystatin C (reflecting renal function), waist-hip ratio (WHR), and Townsend Deprivation Index (TDI). For mortality, the top features were age, testosterone, cnt_tx, waist circumference (WC), and red cell distribution width (RDW). In analyses involving the whole UKBB population, the corresponding AUCs for severity and fatality were 0.696 (CI: 0.684-0.708) and 0.802 (CI: 0.778-0.826), respectively. The same top five risk factors were identified for both outcomes, namely age, cnt_tx, WC, WHR, and TDI. Apart from the above features, type 2 diabetes (T2DM), HbA1c, and apolipoprotein A were ranked among the top 10 in at least two (out of four) analyses. Age, cystatin C, TDI, and cnt_tx were among the top 10 across all four analyses. For the ‘lite’ models, the predictive performances were broadly similar, with estimated AUCs of 0.716, 0.818, 0.696, and 0.811 for the four analyses, respectively. The top-ranked variables were similar to those above, including, for example, age, cnt_tx, WC, male sex, and T2DM.
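For readers less familiar with the AUC figures reported above, the following sketch shows how an AUC and a 95% interval can be computed from predicted scores and observed outcomes. The percentile bootstrap is an assumption made for illustration; the abstract does not state how the reported confidence intervals were derived. The scores and labels are made-up toy data, not UKBB results.

```python
import random


def auc(scores, labels):
    """Probability that a random case outranks a random non-case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (illustrative assumption)."""
    rng = random.Random(seed)
    indices = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        # Skip resamples that contain only one outcome class
        if 0 < sum(ys) < len(ys):
            stats.append(auc([scores[i] for i in sample], ys))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data: 1 = severe outcome, 0 = not severe
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.30]
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
point = auc(toy_scores, toy_labels)
lower, upper = bootstrap_ci(toy_scores, toy_labels)
print(f"AUC = {point:.3f}, 95% bootstrap interval = ({lower:.3f}, {upper:.3f})")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from non-cases.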
Conclusions:
We identified a number of baseline clinical risk factors for severe/fatal infection using ML. For example, age, central obesity, impaired renal function, multiple comorbidities, and cardiometabolic abnormalities may predispose to poorer outcomes. The presented prediction models may be useful at a population level to identify those susceptible to developing severe/fatal infections, facilitating targeted prevention strategies. A risk prediction tool is also available online. Further replication in independent cohorts is required to verify our findings.
Clinical Trial: NA
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.