Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Apr 12, 2021
Date Accepted: Jul 31, 2021

The final, peer-reviewed published version of this preprint can be found here:

Wong KCY, Xiang Y, Yin L, So HC

Uncovering Clinical Risk Factors and Predicting Severe COVID-19 Cases Using UK Biobank Data: Machine Learning Approach

JMIR Public Health Surveill 2021;7(9):e29544

DOI: 10.2196/29544

PMID: 34591027

PMCID: 8485986

Uncovering clinical risk factors and prediction of severe COVID-19: A machine learning approach based on UK Biobank data

  • Kenneth Chi-Yin Wong
  • Yong Xiang
  • Liangying Yin
  • Hon-Cheong So

ABSTRACT

Background:

COVID-19 is a major public health concern. Given the extent of the pandemic, it is urgent to identify risk factors associated with disease severity. More accurate prediction of those at risk of developing severe infections is of high clinical importance.

Objective:

Using UK Biobank (UKBB) data, we aimed to build machine learning (ML) models to predict the risk of developing severe or fatal infections and to uncover the major risk factors involved.

Methods:

We first restricted the analysis to infected subjects (N=7846), then performed the analysis at a population level, treating those with no known infection as controls (N controls=465,728). Hospitalization was used as a proxy for severity. A total of 97 clinical variables (collected prior to the COVID-19 outbreak) covering demographic variables, comorbidities, blood measurements (e.g., hematological, liver, renal function, and metabolic parameters), anthropometric measures, and other risk factors (e.g., smoking and drinking) were included as predictors. We also constructed a simplified (‘lite’) prediction model using 27 covariates that can be more easily obtained (demographic and comorbidity data). XGBoost (gradient-boosted trees) was used for prediction, and predictive performance was assessed by cross-validation. Variable importance was quantified by Shapley values and accuracy gain. Shapley dependence and interaction plots were used to evaluate the pattern of relationships between risk factors and outcomes.
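To illustrate what a Shapley value measures, the following is a minimal sketch that computes exact Shapley values for a hypothetical three-feature risk score by enumerating every feature subset. The function names (`shapley_values`, `toy_risk`), the feature weights, and the convention that an absent feature contributes nothing are all assumptions made for illustration; they are not the study's model. In practice, tree-based models such as XGBoost are usually explained with dedicated, more efficient algorithms rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all subsets of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a subset of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


def toy_risk(present):
    """Hypothetical risk score (illustrative numbers only): additive effects
    for age and cnt_tx plus a small interaction; whr has no effect here."""
    score = 0.0
    if "age" in present:
        score += 0.30
    if "cnt_tx" in present:
        score += 0.10
    if "age" in present and "cnt_tx" in present:
        score += 0.04
    return score


features = ["age", "cnt_tx", "whr"]
phi = shapley_values(toy_risk, features)
for name in features:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy example the interaction term is split equally between the two interacting features, and the values sum to the difference between the score with all features present and the score with none, which is the property that makes Shapley values convenient for ranking predictors.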

Results:

A total of 2386 severe and 477 fatal cases were identified. For the analysis among infected individuals (N=7846), our prediction model achieved AUCs of 0.723 (95% CI: 0.711-0.736) and 0.814 (95% CI: 0.791-0.838) for severe and fatal infections, respectively. The top five contributing factors for severity were age, number of drugs taken (cnt_tx), cystatin C (reflecting renal function), waist-hip ratio (WHR), and Townsend Deprivation Index (TDI). For mortality, the top features were age, testosterone, cnt_tx, waist circumference (WC), and red cell distribution width (RDW). In analyses involving the whole UKBB population, the corresponding AUCs for severity and fatality were 0.696 (95% CI: 0.684-0.708) and 0.802 (95% CI: 0.778-0.826), respectively. The same top five risk factors were identified for both outcomes, namely age, cnt_tx, WC, WHR, and TDI. Apart from the above features, type 2 diabetes (T2DM), HbA1c, and apolipoprotein A were ranked among the top 10 in at least two (out of four) analyses. Age, cystatin C, TDI, and cnt_tx were among the top 10 across all four analyses. For the ‘lite’ models, predictive performance was broadly similar, with estimated AUCs of 0.716, 0.818, 0.696, and 0.811, respectively. The top-ranked variables were similar to those above, including age, cnt_tx, WC, male sex, and T2DM.
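For readers less familiar with the AUC figures and confidence intervals reported above, here is a minimal sketch of how an AUC can be computed as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case, together with a percentile bootstrap interval. The abstract does not state how the confidence intervals were obtained, so the bootstrap here is an assumption, and the labels, scores, and function names (`auc`, `bootstrap_ci`) are made up for illustration.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (case, non-case) pairs in which the case
    has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method,
    not necessarily the one used in the study)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one of the two classes; draw again
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Made-up outcomes (1 = severe) and predicted risks, for illustration only.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9]

point = auc(labels, scores)  # 14 of 16 pairs are ordered correctly: 0.875
lower, upper = bootstrap_ci(labels, scores)
print(f"AUC = {point:.3f}, 95% CI: {lower:.3f}-{upper:.3f}")
```

With only eight observations the interval is very wide; the narrow intervals in the abstract reflect the much larger sample sizes involved.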

Conclusions:

We identified a number of baseline clinical risk factors for severe or fatal infection using ML. For example, age, central obesity, impaired renal function, multiple comorbidities, and cardiometabolic abnormalities may predispose to poorer outcomes. The presented prediction models may be useful at a population level to identify those susceptible to developing severe or fatal infections, facilitating targeted prevention strategies. A risk prediction tool is also available online. Further replication in independent cohorts is required to verify our findings.

Clinical Trial: NA




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.