Currently submitted to: JMIR Medical Informatics
Date Submitted: Jul 9, 2025
Open Peer Review Period: Jul 21, 2025 - Sep 15, 2025
NOTE: This is an unreviewed Preprint
Warning: This is an unreviewed preprint. Readers are warned that the document has not been peer-reviewed by expert/patient reviewers or an academic editor, may contain misleading claims, and is likely to undergo changes before final publication, if accepted, or may have been rejected/withdrawn (in which case a note "no longer under consideration" will appear).
Citation: Please cite this preprint only for review purposes or for grant applications and CVs (if you are the author).
Final version: If our system detects a final peer-reviewed "version of record" (VoR) published in any journal, a link to that VoR will appear below. Readers are then encouraged to cite the VoR instead of this preprint.
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Quantifying the Predictive Power of Social Determinants of Health in Cardiometabolic Disease Progression Using XGBoost: A Retrospective Cohort Study
ABSTRACT
Background:
Cardiometabolic diseases such as type 2 diabetes (DM2) and cardiovascular disease (CVD) are influenced not only by biomedical risk factors but also by social determinants of health (SDOH). While the inclusion of SDOH in predictive models is increasingly advocated, few studies have quantified their specific contribution in a high-risk clinical cohort using robust statistical and machine learning approaches.
Objective:
This study aims to quantify the added value of SDOH for predicting 5-year, 10-year, and overall risk of cardiometabolic disease onset among individuals already at elevated risk, and to compare this added value across multiple modelling setups and frameworks.
Methods:
We used a large, linked dataset of 160,000 inclusion events from the ELAN data warehouse in the Netherlands, combining structured, coded diagnosis and medication records from general practitioners (GPs) with individual-level socioeconomic data from Statistics Netherlands. Individuals aged 30 years or older without prior DM2 or CVD were followed to assess disease progression. We trained Cox proportional hazards and XGBoost models to predict progression to DM2 or CVD within 5 years, within 10 years, and overall. All analyses were performed in R. Experiments included comparisons of SCORE2, Cox, and XGBoost models; evaluation of time-bound and survival-based formulations; and quantification of SDOH impact using XGBoost models trained on feature subsets and gain-based feature importance.
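The gain-based quantification described above can be illustrated with a minimal sketch. It assumes that per-feature gain values have already been extracted from a trained model, and it sums their normalised shares per feature group. The study's analyses were run in R; Python is used here purely for illustration, and the function name, feature names, group labels, and gain values are all hypothetical rather than taken from the study.

```python
from collections import defaultdict


def group_gain_shares(gain_by_feature, group_of):
    """Return each feature group's share of the total gain.

    gain_by_feature: dict mapping feature name -> raw gain (non-negative).
    group_of: dict mapping feature name -> group label; features that are
              not listed fall into the group "other".
    """
    total = sum(gain_by_feature.values())
    if total <= 0:
        raise ValueError("total gain must be positive")
    shares = defaultdict(float)
    for feature, gain in gain_by_feature.items():
        shares[group_of.get(feature, "other")] += gain / total
    return dict(shares)


# Hypothetical features and gain values, for illustration only.
example_gain = {
    "systolic_bp": 30.0,
    "ldl_cholesterol": 20.0,
    "household_income": 25.0,
    "education_level": 5.0,
    "age": 20.0,
}
example_groups = {
    "systolic_bp": "medical",
    "ldl_cholesterol": "medical",
    "household_income": "social",
    "education_level": "social",
}

shares = group_gain_shares(example_gain, example_groups)
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.2f}")
```

Summing normalised gain per group gives one number per feature family. Gain reflects how much a feature's splits improved the training objective, so it can be split arbitrarily between correlated features and is best read alongside the comparison of models trained on feature subsets.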
Results:
For 10-year CVD prediction, the XGBoost binary model outperformed both Cox proportional hazards (AUC = 0.748 vs. 0.731) and SCORE2 (AUC = 0.648; P < .001). In overall event prediction, XGBoost also achieved the highest AUC (0.731), significantly better than Cox (AUC = 0.697; P < .001). For 5-year prediction, the combined XGBoost model (medical + social features) reached an AUC of 0.734, significantly higher than the medical-only model (AUC = 0.725; P < .001), and the social-only model (AUC = 0.679; P < .001). Income-related variables were among the top features in the combined model, with gains comparable to core biomedical predictors. Feature gain analysis showed that social determinants meaningfully supplement biomedical features, especially when used together. While medical features contributed more overall (total gain = 0.6066), social features added complementary value (gain = 0.2649), particularly income variables.
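The abstract does not state how the P values for the AUC differences were obtained. As one possible approach, the sketch below computes AUC from pairwise comparisons and estimates a percentile confidence interval for the AUC difference between two models with a paired bootstrap. This is an assumed stand-in, not the authors' procedure (DeLong's test is another common choice), and the function names and toy data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Percentile 95% interval for AUC(model A) - AUC(model B).

    The same resampled individuals are used for both models, so the
    comparison is paired."""
    auc(labels, scores_a)  # fails early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that lost one of the classes
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper


# Toy data, for illustration only.
toy_labels = [0, 0, 1, 1, 0, 1, 0, 1]
toy_combined = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.30, 0.90]
toy_medical = [0.20, 0.50, 0.30, 0.60, 0.40, 0.45, 0.10, 0.70]

lower, upper = bootstrap_auc_difference(toy_labels, toy_combined, toy_medical, n_boot=200)
print(f"95% interval for AUC difference: [{lower:.3f}, {upper:.3f}]")
```

The pairwise AUC is quadratic in sample size, which is fine for a sketch but would need a rank-based implementation for a cohort of this size. An interval that excludes zero suggests the two models differ in discrimination.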
Conclusions:
This study quantifies the added value of SDOH in predicting cardiometabolic disease progression. Using linked medical and socioeconomic data, we show that while biomedical factors dominate, income-related SDOH significantly enhance predictive performance, highlighting their complementary role in personalised risk assessment and model development.
Clinical Trial:
Not applicable. This study did not involve a randomized controlled trial.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.