Explainable Artificial Intelligence for Predicting Mortality Risk in Metastatic Cancer: Retrospective Cohort Study Using the MSK-MET Dataset
ABSTRACT
Background:
Cancer remains a leading global health challenge and a major cause of mortality. In particular, metastatic disease significantly reduces survival rates. Advances in machine learning (ML) have opened new avenues to integrate diverse clinical and genomic information, enabling more accurate predictions of patient outcomes.
Objective:
This study aimed to apply ML methods, specifically XGBoost and explainable AI (SHAP analysis), to predict survival in patients with cancer from their metastatic patterns, clinical outcomes, and genomic characteristics. We evaluated and compared multiple ML models, examined model interpretability, and sought actionable clinical insights for personalized prognosis and treatment planning.
Methods:
We used data from 20,338 patients (after cleaning) in the MSK-MET dataset, comprising 27 cancer types. Preprocessing removed variables with high missingness and normalized continuous features. We then split the data into training (80%) and test (20%) sets via stratified random sampling. Five ML models (XGBoost, Naïve Bayes, Decision Tree, Logistic Regression, and Random Forest) underwent hyperparameter tuning using grid search with five-fold cross-validation. Model performance was assessed by accuracy and area under the receiver operating characteristic curve (AUC). The best-performing model, XGBoost, was further analyzed with SHapley Additive exPlanations (SHAP) to identify pivotal features. Subsequent survival analysis with Kaplan-Meier curves, Cox Proportional Hazards models, and XGBoost Survival Analysis investigated how key predictors (e.g., metastatic burden, tumor mutation burden) affected overall survival.
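To make the evaluation setup concrete, the following is a minimal sketch of a stratified 80/20 split and a pairwise AUC computation, written in plain Python. It is an illustration under stated assumptions, not the study's pipeline: the function names `stratified_split` and `roc_auc` are hypothetical, the labels are assumed to be binary (1 = deceased, 0 = alive), and the toy data are invented.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved per split.

    Hypothetical helper: each class is shuffled and split separately so the
    test set keeps roughly the same outcome ratio as the full cohort.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This O(n^2) form is for clarity, not speed.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy usage with invented data: 50 positive and 50 negative labels.
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In practice a library routine would replace both helpers; the sketch only shows what "stratified" and "AUC" mean operationally.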
Results:
XGBoost outperformed the other models, achieving 74% accuracy and an AUC of 0.82. SHAP values highlighted metastatic site count, tumor mutation burden, fraction of genome altered, and organ-specific metastases (particularly liver and bone) as major contributors to model predictions. Kaplan-Meier analyses showed significantly lower survival probabilities in metastatic than in non-metastatic patients. Cox modeling showed that a higher metastatic site count was associated with an increased hazard of death (hazard ratio [HR] > 1.0, p < 0.005), underscoring the heightened mortality risk. An adapted XGBoost Survival Analysis captured non-linear effects, improving the concordance index to 0.70 and confirming these features as strong predictors of adverse outcomes.
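For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C for right-censored data in plain Python. The function name `concordance_index` and the toy values are assumptions for illustration; the study's own implementation is not described in the abstract and may handle ties or censoring differently.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times       : observed follow-up time per patient
    events      : 1 if death was observed, 0 if censored
    risk_scores : model output where a higher score means higher predicted risk

    A pair (i, j) is comparable when patient i had an observed event strictly
    before patient j's follow-up time. The pair is concordant when patient i
    also has the higher risk score; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


# Toy usage with invented data: four patients, the third one censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.70 should be read.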
Conclusions:
This study demonstrates the value of combining clinical and genomic data with ML to predict survival outcomes in metastatic cancers. XGBoost, enhanced by SHAP-driven explainability, accurately stratified high-risk patients and pinpointed pivotal predictors of poor prognosis. Kaplan-Meier and Cox analyses substantiated the association between metastatic burden and reduced survival, while the survival-adapted XGBoost model refined risk differentiation. These integrated findings can guide personalized treatment plans, inform resource allocation, and ultimately enhance patient care.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.