Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 12, 2025
Date Accepted: Mar 13, 2026
Advancing Gastrointestinal Cancer Risk Prediction with Patient-Centered Machine Learning: Data and Algorithmic Strategies in a Prospective Cohort
ABSTRACT
Background:
Gastrointestinal (GI) cancers are a significant health concern, and early detection is crucial for improving patient outcomes. However, the rarity of these diseases leads to severe class imbalances in datasets, posing challenges for machine learning (ML)–based model development.
Objective:
This study aimed to evaluate various data-driven methods to address class imbalance in ML-based predictive models and enhance early GI cancer prognosis in a prospective cohort.
Methods:
We analyzed a prospective cohort of 7,482 individuals, comprising 158 GI cancer cases (2%) and 7,324 controls (98%). To mitigate class imbalance, we developed a novel patient-centered under-sampling technique (PCUSTe) and compared its performance against synthetic minority over-sampling, adaptive synthetic sampling, and hybrid methods. We implemented various ML algorithms and systematically evaluated 468 unique model configurations, combining ML algorithms, resampling methods, and hyperparameter selections. Model performance was assessed using area under the curve (AUC), Matthews correlation coefficient (MCC), and Brier score.
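As an illustration of why these three metrics suit a cohort with roughly 2% cases, the sketch below computes them in plain Python. It is not the study's evaluation code: the function names are hypothetical, AUC is assumed to mean the area under the receiver operating characteristic curve, and the six-person toy sample is invented for demonstration only. The cohort counts (158 cases, 7,324 controls) are taken from the Methods.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC from confusion-matrix counts; 0.0 by convention when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random case outranks a random control (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# A classifier that labels everyone in the cohort as a control:
# no true or false positives, 7,324 true negatives, 158 false negatives.
cases, controls = 158, 7324
accuracy_all_negative = controls / (cases + controls)
mcc_all_negative = matthews_corrcoef(tp=0, fp=0, tn=controls, fn=cases)

# Invented toy sample (not study data) to exercise AUC and Brier score.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_probs = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
toy_auc = roc_auc(toy_labels, toy_probs)
toy_brier = brier_score(toy_labels, toy_probs)
```

The all-negative classifier reaches an accuracy of about 0.979 on these counts while its MCC is 0, which is why accuracy alone is uninformative for data this imbalanced.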
Results:
The top six models, selected based on all evaluation metrics, achieved an average AUC of 0.77 (95% CI 0.76–0.78), MCC of 0.44 (95% CI 0.42–0.45), and Brier score of 0.22 (95% CI 0.21–0.22). The best-performing model, a stochastic gradient descent classifier trained on the PCUSTe dataset, achieved the highest MCC of 0.52 (95% CI 0.50–0.55).
Conclusions:
Our findings demonstrate that advanced resampling techniques, particularly PCUSTe, substantially enhance predictive accuracy in GI cancer risk modeling. These improvements have the potential to support earlier detection and personalized risk stratification in clinical settings, even for rare diseases with severely imbalanced data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.