Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 12, 2025
Date Accepted: Mar 13, 2026

The final, peer-reviewed published version of this preprint can be found here:

Advancing Gastrointestinal Cancer Risk Prediction With Patient-Centered Machine Learning: Machine Learning Modeling Study

Baublyte D, Lee J, Gunathilake M, Kim J

JMIR Med Inform 2026;14:e78931

DOI: 10.2196/78931

PMID: 42241216

Advancing Gastrointestinal Cancer Risk Prediction with Patient-Centered Machine Learning: Data and Algorithmic Strategies in a Prospective Cohort

  • Daina Baublyte
  • Jeonghee Lee
  • Madhawa Gunathilake
  • Jeongseon Kim

ABSTRACT

Background:

Gastrointestinal (GI) cancers are a significant health concern, and early detection is crucial for improving patient outcomes. However, the rarity of these diseases leads to severe class imbalances in datasets, posing challenges for machine learning (ML)–based model development.

Objective:

This study aimed to evaluate various data-driven methods to address class imbalance in ML-based predictive models and enhance early GI cancer prognosis in a prospective cohort.

Methods:

We analyzed a prospective cohort of 7,482 individuals, comprising 158 GI cancer cases (2%) and 7,324 controls (98%). To mitigate class imbalance, we developed a novel patient-centered under-sampling technique (PCUSTe) and compared its performance against synthetic minority over-sampling, adaptive synthetic sampling, and hybrid methods. We implemented various ML algorithms and systematically evaluated 468 unique model configurations, each combining an ML algorithm, a resampling method, and a hyperparameter selection. Model performance was assessed using area under the curve (AUC), Matthews correlation coefficient (MCC), and Brier score.
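
As a rough illustration of why metrics such as MCC, AUC, and Brier score are informative under this degree of imbalance, the sketch below computes them in plain Python and applies them to a trivial baseline that labels every participant a control. It is not the authors' code: the function names are hypothetical, AUC is assumed to mean the area under the receiver operating characteristic curve, and MCC is set to 0 when its denominator is zero (a common convention). Only the cohort counts (158 cases, 7,324 controls) come from the abstract.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 1 = case, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random case outscores a random control (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Cohort composition reported in the abstract.
y_true = [1] * 158 + [0] * 7324

# Trivial baseline (illustrative only): predict "control" for everyone.
y_pred_all_controls = [0] * len(y_true)
y_score_constant = [0.0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred_all_controls)) / len(y_true)

print(f"accuracy = {accuracy:.3f}")
print(f"MCC      = {mcc(y_true, y_pred_all_controls):.3f}")
print(f"AUC      = {roc_auc(y_true, y_score_constant):.3f}")
print(f"Brier    = {brier_score(y_true, y_score_constant):.3f}")
```

The baseline reaches an accuracy of about 0.979 without identifying a single case, while its MCC is 0 and its AUC is 0.5, which is why accuracy alone says little about a model trained on data this imbalanced.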

Results:

The top six models, selected based on all evaluation metrics, achieved an average AUC of 0.77 (95% CI 0.76–0.78), MCC of 0.44 (95% CI 0.42–0.45), and Brier score of 0.22 (95% CI 0.21–0.22). The best-performing model, a stochastic gradient descent classifier trained on the PCUSTe dataset, achieved the highest MCC of 0.52 (95% CI 0.50–0.55).

Conclusions:

Our findings demonstrate that advanced resampling techniques, particularly PCUSTe, substantially enhance predictive accuracy in GI cancer risk modeling. These improvements have the potential to support earlier detection and personalized risk stratification in clinical settings, even for rare diseases with severely imbalanced data.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.