Accepted for/Published in: JMIR Cancer

Date Submitted: Jul 19, 2024
Date Accepted: Apr 30, 2025

The final, peer-reviewed published version of this preprint can be found here:

Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data: Case Control Study

Sun C, Mobley E, Quillen M, Parker M, Daly M, Wang R, Visintin I, Awad Z, Fishe J, Parker A, George T, Bian J, Xu J

Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data: Case Control Study

JMIR Cancer 2025;11:e64506

DOI: 10.2196/64506

PMID: 40537065

PMCID: 12200807

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Predicting Early-Onset Colorectal Cancer in Individuals Below Screening Age Using Machine Learning and Real-World Data

  • Chengkun Sun
  • Erin Mobley
  • Michael Quillen
  • Max Parker
  • Meghan Daly
  • Rui Wang
  • Isabela Visintin
  • Ziad Awad
  • Jennifer Fishe
  • Alexander Parker
  • Thomas George
  • Jiang Bian
  • Jie Xu

ABSTRACT

Background:

Colorectal cancer (CRC) is now the leading cause of cancer-related deaths among young Americans. Accurate early prediction and a thorough understanding of the risk factors for early-onset colorectal cancer (EOCRC) are vital for effective prevention and treatment, particularly for patients below the recommended screening age.

Objective:

Our study aims to predict EOCRC using machine learning (ML) and structured electronic health record (EHR) data for individuals under the screening age of 45.

Methods:

We identified a cohort of patients younger than 45 years from the OneFlorida+ Clinical Research Consortium. Given the distinct pathology of colon cancer (CC) and rectal cancer (RC), we built separate prediction models for each cancer type using various ML algorithms. We assessed multiple prediction time windows (0, 1, 3, and 5 years) and used propensity score matching (PSM) to control for confounding variables. Model performance was assessed using established metrics. Additionally, we used SHapley Additive exPlanations (SHAP) to identify risk factors for EOCRC.
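The matching step can be illustrated with a minimal sketch. The abstract does not say which PSM variant was used, so the code below assumes greedy 1:1 nearest-neighbor matching without replacement, with a caliper. The function name `greedy_match`, the patient identifiers, the propensity scores, and the caliper of 0.05 are all hypothetical and are not taken from the study.

```python
def greedy_match(case_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on propensity scores.

    case_scores / control_scores: dict mapping a patient id to a
    propensity score in [0, 1] (assumed to be estimated beforehand).
    Returns a list of (case_id, control_id) pairs. Controls are used
    at most once; cases with no control within the caliper are dropped.
    """
    available = dict(control_scores)  # controls not yet matched
    pairs = []
    # Visit cases in sorted id order so the result is deterministic.
    for case_id in sorted(case_scores):
        if not available:
            break
        score = case_scores[case_id]
        # Closest remaining control; ties are broken by control id.
        best_id = min(available, key=lambda cid: (abs(available[cid] - score), cid))
        if abs(available[best_id] - score) <= caliper:
            pairs.append((case_id, best_id))
            del available[best_id]
    return pairs


# Hypothetical propensity scores for illustration only.
cases = {"case_a": 0.62, "case_b": 0.30, "case_c": 0.95}
controls = {"ctrl_1": 0.60, "ctrl_2": 0.33, "ctrl_3": 0.10, "ctrl_4": 0.58}

matched = greedy_match(cases, controls, caliper=0.05)
print(matched)
```

In this toy input, `case_c` has no control within the caliper and is left unmatched, which is the usual trade-off of caliper matching: better covariate balance at the cost of a smaller matched cohort.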

Results:

The models achieved area under the curve (AUC) scores of 0.811, 0.748, 0.689, and 0.686 for CC prediction and 0.829, 0.771, 0.727, and 0.721 for RC prediction at the 0-, 1-, 3-, and 5-year windows, respectively. Notable predictors in both the CC and RC groups included immune and digestive system disorders, secondary cancers, and underweight. Blood diseases emerged as prominent indicators of CC.
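As a reading aid for the AUC values, the sketch below computes AUC from its pairwise definition: the probability that a randomly chosen case receives a higher model score than a randomly chosen control, with ties counted as one half. The function name `pairwise_auc`, the labels, and the scores are hypothetical toy values, not study data, and this is not the authors' evaluation code.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: iterable of 1 (case) or 0 (control).
    scores: model scores, higher meaning more likely to be a case.
    Tied pairs contribute 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.4]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

Under this definition, an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so the decline from the 0-year to the 5-year window reflects weaker ranking of cases over controls as the prediction horizon lengthens.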

Conclusions:

This study highlights the potential of ML techniques in leveraging EHR data to predict EOCRC, offering valuable insights for potential early diagnosis in patients who are below the recommended screening age.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.