Currently submitted to: JMIR Cancer

Date Submitted: May 12, 2026
Open Peer Review Period: May 28, 2026 - Jul 23, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

An Interpretable Machine Learning Model for Preoperative Assessment of LNM Risk in Colorectal Cancer Using Routinely Collected Clinical Data: Development and External Validation in a Multicenter Retrospective Study

  • Gehui Xu
  • Xiaohe Sun
  • Jiaxin Jiang
  • Jin Sun
  • Liu Li
  • Haibo Cheng
  • Yuekun Zhu
  • Haiyi Liu

ABSTRACT

Background:

Preoperative assessment of lymph node metastasis (LNM) risk is needed to support risk stratification and individualized treatment planning in patients with colorectal cancer (CRC). However, imaging-based nodal staging may be affected by image quality and readers’ experience, and its ability to detect microscopic metastatic disease remains limited. Machine learning models based on routinely available clinical data may provide an accessible and interpretable approach to support individualized preoperative decision-making.

Objective:

This study aimed to develop an interpretable machine learning–based model to estimate the preoperative risk of LNM in patients with CRC. The model incorporated preoperative variables that are routinely obtained in clinical practice and was further evaluated in an independent external validation set.

Methods:

We retrospectively analyzed data from 2725 patients diagnosed with CRC at two independent hospitals. The internal cohort was randomly split at a 7:3 ratio for model training and testing. The second-center cohort served as the independent external validation set. The outcome was pathologically confirmed regional LNM. Candidate variables included demographic characteristics, laboratory indicators, tumor markers, and tumor-related clinicopathological features available before surgery. Variables independently associated with LNM were identified using logistic regression analyses. Seven machine learning models were constructed using LightGBM, random forest, support vector machine, logistic regression, decision tree, XGBoost, and naive Bayes. Model performance was evaluated in terms of discrimination, calibration, clinical utility, and classification metrics. We used the area under the receiver operating characteristic curve (AUC) to assess discrimination. Accuracy, sensitivity, specificity, F1 score, positive predictive value (PPV), and negative predictive value (NPV) described classification performance. Calibration curves compared predicted risks with observed outcomes. Decision curve analysis estimated the model’s net clinical benefit. SHapley Additive exPlanations (SHAP) analysis was used to interpret the selected model and assess predictor contributions.
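As an illustration of the classification metrics and the net benefit quantity named above, the following minimal Python sketch computes them from the four cells of a confusion matrix. It is not the authors' code: the names `ConfusionCounts`, `classification_metrics`, and `net_benefit` are hypothetical, the example counts are invented for demonstration and are not study data, and the net benefit formula is the standard decision curve definition, which the abstract does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a 2x2 confusion matrix (hypothetical helper type)."""
    tp: int  # LNM-positive patients predicted positive
    fp: int  # LNM-negative patients predicted positive
    tn: int  # LNM-negative patients predicted negative
    fn: int  # LNM-positive patients predicted negative


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return the six classification metrics listed in the Methods.

    Assumes every denominator is nonzero, i.e. both classes are present
    and the model makes at least one positive and one negative call.
    """
    n = c.tp + c.fp + c.tn + c.fn
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    ppv = c.tp / (c.tp + c.fp)
    npv = c.tn / (c.tn + c.fn)
    # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": (c.tp + c.tn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


def net_benefit(c: ConfusionCounts, threshold: float) -> float:
    """Net benefit at a risk threshold strictly between 0 and 1.

    Standard decision curve definition (an assumption here):
    TP/n - FP/n * threshold / (1 - threshold).
    """
    n = c.tp + c.fp + c.tn + c.fn
    return c.tp / n - (c.fp / n) * threshold / (1 - threshold)


if __name__ == "__main__":
    # Invented counts for demonstration only; not taken from the study.
    example = ConfusionCounts(tp=40, fp=10, tn=40, fn=10)
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.3f}")
    print(f"net benefit at threshold 0.2: {net_benefit(example, 0.2):.3f}")
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds, recomputing the confusion counts at each threshold, and comparing the result with the treat-all and treat-none strategies.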

Results:

The final cohort included 2725 patients: 753 for model training, 321 for testing, and 1651 for external validation. In multivariable logistic regression, body mass index, preoperative carcinoembryonic antigen level, primary tumor site, clinical T stage, histological type, and tumor differentiation were independently associated with LNM. Among the seven models, random forest showed the most balanced performance. In the test set, this model had an AUC of 0.806, with an accuracy of 0.735, a sensitivity of 0.737, and a specificity of 0.734. In the external validation set, the AUC was 0.782, and accuracy, sensitivity, and specificity were 0.690, 0.661, and 0.708, respectively.

Conclusions:

An interpretable machine learning model estimated LNM risk in CRC with acceptable performance. Random forest showed stable discrimination in the independent external validation set. It may support individualized preoperative risk stratification, but prospective validation and implementation studies are still needed.

Clinical Trial:

Not applicable


 Citation

Please cite as:

Xu G, Sun X, Jiang J, Sun J, Li L, Cheng H, Zhu Y, Liu H

An Interpretable Machine Learning Model for Preoperative Assessment of LNM Risk in Colorectal Cancer Using Routinely Collected Clinical Data: Development and External Validation in a Multicenter Retrospective Study

JMIR Preprints. 12/05/2026:101162

DOI: 10.2196/preprints.101162

URL: https://preprints.jmir.org/preprint/101162

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.