JMIR Preprints #101162: An Interpretable Machine Learning Model for Preoperative Assessment of LNM Risk in Colorectal Cancer Using Routinely Collected Clinical Data: Development and External Validation in a Multicenter Retrospective Study

An Interpretable Machine Learning Model for Preoperative Assessment of LNM Risk in Colorectal Cancer Using Routinely Collected Clinical Data: Development and External Validation in a Multicenter Retrospective Study

Gehui Xu;
Xiaohe Sun;
Jiaxin Jiang;
Jin Sun;
Liu Li;
Haibo Cheng;
Yuekun Zhu;
Haiyi Liu

ABSTRACT

Background:

Preoperative assessment of lymph node metastasis (LNM) risk is needed to support risk stratification and individualized treatment planning in patients with colorectal cancer (CRC). However, imaging-based nodal staging may be affected by image quality and readers’ experience, and its ability to detect microscopic metastatic disease remains limited. Machine learning models based on routinely available clinical data may provide an accessible and interpretable approach to support individualized preoperative decision-making.

Objective:

This study aimed to develop an interpretable machine learning–based model to estimate the preoperative risk of LNM in patients with CRC. The model incorporated preoperative variables that are routinely obtained in clinical practice and was further evaluated in an independent external validation set.

Methods:

We retrospectively analyzed data from 2725 patients diagnosed with CRC at two independent hospitals. The internal cohort was randomly split at a 7:3 ratio for model training and testing. The second-center cohort served as the independent external validation set. The outcome was pathologically confirmed regional LNM. Candidate variables included demographic characteristics, laboratory indicators, tumor markers, and tumor-related clinicopathological features available before surgery. Variables independently associated with LNM were identified using logistic regression analyses. Seven machine learning models were constructed using LightGBM, random forest, support vector machine, logistic regression, decision tree, XGBoost, and naive Bayes. Model performance was evaluated in terms of discrimination, calibration, clinical utility, and classification metrics. Discrimination was assessed using the area under the receiver operating characteristic curve (AUC). Classification performance was described by accuracy, sensitivity, specificity, F1 score, positive predictive value (PPV), and negative predictive value (NPV). Calibration curves compared predicted risks with observed outcomes. Decision curve analysis estimated the model’s net clinical benefit. SHapley Additive exPlanations (SHAP) analysis was used to interpret the selected model and assess predictor contributions.
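The evaluation metrics named above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code: the function names (`auc_score`, `classification_metrics`, `net_benefit`), the 0.5 decision threshold, and the ten toy labels and probabilities are all assumptions made for illustration and are not study data. The net benefit uses the usual decision curve analysis definition, TP/n minus FP/n weighted by the odds of the threshold probability.

```python
from typing import Dict, Sequence


def auc_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the chance that a random positive is ranked above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_counts(y_true: Sequence[int], y_prob: Sequence[float], threshold: float):
    """Return (tp, fp, tn, fn) when predicting positive for probabilities >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_prob, threshold: float = 0.5) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, PPV, NPV, and F1 at one threshold.

    Assumes every denominator is non-zero (both classes present and predicted).
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def net_benefit(y_true, y_prob, threshold: float) -> float:
    """Decision-curve net benefit at a threshold probability strictly between 0 and 1."""
    tp, fp, _, _ = confusion_counts(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - (fp / n) * (threshold / (1 - threshold))


# Toy data (hypothetical): 1 = LNM present, 0 = LNM absent.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
toy_probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7, 0.5, 0.35]

toy_auc = auc_score(toy_labels, toy_probs)
toy_metrics = classification_metrics(toy_labels, toy_probs, threshold=0.5)
toy_nb = net_benefit(toy_labels, toy_probs, threshold=0.5)

print(f"AUC: {toy_auc:.3f}")
for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
print(f"net benefit at 0.5: {toy_nb:.3f}")
```

Repeating `net_benefit` over a grid of thresholds and plotting the result against the "treat all" and "treat none" strategies would give a decision curve. The pairwise AUC loop is quadratic in sample size, so it suits small illustrations rather than full cohorts.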

Results:

The final cohort included 2725 patients: 753 for model training, 321 for testing, and 1651 for external validation. In multivariable logistic regression, body mass index, preoperative carcinoembryonic antigen level, primary tumor site, clinical T stage, histological type, and tumor differentiation were independently associated with LNM. Among the seven models, random forest showed the most balanced performance. In the test set, this model had an AUC of 0.806, with an accuracy of 0.735, sensitivity of 0.737, and specificity of 0.734. In the external validation set, the AUC was 0.782, and accuracy, sensitivity, and specificity were 0.690, 0.661, and 0.708, respectively.
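As a quick consistency check on the cohort sizes reported above, the short sketch below sums the three subsets and computes the training share of the internal cohort. The variable names are illustrative assumptions; only the counts come from the abstract.

```python
# Cohort sizes as reported in the Results.
train_n = 753
test_n = 321
external_n = 1651

internal_n = train_n + test_n        # internal (first-center) cohort
total_n = internal_n + external_n    # should match the reported final cohort
train_fraction = train_n / internal_n

print(f"internal cohort: {internal_n}")
print(f"total cohort: {total_n}")
print(f"training share of internal cohort: {train_fraction:.3f}")
```

The total matches the reported 2725 patients, and the training share of about 0.70 is consistent with the stated 7:3 split.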

Conclusions:

An interpretable machine learning model estimated LNM risk in CRC with acceptable performance. Random forest showed stable discrimination in the independent external validation set. It may support individualized preoperative risk stratification, but prospective validation and implementation studies are still needed.

Clinical Trial: Not applicable

Citation

Please cite as:

Xu G, Sun X, Jiang J, Sun J, Li L, Cheng H, Zhu Y, Liu H

An Interpretable Machine Learning Model for Preoperative Assessment of LNM Risk in Colorectal Cancer Using Routinely Collected Clinical Data: Development and External Validation in a Multicenter Retrospective Study

JMIR Preprints. 12/05/2026:101162

DOI: 10.2196/preprints.101162

URL: https://preprints.jmir.org/preprint/101162

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Currently submitted to: JMIR Cancer

Date Submitted: May 12, 2026

Open Peer Review Period: May 28, 2026 - Jul 23, 2026

(closed for review but you can still tweet)

NOTE: This is an unreviewed Preprint
