Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 11, 2025
Date Accepted: Jul 30, 2025
Predicting Lymph Node Metastasis in Rectal Cancer: Development and Validation of a Machine Learning Model Using Clinical Data
ABSTRACT
Background:
Rectal cancer (RC) is a common malignant tumor, and lymph node metastasis (LNM) is a critical determinant of patient prognosis. Traditional diagnostic methods have limitations, which motivates the development of predictive models based on clinical data.
Objective:
This study aimed to construct and validate machine learning models to predict LNM risk in RC patients based on clinical data.
Methods:
Retrospective data from 2,454 RC patients in the SEER database were split into training (n=1,954) and internal validation (n=500) sets. An external cohort (n=500) was obtained from the First Affiliated Hospital of Anhui Medical University. Lymph node features identified on CT scans were integrated with clinicopathological data. Variables were selected using least absolute shrinkage and selection operator (LASSO) regression, followed by univariate and multivariate logistic regression. Eleven ML models were evaluated: logistic regression (LR), k-nearest neighbors (KNN), extremely randomized trees (ET), naive Bayes (NB), XGBoost (XGB), LightGBM (LGBM), multilayer perceptron (MLP), gradient boosting (GB), support vector machine (SVM), random forest (RF), and AdaBoost (AB). Performance was assessed using the area under the receiver operating characteristic curve (AUC), calibration curves, and decision curve analysis (DCA).
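As an illustration of the AUC metric named above, the following is a minimal sketch, not the authors' implementation. It computes AUC as the probability that a randomly chosen LNM-positive case receives a higher predicted risk than a randomly chosen LNM-negative case. The function name `auc` and the toy labels and scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    labels: 1 for LNM present, 0 for LNM absent.
    scores: predicted risk from any model. Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the study.
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
print(round(auc(y_true, y_prob), 3))  # 8 of 9 pairs ranked correctly -> 0.889
```

The pairwise loop is quadratic in cohort size. That is acceptable for a sketch, but a rank-based formulation would be preferable for cohorts of several thousand patients.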
Results:
LNM prevalence was 26.9% (training), 27% (internal validation), and 81% (external validation). Independent LNM predictors included tumor grade, clinical T stage, N stage, tumor length, neural invasion, and total lymph nodes. Internal validation AUC ranged from 0.859 to 0.964; external validation AUC ranged from 0.735 to 0.838. In the internal validation set, RF and ET achieved the highest AUC (0.964, 95% CI: 0.950–0.978), while XGBoost demonstrated superior cross-cohort stability (AUC=0.942, 95% CI: 0.925–0.959). For external validation, GB had the highest AUC (0.838, 95% CI: 0.801–0.875), followed by XGBoost (0.832, 95% CI: 0.794–0.869). XGBoost showed minimal calibration error, with curves closest to the ideal diagonal, and yielded the highest net benefit in DCA across critical thresholds.
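For readers unfamiliar with the net benefit reported in DCA, the sketch below uses the standard definition: true positives per patient minus false positives per patient, weighted by the odds of the threshold probability. It is a minimal illustration under that assumption and does not reproduce the authors' analysis. The function names and toy data are hypothetical.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are penalised by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data, not taken from the study.
y_true = [1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
for pt in (0.2, 0.5, 0.75):
    print(pt, net_benefit(y_true, y_prob, pt), net_benefit_treat_all(y_true, pt))
```

A decision curve plots these values over a range of thresholds. A model is useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is 0.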
Conclusions:
This study developed and validated 11 ML models to predict LNM risk in RC. Ten models achieved AUC > 0.9 in internal validation and 7 achieved AUC > 0.8 in external validation; XGBoost performed best overall. The identified predictors of LNM can facilitate early diagnosis and personalized treatment, highlighting the potential of integrating CT scan data with clinicopathological findings to build effective predictive models.
Trial Registration:
chictr.org.cn ChiCTR2400094858
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.