Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Oct 29, 2025
Date Accepted: Jun 3, 2026
Leveraging Large Language Models to Integrate Clinical Knowledge and Machine Learning Predictions for Lymph Node Metastasis Prediction: A Knowledge-Augmented Framework
ABSTRACT
Background:
Lymph node metastasis (LNM) is a critical clinical indicator for determining the initial treatment strategy for patients with lung cancer. However, accurately diagnosing LNM preoperatively remains a significant challenge. Data-driven predictive modeling has become a mainstream approach to this problem, yet it often overlooks existing clinical knowledge.
Objective:
Large language models (LLMs) have demonstrated the potential to predict clinical risks in a zero-shot manner based on the extensive clinical knowledge learned from large-scale corpora. This study aims to investigate the integration of LLM-derived knowledge with data-driven patterns to enhance the accuracy of LNM prediction.
Methods:
We propose a novel ensemble framework that combines the strengths of LLMs and machine learning (ML) models for LNM prediction in lung cancer. Specifically, three ML models were trained on clinical data, and their predicted probabilities, along with the original clinical features, were incorporated into prompts for LLMs. Three LLMs (GPT-4o, GPT-o4-mini, and DeepSeek-R1) each independently predicted LNM risk five times, and four ensemble strategies were applied to aggregate these predictions into a final outcome.
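The pipeline described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the prompt wording, the ML model names, the feature names, or the four ensemble strategies. Everything named below (`build_prompt`, `aggregate`, the feature and model labels, and the mean, median, and vote rules) is therefore a hypothetical stand-in. The LLM call itself is omitted, and the five repeated predictions are supplied as plain numbers.

```python
from statistics import mean, median


def build_prompt(features, ml_probs):
    """Combine clinical features and ML-predicted probabilities into one prompt.

    The wording is illustrative only; the study's actual prompt is not given
    in the abstract.
    """
    feature_lines = [f"- {name}: {value}" for name, value in features.items()]
    prob_lines = [f"- {model}: {prob:.2f}" for model, prob in ml_probs.items()]
    return "\n".join(
        ["Patient clinical features:"]
        + feature_lines
        + ["Predicted LNM probabilities from ML models:"]
        + prob_lines
        + ["Estimate the probability of lymph node metastasis (0 to 1)."]
    )


def aggregate(runs, strategy="mean", threshold=0.5):
    """Aggregate repeated LLM risk estimates into a single score.

    The three strategies here are assumed examples, not the four strategies
    used in the study.
    """
    if not runs:
        raise ValueError("at least one prediction is required")
    if strategy == "mean":
        return mean(runs)
    if strategy == "median":
        return median(runs)
    if strategy == "vote":
        # Fraction of runs that call the patient positive at the threshold.
        return sum(p >= threshold for p in runs) / len(runs)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical patient, hypothetical ML outputs, and five hypothetical LLM runs.
prompt = build_prompt(
    {"age": 63, "tumor_size_mm": 24},
    {"ml_model_a": 0.31, "ml_model_b": 0.27, "ml_model_c": 0.35},
)
llm_runs = [0.1, 0.2, 0.7, 0.8, 0.9]
scores = {s: aggregate(llm_runs, s) for s in ("mean", "median", "vote")}
```

Repeating each LLM prediction and then aggregating is one way to dampen run-to-run variation in model output; which aggregation rule works best is an empirical question that the study evaluates.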
Results:
The proposed approach was evaluated on clinical data from 767 lung cancer patients at Peking University Cancer Hospital. The ensemble framework significantly outperformed standalone ML models, achieving an area under the curve (AUC) of 0.778 and an average precision (AP) of 0.418. Furthermore, reasoning-oriented LLMs performed better than base chat LLMs within the ensemble framework.
Conclusions:
This study presents a concise and effective strategy for integrating the clinical knowledge embedded in LLMs with the latent data–outcome relationships captured by ML models, offering a promising direction for improving LNM prediction in lung cancer.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.