JMIR Preprints #78519: CanRisk-RAG: A Knowledge-guided and Explainable Recommendation Tool for Cancer Risk Prediction Models based on Retrieval-Augmented Large Language Models


CanRisk-RAG: A Knowledge-guided and Explainable Recommendation Tool for Cancer Risk Prediction Models based on Retrieval-Augmented Large Language Models

Shumin Ren;
Xin Zheng;
Jing Zhao;
Jiale Du;
Yuxin Zhang;
Cheng Bi;
Jie Song;
Jinyi Zhang;
Hongmei Lang;
Zhang Fan;
Bairong Shen

ABSTRACT

Background:

Early prevention and screening are essential for effective cancer management. Cancer risk prediction models play a vital role in identifying high-risk individuals. Despite decades of research and the development of hundreds of models, their adoption in clinical and research settings remains limited due to the fragmented nature of the literature and the difficulty in selecting context-appropriate models. Existing platforms like PubMed and general-purpose large language models (LLMs) are inadequate for domain-specific queries and model discovery.

Objective:

This study aims to develop CanRisk-RAG, a retrieval-augmented, knowledge-driven recommendation platform that supports personalized and evidence-grounded decision-making in cancer risk model selection.

Methods:

CanRisk-RAG is built upon a structured knowledgebase containing 802 prediction models spanning over 10 cancer types, 13 modeling approaches, and 3,095 predictive variables. Model information is embedded into a vectorized database using high-performance language embeddings. User queries are processed by DeepSeek-V3 to extract semantic tags for targeted retrieval via a FAISS-based vector index. A multi-factor ranking algorithm combines semantic similarity, journal impact factor, model AUC, and publication recency to prioritize results. Each recommended model is presented with 36 data fields and an LLM-generated literature summary to enhance interpretability. System performance was evaluated across 12 task-specific queries and five assessment dimensions (relevance, reliability, authenticity, data completeness, and consistency), with comparison against three LLMs (ChatGPT-4o, Scholar AI, Gemini 1.5 Flash) and PubMed. Usability testing was conducted using three evaluation metrics.
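The multi-factor ranking step described above can be illustrated with a minimal sketch. The abstract names the four factors (semantic similarity, journal impact factor, model AUC, and publication recency) but does not state how they are normalized or weighted, so the min-max normalization, the weight values, and all names below (`ModelRecord`, `rank_models`, `WEIGHTS`) are assumptions made for illustration only, not the authors' implementation. Cosine similarity over plain Python lists stands in for the FAISS-based vector search.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ModelRecord:
    """Hypothetical record for one prediction model in the knowledge base."""
    name: str
    embedding: list[float]   # vector representation of the model description
    impact_factor: float     # journal impact factor
    auc: float               # reported model AUC
    year: int                # publication year


# Assumed weights; the abstract does not report the actual values.
WEIGHTS = {"similarity": 0.6, "impact_factor": 0.1, "auc": 0.2, "recency": 0.1}


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(values):
    """Rescale values to [0, 1]; all zeros when every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_models(query_embedding, records, weights=WEIGHTS):
    """Return (name, score) pairs sorted by weighted score, best first."""
    sims = [cosine(query_embedding, r.embedding) for r in records]
    impact = min_max([r.impact_factor for r in records])
    aucs = min_max([r.auc for r in records])
    recency = min_max([r.year for r in records])
    scored = []
    for r, s, i, a, y in zip(records, sims, impact, aucs, recency):
        score = (weights["similarity"] * s
                 + weights["impact_factor"] * i
                 + weights["auc"] * a
                 + weights["recency"] * y)
        scored.append((r.name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for the example.
demo_records = [
    ModelRecord("model_a", [1.0, 0.0], impact_factor=5.0, auc=0.80, year=2020),
    ModelRecord("model_b", [0.0, 1.0], impact_factor=50.0, auc=0.90, year=2024),
]
for name, score in rank_models([1.0, 0.0], demo_records):
    print(f"{name}: {score:.2f}")
```

Under these assumed weights, semantic similarity dominates the score, so a closely matching model can outrank a newer model from a higher-impact journal. A different weighting would change that trade-off.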

Results:

CanRisk-RAG significantly outperformed all comparator tools in all evaluated dimensions. On a 10-point scale, it achieved particularly high scores for relevance (9.50 ± 0.86) and reliability (8.75 ± 1.26), compared with Scholar AI (5.38 ± 2.63 and 4.42 ± 2.55, respectively), ChatGPT-4o (2.67 ± 2.24 and 1.58 ± 1.25), Gemini 1.5 Flash (1.54 ± 0.78 and 1.54 ± 0.98), and PubMed (2.33 ± 2.39 and 4.46 ± 3.40) (P < 0.05). The system demonstrated strong capability in handling complex, multi-factorial queries and consistently delivered accurate, well-contextualized results. Usability testing further confirmed the tool’s accessibility and value in research workflows.

Conclusions:

CanRisk-RAG offers a scalable, explainable, and user-friendly solution to bridge the gap between cancer risk model development and real-world application. Its integration of structured biomedical knowledge with LLM-enhanced semantic search and multi-criteria ranking provides a new paradigm for model recommendation. This framework can be extended to other biomedical domains to support advanced, domain-specific information retrieval and decision-making.

Clinical Trial: Not applicable

Citation

Please cite as:

Ren S, Zheng X, Zhao J, Du J, Zhang Y, Bi C, Song J, Zhang J, Lang H, Fan Z, Shen B

Knowledge-Guided Explainable Recommendation Tool for Cancer Risk Prediction Models Using Retrieval-Augmented Large Language Models: Development and Validation Study

JMIR Med Inform 2026;14:e78519

DOI: 10.2196/78519

PMID: 41813328

PMCID: 12978927


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.


Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 4, 2025

Date Accepted: Jan 29, 2026
