JMIR Preprints #88614: Benchmarking LLMs and Prompt Engineering Strategies for Consensus and Frontier Knowledge in Microsatellite Instability Cancers

Benchmarking LLMs and Prompt Engineering Strategies for Consensus and Frontier Knowledge in Microsatellite Instability Cancers

Yuxin Zhang;
Jie Song;
Cheng Bi;
Xin Zheng;
Zhichuan Xu;
Dan Cao;
Bairong Shen

ABSTRACT

Background:

The reliability of general-purpose Large Language Models (LLMs) for complex clinical tasks in specialized domains like microsatellite instability (MSI) cancers remains critically uncharacterized. The absence of a domain-specific benchmark to evaluate and guide the optimization of their capabilities across diverse clinical tasks poses unevaluated risks to patient safety.

Objective:

The primary objective was to develop and validate MSIC-Bench, a novel, two-tiered benchmark for MSI cancer covering both consensus and frontier knowledge. Using this framework, we aimed to systematically assess LLM performance across various prompting strategies, identify task-specific weaknesses, and reveal effective pathways for performance improvement.

Methods:

We developed MSIC-Bench, a 500-question benchmark derived from clinical guidelines and a curated knowledge base. Three state-of-the-art LLMs (GPT-4o, Gemini 2.5 Pro, and Claude Opus 4) were evaluated using four prompting strategies: vanilla, Chain-of-Thought (CoT), Reflection of Thoughts (RoT), and Retrieval-Augmented Generation (RAG). Each was tested under both multiple-choice and open-ended modalities. Performance was assessed on accuracy, safety (honesty), error composition, and token usage.
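As an illustration of the evaluation design described above, the following minimal Python sketch enumerates the model, strategy, and modality combinations. It assumes every model was paired with every strategy under both modalities, which the abstract implies but does not state outright; the names `MODELS`, `STRATEGIES`, `MODALITIES`, and `evaluation_grid` are hypothetical and do not come from the authors' code.

```python
from itertools import product

# Labels taken from the Methods paragraph; the full cross-product is an assumption.
MODELS = ["GPT-4o", "Gemini 2.5 Pro", "Claude Opus 4"]
STRATEGIES = ["vanilla", "CoT", "RoT", "RAG"]
MODALITIES = ["multiple-choice", "open-ended"]


def evaluation_grid():
    """Return one dict per (model, strategy, modality) configuration."""
    return [
        {"model": model, "strategy": strategy, "modality": modality}
        for model, strategy, modality in product(MODELS, STRATEGIES, MODALITIES)
    ]


if __name__ == "__main__":
    grid = evaluation_grid()
    # 3 models x 4 strategies x 2 modalities
    print(len(grid), "configurations")
```

Under this assumption the grid holds 24 configurations, each of which would be scored on the four reported measures (accuracy, safety, error composition, and token usage).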

Results:

A significant 'scaffolding effect' was observed, with average LLM accuracy dropping from 89.81% in multiple-choice formats to 76.56% in open-ended scenarios. Our task-specific analysis revealed that this decline was most pronounced in complex therapeutic decision-making tasks. Error analysis attributed failures in non-RAG models primarily to insufficient domain knowledge (55.51% of errors), manifesting as a high frequency of unsafe fabrication. The integration of RAG proved highly effective, substantially improving accuracy in these critical tasks (e.g., boosting Claude's performance from 76.8% to 90.4%) and inducing a crucial shift towards safety by increasing explicit statements of uncertainty (from 6.70% to 16.55% on average, and up to 75% in specific cases). Notably, these gains were achieved with significantly lower token usage (RAG: 115 tokens vs. CoT: 398 and RoT: 613 tokens on average for GPT-4o).
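To make the size of these effects easier to compare, the short Python sketch below recomputes the differences from the figures quoted in the Results paragraph. It uses only the reported percentages and token counts; the helper names `point_change` and `relative_tokens` are hypothetical, and the calculation is a reader's arithmetic check rather than part of the authors' analysis.

```python
def point_change(before, after):
    """Difference between two percentages, in percentage points."""
    return round(after - before, 2)


def relative_tokens(rag, other):
    """RAG token usage expressed as a fraction of another strategy's usage."""
    return round(rag / other, 3)


# Figures quoted in the Results paragraph.
accuracy_drop = point_change(89.81, 76.56)    # multiple-choice -> open-ended
claude_gain = point_change(76.8, 90.4)        # Claude without RAG -> with RAG
uncertainty_rise = point_change(6.70, 16.55)  # explicit uncertainty statements
rag_vs_cot = relative_tokens(115, 398)        # GPT-4o average tokens
rag_vs_rot = relative_tokens(115, 613)

if __name__ == "__main__":
    print(accuracy_drop, claude_gain, uncertainty_rise, rag_vs_cot, rag_vs_rot)
```

On these numbers, the open-ended format costs about 13.25 percentage points of accuracy, RAG recovers 13.6 points for Claude, and RAG responses for GPT-4o use roughly 29% of the tokens of CoT and 19% of the tokens of RoT.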

Conclusions:

Our comprehensive evaluation reveals that LLMs lack the specialized domain knowledge required for complex MSI cancer-related tasks, rather than suffering from reasoning deficits. Prompting strategies substantially influence LLM accuracy, safety, and token usage, with RAG emerging as the most effective and reliable method for improving both accuracy and safety. Ultimately, MSIC-Bench provides a comprehensive resource for systematic evaluation and optimization of LLMs in the MSI cancer domain, and its two-tiered design offers a replicable blueprint for developing similar benchmarks in other knowledge-intensive medical fields.

Citation

Please cite as:

Zhang Y, Song J, Bi C, Zheng X, Xu Z, Cao D, Shen B

Benchmarking Large Language Models and Prompt Engineering Strategies in Microsatellite Instability Cancers: Evaluation Study

J Med Internet Res 2026;28:e88614

DOI: 10.2196/88614

PMID: 42166792

PMCID: 13193672

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 28, 2025

Date Accepted: Apr 10, 2026
