Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 28, 2025
Date Accepted: Apr 10, 2026

The final, peer-reviewed published version of this preprint can be found here:

Benchmarking Large Language Models and Prompt Engineering Strategies in Microsatellite Instability Cancers: Evaluation Study

Zhang Y, Song J, Bi C, Zheng X, Xu Z, Cao D, Shen B

J Med Internet Res 2026;28:e88614

DOI: 10.2196/88614

PMID: 42166792

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Benchmarking LLMs and Prompt Engineering Strategies for Consensus and Frontier Knowledge in Microsatellite Instability Cancers

  • Yuxin Zhang
  • Jie Song
  • Cheng Bi
  • Xin Zheng
  • Zhichuan Xu
  • Dan Cao
  • Bairong Shen

ABSTRACT

Background:

The reliability of general-purpose large language models (LLMs) for complex clinical tasks in specialized domains such as microsatellite instability (MSI) cancers remains largely uncharacterized. Without a domain-specific benchmark to evaluate and guide the optimization of their capabilities across diverse clinical tasks, the risks they pose to patient safety remain unevaluated.

Objective:

The primary objective was to develop and validate MSIC-Bench, a novel two-tiered benchmark for MSI cancers covering both consensus and frontier knowledge. Using this framework, we aimed to systematically assess LLM performance across various prompting strategies, identify task-specific weaknesses, and reveal effective pathways for performance improvement.

Methods:

We developed MSIC-Bench, a 500-question benchmark derived from clinical guidelines and a curated knowledge base. Three state-of-the-art LLMs (GPT-4o, Gemini 2.5 Pro, and Claude Opus 4) were evaluated using four prompting strategies (vanilla, Chain-of-Thought [CoT], Reflection of Thoughts [RoT], and Retrieval-Augmented Generation [RAG]) in both multiple-choice and open-ended modalities. Performance was assessed in terms of accuracy, safety (honesty), error composition, and token usage.
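As an illustration only, the evaluation design described above can be sketched as a grid of configurations. This sketch assumes that every model was crossed with every prompting strategy and both modalities, which the abstract implies but does not state outright. The function names `evaluation_grid` and `accuracy` are hypothetical and are not taken from the study's code.

```python
from itertools import product

# Names taken from the Methods text; the full crossing is an assumption.
MODELS = ["GPT-4o", "Gemini 2.5 Pro", "Claude Opus 4"]
STRATEGIES = ["vanilla", "CoT", "RoT", "RAG"]
MODALITIES = ["multiple-choice", "open-ended"]


def evaluation_grid():
    """Return every (model, strategy, modality) combination as a list of tuples."""
    return list(product(MODELS, STRATEGIES, MODALITIES))


def accuracy(graded):
    """Percentage of correct answers, given a list of booleans (True = correct)."""
    if not graded:
        raise ValueError("graded must contain at least one item")
    return 100.0 * sum(graded) / len(graded)


if __name__ == "__main__":
    grid = evaluation_grid()
    print(len(grid), "configurations")
    # Toy grading vector, not data from the study.
    print(accuracy([True, True, False, True]))
```

Under this assumption the grid contains 3 × 4 × 2 = 24 configurations, each of which would be run over the 500 benchmark questions.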

Results:

A significant "scaffolding effect" was observed, with average LLM accuracy dropping from 89.81% in the multiple-choice format to 76.56% in the open-ended format. Our task-specific analysis revealed that this decline was most pronounced in complex therapeutic decision-making tasks. Error analysis attributed failures in non-RAG models primarily to insufficient domain knowledge (55.51% of errors), manifesting as a high frequency of unsafe fabrication. The integration of RAG proved highly effective, substantially improving accuracy in these critical tasks (e.g., boosting Claude's performance from 76.8% to 90.4%) and inducing a crucial shift towards safety by increasing explicit statements of uncertainty (from 6.70% to 16.55% on average, and up to 75% in specific cases). Notably, these gains were achieved with significantly lower token usage (RAG: 115 tokens vs. CoT: 398 and RoT: 613 tokens on average for GPT-4o).
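The size of these effects can be made explicit with simple arithmetic on the figures reported above. The following is a minimal sketch that uses only the numbers stated in the Results. The derived quantities (percentage-point differences and ratios) are computed here for illustration and are not reported by the authors.

```python
# Figures quoted in the Results paragraph.
MC_ACC, OPEN_ACC = 89.81, 76.56            # average accuracy (%), by modality
CLAUDE_BEFORE, CLAUDE_AFTER = 76.8, 90.4   # Claude accuracy (%) without / with RAG
UNCERT_BEFORE, UNCERT_AFTER = 6.70, 16.55  # explicit uncertainty statements (%)
TOKENS = {"RAG": 115, "CoT": 398, "RoT": 613}  # average tokens, GPT-4o


def point_change(before, after):
    """Difference in percentage points between two percentages."""
    return after - before


def token_ratio(strategy, baseline="RAG"):
    """How many times more tokens a strategy uses than the baseline."""
    return TOKENS[strategy] / TOKENS[baseline]


if __name__ == "__main__":
    print(f"Scaffolding drop: {point_change(MC_ACC, OPEN_ACC):.2f} points")
    print(f"Claude gain with RAG: {point_change(CLAUDE_BEFORE, CLAUDE_AFTER):.1f} points")
    print(f"Uncertainty statements: x{UNCERT_AFTER / UNCERT_BEFORE:.2f}")
    print(f"CoT vs RAG tokens: x{token_ratio('CoT'):.2f}")
    print(f"RoT vs RAG tokens: x{token_ratio('RoT'):.2f}")
```

Read this way, the open-ended format costs roughly 13 percentage points of accuracy, and CoT and RoT use roughly three and five times as many tokens as RAG for GPT-4o.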

Conclusions:

Our comprehensive evaluation reveals that the main limitation of LLMs in complex MSI cancer-related tasks is a lack of specialized domain knowledge rather than a reasoning deficit. Prompting strategies substantially influence LLM accuracy, safety, and token usage, with RAG emerging as the most effective and reliable method for improving both accuracy and safety. Ultimately, MSIC-Bench provides a comprehensive resource for the systematic evaluation and optimization of LLMs in the MSI cancer domain, and its two-tiered design offers a replicable blueprint for developing similar benchmarks in other knowledge-intensive medical fields.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.