Currently submitted to: JMIR Medical Informatics
Date Submitted: Mar 17, 2026
Open Peer Review Period: Apr 8, 2026 - Jun 3, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Prompt Sensitivity and Answer Consistency in Clinical Question Answering Using Small Open-Source Language Models: Empirical Evaluation for Low-Resource Healthcare
ABSTRACT
Background:
Artificial intelligence is increasingly deployed in healthcare workflows, and small open-source language models are gaining attention as viable tools for low-resource settings where cloud infrastructure is unavailable. Despite their growing accessibility, the reliability of these models, particularly the stability of their outputs under different phrasings of the same clinical question, remains poorly understood.
Objective:
This study systematically evaluates prompt sensitivity and answer consistency in small open-source language models on clinical question answering benchmarks, with implications for low-resource healthcare deployment.
Methods:
Five open-source language models spanning distinct architectural and training paradigms (Phi-3 Mini, Llama 3.2, Gemma 2, Mistral 7B, and Meditron-7B) were evaluated across three clinical question answering datasets (MedQA, MedMCQA, PubMedQA) using five controlled prompt style variations, yielding 15,000 total inference calls conducted locally on consumer CPU hardware without fine-tuning. Consistency scores, accuracy, and instruction-following failure rates were measured and interpreted in the context of each model's training and architectural design.
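The abstract does not specify the consistency formula, so the following is only an illustrative sketch, not the authors' code, assuming consistency is measured as mean pairwise agreement among a model's answers to one question under the five prompt styles. Under that reading, the 15,000 calls decompose as 5 models × 3 datasets × 5 prompt styles × 200 questions per dataset (the 200-question sample is consistent with the MedQA counts reported in the Results).

```python
from itertools import combinations

def consistency_score(answers):
    """Mean pairwise agreement across prompt styles for a single question.

    `answers` holds the options (e.g. "A".."D", or "UNKNOWN" on an
    instruction-following failure) a model produced for the same question
    under each of the five prompt variants.
    """
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# One question, five prompt styles: four variants agree, one diverges.
print(consistency_score(["B", "B", "B", "C", "B"]))  # 0.6
```

Under this definition a score of 1.0 means every prompt phrasing elicited the same answer, regardless of whether that answer is correct.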
Results:
Consistency and accuracy were largely independent across models and datasets. Gemma 2 achieved the highest consistency scores (0.845 to 0.888) but the lowest accuracy (33.0% to 43.5%), producing perfectly consistent yet incorrect answers on 77 of 200 MedQA questions (38.5%), a failure mode termed "reliable incorrectness". Llama 3.2 demonstrated moderate consistency (0.774 to 0.807) alongside the highest accuracy (49.0% to 65.0%). Roleplay prompts consistently reduced accuracy across all models and datasets, with Phi-3 Mini showing the largest decline of 21.5 percentage points on MedQA. Instruction-following failure rates varied by model and were not determined by parameter count, with Phi-3 Mini exhibiting the highest UNKNOWN rate at 10.5% on MedQA. Meditron-7B, a domain-pretrained model without instruction tuning, exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), demonstrating that domain knowledge alone is insufficient for structured clinical question answering.
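The "reliable incorrectness" count (77 of 200 MedQA questions answered identically and wrongly under every prompt style) can be flagged mechanically. A minimal sketch, reusing the per-question answer lists from the snippet above and assuming one gold label per question; the records shown are hypothetical:

```python
def reliably_incorrect(answers, gold):
    """True when every prompt variant yields the same answer and it is wrong:
    perfect consistency (score 1.0) combined with zero accuracy."""
    return len(set(answers)) == 1 and answers[0] != gold

# Hypothetical per-question records: (answers across 5 prompt styles, gold label).
records = [(["C"] * 5, "B"), (["A", "A", "A", "A", "B"], "A")]
n_flagged = sum(reliably_incorrect(ans, gold) for ans, gold in records)
print(f"{n_flagged} of {len(records)} questions reliably incorrect")  # 1 of 2
```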
Conclusions:
High consistency does not imply correctness in small clinical language models; models can be reliably incorrect, representing a potentially dangerous failure mode in clinical decision support. Roleplay prompt styles may reduce reliability in healthcare AI applications. Among the models evaluated, Llama 3.2 demonstrated the strongest balance of accuracy and reliability for low-resource deployment. These findings highlight the necessity of multidimensional evaluation frameworks that assess consistency, accuracy, and instruction adherence jointly for safe clinical AI deployment.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.