Currently submitted to: JMIR Medical Informatics
Date Submitted: Mar 17, 2026
Open Peer Review Period: Apr 8, 2026 - Jun 3, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Prompt Sensitivity and Answer Consistency in Clinical Question Answering Using Small Open-Source Language Models: Empirical Evaluation for Low-Resource Healthcare
ABSTRACT
Background:
Artificial intelligence is increasingly deployed in healthcare workflows, and small open-source language models are gaining attention as viable tools for low-resource settings where cloud infrastructure is unavailable. Despite their growing accessibility, the reliability of these models, particularly the stability of their outputs under different phrasings of the same clinical question, remains poorly understood.
Objective:
This study systematically evaluates prompt sensitivity and answer consistency in small open-source language models on clinical question answering benchmarks, with implications for low-resource healthcare deployment.
Methods:
Five open-source language models spanning distinct architectural and training paradigms (Phi-3 Mini, Llama 3.2, Gemma 2, Mistral 7B, and Meditron-7B) were evaluated across three clinical question answering datasets (MedQA, MedMCQA, PubMedQA) using five controlled prompt style variations, yielding 15,000 total inference calls conducted locally on consumer CPU hardware without fine-tuning. Consistency scores, accuracy, and instruction-following failure rates were measured and interpreted in the context of each model's training and architectural design.
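The abstract does not specify the consistency formula, so the following is only an illustrative sketch, not the authors' code, assuming consistency is measured as mean pairwise agreement among a model's answers to one question under the five prompt styles. Under that reading, the 15,000 calls decompose as 5 models × 3 datasets × 5 prompt styles × 200 questions per dataset (the 200-question sample is consistent with the MedQA counts reported in the Results).

```python
from itertools import combinations

def consistency_score(answers):
    """Mean pairwise agreement across prompt styles for a single question.

    `answers` holds the options (e.g. "A".."D", or "UNKNOWN" on an
    instruction-following failure) a model produced for the same question
    under each of the five prompt variants.
    """
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# One question, five prompt styles: four variants agree, one diverges.
print(consistency_score(["B", "B", "B", "C", "B"]))  # 0.6
```

Under this definition a score of 1.0 means every prompt phrasing elicited the same answer, regardless of whether that answer is correct.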
Results:
Consistency and accuracy were largely independent across models and datasets. Gemma 2 achieved the highest consistency scores (0.845 to 0.888) but the lowest accuracy (33.0% to 43.5%), producing perfectly consistent yet incorrect answers on 77 of 200 MedQA questions (38.5%), a failure mode termed "reliable incorrectness". Llama 3.2 demonstrated moderate consistency (0.774 to 0.807) alongside the highest accuracy (49.0% to 65.0%). Roleplay prompts consistently reduced accuracy across all models and datasets, with Phi-3 Mini showing the largest decline of 21.5 percentage points on MedQA. Instruction-following failure rates varied by model and were not determined by parameter count, with Phi-3 Mini exhibiting the highest UNKNOWN rate at 10.5% on MedQA. Meditron-7B, a domain-pretrained model without instruction tuning, exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), demonstrating that domain knowledge alone is insufficient for structured clinical question answering.
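The "reliable incorrectness" count (77 of 200 MedQA questions answered identically and wrongly under every prompt style) can be flagged mechanically. A minimal sketch, reusing the per-question answer lists from the snippet above and assuming one gold label per question; the records shown are hypothetical:

```python
def reliably_incorrect(answers, gold):
    """True when every prompt variant yields the same answer and it is wrong:
    perfect consistency (score 1.0) combined with zero accuracy."""
    return len(set(answers)) == 1 and answers[0] != gold

# Hypothetical per-question records: (answers across 5 prompt styles, gold label).
records = [(["C"] * 5, "B"), (["A", "A", "A", "A", "B"], "A")]
n_flagged = sum(reliably_incorrect(ans, gold) for ans, gold in records)
print(f"{n_flagged} of {len(records)} questions reliably incorrect")  # 1 of 2
```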
Conclusions:
High consistency does not imply correctness in small clinical language models; models can be reliably incorrect, representing a potentially dangerous failure mode in clinical decision support. Roleplay prompt styles may reduce reliability in healthcare AI applications. Among the models evaluated, Llama 3.2 demonstrated the strongest balance of accuracy and reliability for low-resource deployment. These findings highlight the necessity of multidimensional evaluation frameworks that assess consistency, accuracy, and instruction adherence jointly for safe clinical AI deployment.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.