Accepted for/Published in: JMIR AI
Date Submitted: Sep 5, 2025
Open Peer Review Period: Sep 19, 2025 - Nov 14, 2025
Date Accepted: Jan 5, 2026
Performance of Large Language Models Under Input Variability in Healthcare Applications: Dataset Development and Experimental Evaluation
ABSTRACT
Background:
Large Language Models (LLMs) are increasingly integrated into healthcare, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.
Objective:
This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.
Methods:
We systematically evaluate three LLMs on three health-related tasks using a novel dataset containing three types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.
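As a rough illustration of the three variation types named above, the sketch below applies redaction, homophone substitution, or typographical (adjacent-character transposition) noise to a fraction of words given by a perturbation level. The function name `perturb`, the toy `HOMOPHONES` map, and the `[REDACTED]` token are hypothetical choices for this sketch, not the paper's actual dataset-generation code.

```python
import random

# Toy homophone map; the paper's dataset presumably uses a larger lexicon.
HOMOPHONES = {"patient": "patience", "weak": "week", "pain": "pane"}

def perturb(text: str, kind: str, level: float, seed: int = 0) -> str:
    """Perturb roughly `level` (0..1) of the whitespace-separated words in `text`.

    kind: "redaction" | "homophone" | "typo"
    """
    rng = random.Random(seed)
    out = []
    for w in text.split():
        if rng.random() >= level:        # leave this word untouched
            out.append(w)
            continue
        if kind == "redaction":
            out.append("[REDACTED]")     # remove the word's content entirely
        elif kind == "homophone":
            out.append(HOMOPHONES.get(w.lower(), w))  # swap only if a homophone is known
        elif kind == "typo":
            if len(w) > 1:
                i = rng.randrange(len(w) - 1)
                w = w[:i] + w[i + 1] + w[i] + w[i + 2:]  # transpose adjacent characters
            out.append(w)
        else:
            out.append(w)
    return " ".join(out)
```

Varying `level` per perturbation type would yield the graded noise conditions the evaluation describes.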
Results:
Contrary to expectations, LLMs demonstrate notable robustness to common variations, with some cases showing improved performance at lower perturbation levels. Redactions, often stemming from privacy concerns or cognitive lapses, are more detrimental than other variations.
Conclusions:
Our findings highlight the need for healthcare applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this work provides actionable insights for improving model resilience and guiding the development of safer, more effective AI tools in healthcare. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions.
Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.