Accepted for/Published in: JMIR AI

Date Submitted: Sep 5, 2025
Open Peer Review Period: Sep 19, 2025 - Nov 14, 2025
Date Accepted: Jan 5, 2026

The final, peer-reviewed published version of this preprint can be found here:

Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation

Joshi S, Mehta M, Maniar S, Wang M, Singh VK

JMIR AI 2026;5:e83640

DOI: 10.2196/83640

PMID: 41719488

PMCID: 12923095

Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation

  • Saubhagya Joshi; 
  • Monjil Mehta; 
  • Sarjak Maniar; 
  • Mengqian Wang; 
  • Vivek Kumar Singh

ABSTRACT

Background:

Large Language Models (LLMs) are increasingly integrated into healthcare, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.

Objective:

This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.

Methods:

We systematically evaluate three LLMs on three health-related tasks using a novel dataset containing three types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.
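The three perturbation types can be sketched as follows. This is an illustrative sketch only, not the authors' released dataset-construction code: the function names, the toy homophone map, and the interpretation of "perturbation level" as the fraction of affected words are all assumptions.

```python
import random

# Toy homophone map for illustration; a real dataset would use a larger lexicon.
HOMOPHONES = {"their": "there", "patient": "patience", "weak": "week"}

def redact(text: str, level: float, rng: random.Random) -> str:
    """Replace a fraction `level` of words with a redaction placeholder."""
    words = text.split()
    k = round(level * len(words))
    for i in rng.sample(range(len(words)), k):
        words[i] = "[REDACTED]"
    return " ".join(words)

def typos(text: str, level: float, rng: random.Random) -> str:
    """Introduce typographical errors by swapping adjacent characters
    in a fraction `level` of words."""
    words = text.split()
    k = round(level * len(words))
    for i in rng.sample(range(len(words)), k):
        w = words[i]
        if len(w) > 2:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def homophones(text: str, level: float, rng: random.Random) -> str:
    """Substitute homophones for a fraction `level` of eligible words."""
    words = text.split()
    eligible = [i for i, w in enumerate(words) if w.lower() in HOMOPHONES]
    k = round(level * len(eligible))
    for i in rng.sample(eligible, k):
        words[i] = HOMOPHONES[words[i].lower()]
    return " ".join(words)

rng = random.Random(0)
sample = "The patient reported feeling weak after their medication"
print(redact(sample, 0.25, rng))
```

Applying each function at several `level` values to the same source text yields the graded perturbation conditions the evaluation compares.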

Results:

Contrary to expectations, LLMs demonstrate notable robustness to common variations, in some cases even showing improved performance at lower perturbation levels. Redactions, which often stem from privacy concerns or cognitive lapses, are more detrimental than the other variations.

Conclusions:

Our findings highlight the need for healthcare applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this work provides actionable insights for improving model resilience and guiding the development of safer, more effective AI tools in healthcare. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions. Clinical Trial: (N/A)




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.