Accepted for/Published in: JMIR AI

Date Submitted: Sep 5, 2025
Open Peer Review Period: Sep 19, 2025 - Nov 14, 2025
Date Accepted: Jan 5, 2026

The final, peer-reviewed published version of this preprint can be found here:

Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation

Joshi S, Mehta M, Maniar S, Wang M, Singh VK

JMIR AI 2026;5:e83640

DOI: 10.2196/83640

PMID: 41719488

PMCID: 12923095

Performance of Large Language Models Under Input Variability in Health Care Applications: Dataset Development and Experimental Evaluation

  • Saubhagya Joshi; 
  • Monjil Mehta; 
  • Sarjak Maniar; 
  • Mengqian Wang; 
  • Vivek Kumar Singh

ABSTRACT

Background:

Large Language Models (LLMs) are increasingly integrated into healthcare, where they contribute to patient care, administrative efficiency, and clinical decision-making. Despite their growing role, the ability of LLMs to handle imperfect inputs remains underexplored. These imperfections, which are common in clinical documentation and patient-generated data, may affect model reliability.

Objective:

This study investigates the impact of input perturbations on LLM performance across three dimensions: (1) overall effectiveness in different health-related applications, (2) comparative effects of different types and levels of perturbations, and (3) differential impact of perturbations on health-related terms versus non-health-related terms.

Methods:

We systematically evaluate three LLMs on three health-related tasks using a novel dataset containing three types of human-like variations (redaction, homophones, and typographical errors) at different perturbation levels.
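The three perturbation types can be sketched as follows. This is an illustrative sketch only, not the authors' released dataset-construction code: the function names, the toy homophone map, and the interpretation of "perturbation level" as the fraction of affected words are all assumptions.

```python
import random

# Toy homophone map for illustration; a real dataset would use a larger lexicon.
HOMOPHONES = {"their": "there", "patient": "patience", "weak": "week"}

def redact(text: str, level: float, rng: random.Random) -> str:
    """Replace a fraction `level` of words with a redaction placeholder."""
    words = text.split()
    k = round(level * len(words))
    for i in rng.sample(range(len(words)), k):
        words[i] = "[REDACTED]"
    return " ".join(words)

def typos(text: str, level: float, rng: random.Random) -> str:
    """Introduce typographical errors by swapping adjacent characters
    in a fraction `level` of words."""
    words = text.split()
    k = round(level * len(words))
    for i in rng.sample(range(len(words)), k):
        w = words[i]
        if len(w) > 2:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def homophones(text: str, level: float, rng: random.Random) -> str:
    """Substitute homophones for a fraction `level` of eligible words."""
    words = text.split()
    eligible = [i for i, w in enumerate(words) if w.lower() in HOMOPHONES]
    k = round(level * len(eligible))
    for i in rng.sample(eligible, k):
        words[i] = HOMOPHONES[words[i].lower()]
    return " ".join(words)

rng = random.Random(0)
sample = "The patient reported feeling weak after their medication"
print(redact(sample, 0.25, rng))
```

Applying each function at several `level` values to the same source text yields the graded perturbation conditions the evaluation compares.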

Results:

Contrary to expectations, LLMs demonstrate notable robustness to common variations, in some cases even showing improved performance at lower perturbation levels. Redactions, which often stem from privacy concerns or cognitive lapses, are more detrimental than the other variations.

Conclusions:

Our findings highlight the need for healthcare applications powered by LLMs to be designed with input variability in mind. Robustness to noisy or imperfect inputs is essential for maintaining reliability in real-world clinical settings, where data quality can vary widely. By identifying specific vulnerabilities and strengths, this work provides actionable insights for improving model resilience and guiding the development of safer, more effective AI tools in healthcare. The accompanying dataset offers a valuable resource for further research into LLM performance under diverse conditions. Clinical Trial: (N/A)




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.