Currently accepted at: Journal of Medical Internet Research
Date Submitted: Dec 11, 2025
Date Accepted: May 17, 2026
This paper has been accepted and is currently in production.
It will appear shortly on 10.2196/88126
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Quantifying Factors that Drive Trust and Satisfaction with AI Health Chatbots: A Mixed-Methods Vignette Survey of Caregivers for Pediatric Infectious Diseases
ABSTRACT
Background:
AI chatbots like ChatGPT and Claude have become popular sources of instant medical advice, particularly in urgent situations such as a parent managing a child's fever at night. While this accessibility benefits millions of users, it creates significant public health concerns when responses are inaccurate, unclear, or inappropriate. Headlines about hallucinations focus on accuracy, yet they overlook other qualities that shape real decisions, such as comprehensibility, appropriate scope, empathy, and trustworthiness from the patient’s point of view. Several human evaluation frameworks have been developed to capture these nuances. However, they were built top-down from expert interviews and literature reviews, and thus lack empirical evidence to support their validity or effectiveness in the real world. It is still unknown which dimensions matter most to users, what their relative importance is, and whether users interpret them consistently. Without empirical evidence from real end-users, existing frameworks may miss the subtle trade-offs and context-specific needs that emerge in actual chatbot interactions. Prior work has also leaned on clinicians as proxies for patients and caregivers, which underrepresents the views of actual end-users.
Objective:
We conducted an exploratory study to examine how established evaluation dimensions influence user satisfaction in practice and to uncover potential gaps in existing frameworks. We aimed to move beyond current top-down approaches by providing empirical evidence about which evaluation dimensions truly drive user satisfaction, their relative importance, and whether users perceive them consistently.
Methods:
We conducted a mixed-methods vignette survey to evaluate stakeholder satisfaction with AI chatbot responses to pediatric health questions. Each participant compared two GPT-4 responses to the same question, rated them on eight commonly cited dimensions, and then indicated overall satisfaction. We quantified the unique influence of each dimension on overall satisfaction using a Cumulative Link Mixed Model (CLMM). To probe why certain features mattered, we conducted an inductive thematic analysis of participants’ open-ended responses, uncovering unmet information needs and revealing potential gaps in existing evaluation frameworks.
Results:
The study recruited 151 caregivers to rate chatbot responses to three clinician-approved pediatric health vignettes across eight common evaluation dimensions, yielding 906 individual ratings. We first confirmed the face validity of existing human evaluation frameworks: caregivers affirmed the relevance of the eight evaluation dimensions, with median ratings all above “moderate importance”. The CLMM suggested that perceived Usefulness (odds ratio=2.49, p<0.001) and Thoroughness (odds ratio=2.05, p<0.001) were the strongest drivers of overall user satisfaction: each one-point increase on a 5-point scale more than doubled the odds of higher overall satisfaction. All other dimensions except Comprehensiveness demonstrated statistically significant positive effects, with moderate odds ratios ranging from 1.44 to 1.70. Comprehensiveness failed to predict satisfaction once other dimensions were controlled for. Our qualitative analysis further revealed nuances that existing evaluation frameworks failed to capture and pointed to three design imperatives for future researchers: balancing concise yet actionable information, dynamically adjusting tone based on context and user preferences, and issuing timely disclaimers and urgency indicators.
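To make the reported odds ratios concrete, the short Python sketch below reproduces the rating count from the study design and shows what an odds ratio implies for the chance of exceeding a given satisfaction level under the proportional-odds assumption of a cumulative link model. The baseline probability of 0.5 and the helper name `shift_probability` are illustrative assumptions, not values or code from the study.

```python
import math


def shift_probability(p_baseline: float, odds_ratio: float) -> float:
    """Probability of exceeding a satisfaction threshold after a one-point
    increase in a predictor, given the baseline probability and the odds
    ratio (proportional-odds interpretation)."""
    odds = p_baseline / (1.0 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Design figures stated in the abstract: 151 caregivers, three vignettes,
# two chatbot responses compared per vignette.
participants, vignettes, responses_per_vignette = 151, 3, 2
total_ratings = participants * vignettes * responses_per_vignette

# Odds ratios reported for the two strongest dimensions.
odds_ratios = {"Usefulness": 2.49, "Thoroughness": 2.05}

# Hypothetical baseline: a 50% chance of rating above some threshold.
ASSUMED_BASELINE = 0.5

print(f"total ratings: {total_ratings}")
for name, ratio in odds_ratios.items():
    beta = math.log(ratio)  # coefficient on the log-odds scale
    shifted = shift_probability(ASSUMED_BASELINE, ratio)
    print(f"{name}: beta={beta:.3f}, P(above threshold) "
          f"{ASSUMED_BASELINE:.2f} -> {shifted:.3f}")
```

Under this assumed baseline, the shift in probability is smaller than the odds ratio alone might suggest, because odds ratios act on the odds rather than directly on the probability.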
Conclusions:
This study provides the first empirical evidence quantifying how different evaluation dimensions influence caregiver satisfaction with medical AI chatbots. While existing frameworks demonstrate face validity, not all dimensions carry equal weight in practice, and some—like empathy and comprehensiveness—operate in more complex, non-monotonic ways than previously assumed. The findings reveal critical design tensions that demand adaptive, context-aware solutions rather than one-size-fits-all approaches. Future medical chatbots must balance brevity with actionable depth, calibrate tone to situational urgency and individual preference, and deploy disclaimers strategically based on medical risk. These insights offer concrete guidance for both improving human evaluation protocols and designing chatbot systems that better serve the needs of real end-users in high-stakes health decisions.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.