Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Aug 2, 2025
Open Peer Review Period: Aug 15, 2025 - Oct 10, 2025
Date Accepted: Feb 24, 2026
Multidimensional Evaluation of AI Chatbot Responses to a Standardized Patient Query on MOGAD: A Blinded Expert Analysis
ABSTRACT
Background:
Large language model-based chatbots are increasingly used by the public to access medical information. While these tools offer considerable potential in terms of accessibility and scalability, their accuracy, transparency, and clarity remain insufficiently evaluated for rare and diagnostically complex conditions such as myelin oligodendrocyte glycoprotein antibody-associated disease (MOGAD).
Objective:
This study aimed to evaluate the quality, comprehensibility, transparency, and readability of responses generated by widely used AI chatbot platforms in response to a standardized, patient-centered question about MOGAD.
Methods:
We conducted a cross-sectional content analysis using the query: “What is MOGAD, and how is MOGAD treated?” Ten widely used chatbot platforms were selected to reflect diversity in architecture, access model, and functional design. Responses were collected on the same day, anonymized, and independently evaluated by seven blinded neurologists. Validated instruments were used, including DISCERN (treatment quality), the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P; understandability), and the Web Resource Rating (WRR; citation transparency), along with two readability metrics: the Flesch–Kincaid Grade Level (FKGL) and the Coleman–Liau Index (CLI). Chatbots were also compared by access type (free vs paid) and functional focus (conversation-based vs search-based). Inter-rater reliability was assessed using intraclass correlation coefficients (ICCs).
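The manuscript does not state which software was used to compute the readability indices, but both rest on simple surface statistics of the text. A minimal Python sketch applying the standard FKGL and Coleman–Liau formulas, with a naive vowel-group syllable heuristic (an assumption for illustration, not the study's method), might look like:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: each run of consecutive vowels counts as one
    # syllable, with a minimum of one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    n_letters = sum(len(w) for w in words)
    n_syllables = sum(count_syllables(w) for w in words)

    # Flesch-Kincaid Grade Level:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    fkgl = 0.39 * (n_words / sentences) + 11.8 * (n_syllables / n_words) - 15.59

    # Coleman-Liau Index: 0.0588*L - 0.296*S - 15.8, where L is letters
    # per 100 words and S is sentences per 100 words.
    L = n_letters / n_words * 100
    S = sentences / n_words * 100
    cli = 0.0588 * L - 0.296 * S - 15.8

    return {"FKGL": round(fkgl, 2), "CLI": round(cli, 2)}

print(readability("MOGAD is a rare autoimmune condition. Treatment often "
                  "includes corticosteroids and other immunotherapies."))
```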
Results:
Significant differences were observed across platforms in DISCERN, PEMAT-P, and WRR scores (all p < 0.001). Paid chatbots demonstrated higher treatment quality (p = 0.020) and greater citation transparency (p = 0.001) than free chatbots. Search-based models produced more understandable responses than conversation-based ones (p = 0.035). However, no chatbot response met the recommended readability threshold for public-facing health communication (FKGL < 8). Inter-rater agreement was excellent across all expert-rated measures (ICC ≥ 0.838).
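For context, two-way ICCs of this kind are available in standard statistical packages. A hypothetical sketch using the open-source pingouin library (the manuscript does not specify which software was used, and the ratings below are invented purely for illustration):

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: one row per (chatbot, rater) pair.
df = pd.DataFrame({
    "chatbot": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "rater":   ["r1", "r2", "r3"] * 3,
    "discern": [58, 61, 57, 42, 45, 41, 70, 68, 72],
})

# Returns a table of ICC variants (ICC1..ICC3k) with 95% CIs.
icc = pg.intraclass_corr(data=df, targets="chatbot",
                         raters="rater", ratings="discern")
print(icc[["Type", "ICC", "CI95%"]])
```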
Conclusions:
AI chatbot responses to patient queries about MOGAD vary widely in quality, clarity, and transparency. These findings highlight the need for structured benchmarking, transparent evaluation frameworks, and thoughtful oversight in the use of generative AI tools for digital health communication, particularly in the context of rare and clinically complex diseases.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.