JMIR Preprints #103394: A CHART-Informed Evaluation of Six Large Language Model Platforms for Post-Stroke Patient Education: Safety, Quality, Readability, and Transparency of Single-Turn Responses

Nini Zhu;
Ximan Yao;
Liheng Zhou

ABSTRACT

Background:

Stroke survivors have substantial post-discharge information needs, yet the performance of large language model (LLM) platforms in post-stroke patient education remains unclear. This study evaluated the safety, accuracy, empathy, quality, and readability of six publicly available LLM platforms in answering post-stroke health-related questions.

Objective:

This study aimed to evaluate and compare the safety, accuracy, empathy, information quality, readability, and transparency of single-turn responses generated by six large language model platforms to patient-informed post-stroke health-related questions, using a CHART-informed reporting framework.

Methods:

This single-round, cross-sectional comparative study was conducted from April to May 2026 and reported according to the Chatbot Assessment Reporting Tool (CHART) statement. A researcher-developed, patient-informed question bank containing 62 English-language post-stroke health questions across 10 domains was constructed using patient interviews, Google Trends data, and stroke-related evidence-based literature. On May 10, 2026, each question was submitted once in a new independent session to ChatGPT-5.5, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek-V4, Qwen 3.6 Plus, and ERNIE Bot 5.0, yielding 372 responses (62 questions × 6 platforms). Five blinded expert raters evaluated safety, accuracy, empathy, information reliability, patient education quality, transparency, and overall quality using predefined criteria, DISCERN, EQIP, JAMA benchmark criteria, and the Global Quality Scale. Readability was assessed separately using six formula-based indices.

Results:

Accuracy scores did not differ significantly across platforms (P = 0.279; Kendall’s W = 0.020). Safe response rates differed overall (Cochran’s Q = 18.857, P = 0.002), ranging from 87.10% for Gemini 3.1 Pro to 100.00% for ERNIE Bot 5.0; however, adjusted pairwise McNemar tests showed no significant differences between individual platforms. Significant platform-level differences were observed in empathy, DISCERN, EQIP, JAMA, Global Quality Scale scores, and readability indices. Qwen 3.6 Plus and ERNIE Bot 5.0 generally showed higher descriptive performance in empathy and patient education quality, whereas DeepSeek-V4 and Claude Sonnet 4.6 showed more favorable formula-based readability. Inter-rater reliability was good, with Fleiss’ κ = 0.843 for safety and intraclass correlation coefficients of 0.821–0.870 for other main metrics.
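
For readers unfamiliar with the omnibus test reported for safe response rates, the following is a minimal sketch of how a Cochran's Q statistic can be computed from a questions-by-platforms table of binary safe/unsafe ratings. It is an illustration only and not the authors' analysis code: the function name `cochran_q` and the `toy` table are hypothetical, and the toy data are synthetic rather than taken from the study.

```python
from typing import List, Tuple


def cochran_q(table: List[List[int]]) -> Tuple[float, int]:
    """Return (Q, degrees of freedom) for a 0/1 table.

    Rows are matched blocks (here: questions); columns are treatments
    (here: platforms). 1 = response rated safe, 0 = rated unsafe.
    """
    if not table or any(len(row) != len(table[0]) for row in table):
        raise ValueError("table must be a non-empty rectangular list of rows")
    k = len(table[0])  # number of platforms
    if k < 2:
        raise ValueError("at least two columns are required")
    if any(value not in (0, 1) for row in table for value in row):
        raise ValueError("all entries must be 0 or 1")

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)  # total number of 'safe' ratings

    # The denominator is zero when every row is all 0s or all 1s,
    # i.e. no question discriminates between platforms.
    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined: no within-row variation")

    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1


# Synthetic example: 4 questions x 3 platforms (NOT study data).
toy = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
q, df = cochran_q(toy)
print(f"Q = {q:.3f}, df = {df}")  # prints "Q = 4.667, df = 2"
```

Under the null hypothesis, Q is approximately chi-square distributed with k − 1 degrees of freedom, so a comparison of six platforms would use 5 degrees of freedom. If SciPy is available, a P value can be obtained with `scipy.stats.chi2.sf(q, df)`. A significant omnibus Q does not identify which platforms differ, which is why pairwise follow-up tests with multiplicity adjustment can still all be nonsignificant.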

Conclusions:

In this exploratory single-round English-language evaluation, six publicly available LLM platforms generated generally high expert-rated accuracy for post-stroke patient education questions but differed in safety classifications, empathy, information quality, transparency, and formula-based readability. Because outputs were generated once per question under retrieval-enabled platform conditions, the findings should be interpreted as time-sensitive text-performance results rather than evidence of real-world clinical safety or effectiveness. Professionally reviewed LLM-generated text may support low-risk patient education, but it should not replace clinician judgment or guide urgent symptoms, medication adjustment, dysphagia management, fall prevention, or individualized rehabilitation planning.

Citation

Please cite as:

Zhu N, Yao X, Zhou L

A CHART-Informed Evaluation of Six Large Language Model Platforms for Post-Stroke Patient Education: Safety, Quality, Readability, and Transparency of Single-Turn Responses

JMIR Preprints. 02/06/2026:103394

DOI: 10.2196/preprints.103394

URL: https://preprints.jmir.org/preprint/103394

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently submitted to: JMIR Medical Informatics

Date Submitted: Jun 2, 2026

Open Peer Review Period: Jun 17, 2026 - Aug 12, 2026

(currently open for review)
