Currently submitted to: Journal of Medical Internet Research

Date Submitted: Jun 2, 2026
Open Peer Review Period: Jun 3, 2026 - Jul 29, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Evaluating Search-Enabled Large Language Model Interfaces for Medication Counseling in Secondary Stroke Prevention: A Multi-Metric Comparative Study

  • Zhi Wang
  • Yi Zhu
  • Lan Xu
  • Jingshi Wang

ABSTRACT

Background:

Large language models (LLMs) are increasingly used by patients seeking medication advice. Their quality for secondary stroke prevention counseling has not been well characterized.

Objective:

To compare five widely used search-enabled consumer LLM interfaces on patient-facing medication counseling for secondary stroke prevention across fourteen evaluation metrics covering safety, clinical accuracy, information quality, readability, empathy, actionability, and model test-retest stability, operationalized as lexical text stability.

Methods:

A 56-item English-language question bank was developed from current stroke prevention guidelines and submitted to five consumer LLM interfaces (ChatGPT, Claude, Gemini, DeepSeek, Doubao) via their official web interfaces on May 1, 2026, with repeat querying on May 8, 2026 to assess model test-retest stability. All systems were accessed using a logged-in account with web search enabled via a US-based connection. Responses were independently rated by two blinded raters. Non-parametric tests with Benjamini-Hochberg correction were applied.
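For readers unfamiliar with the Benjamini-Hochberg correction named above, the step-up procedure can be sketched as follows. This is a minimal illustration, not the authors' analysis code; the function name `benjamini_hochberg` and the example p-values are assumptions made here for demonstration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative sketch only: the function name and interface are
    assumptions, not taken from the study.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values from four pairwise comparisons.
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

A comparison is then declared significant when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.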

Results:

Clinical accuracy was high and uniform across models (mean 4.44-4.52/5; Friedman p = 0.578). Gemini, DeepSeek, and Doubao scored significantly higher on EQIP (70.2-70.7 vs. 63.6-64.1; p < 0.001) and DISCERN (p < 0.001) than ChatGPT and Claude. All models substantially exceeded commonly used patient-education readability benchmarks (FKGL 11.1-14.4; benchmark <=6; FRES 33.4-46.8; benchmark >=60). ChatGPT had the highest unsafe response rate (14.3% vs. 7.1-10.7%).
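The two readability indices reported above follow the standard Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES) formulas. The sketch below applies those formulas to precomputed counts and checks them against the benchmarks quoted in the Results (FKGL <=6, FRES >=60). The function names and the example counts are assumptions for illustration; the abstract does not say which tool the authors used, and syllable counts in particular vary between tools.

```python
def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fres(words, sentences, syllables):
    """Flesch Reading Ease Score from raw counts (standard formula)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def meets_benchmarks(words, sentences, syllables):
    """Check the patient-education benchmarks quoted in the Results."""
    return (fkgl(words, sentences, syllables) <= 6
            and fres(words, sentences, syllables) >= 60)


if __name__ == "__main__":
    # Hypothetical counts for one response: 100 words, 5 sentences,
    # 150 syllables. These numbers are not from the study.
    print(round(fkgl(100, 5, 150), 2))
    print(round(fres(100, 5, 150), 3))
    print(meets_benchmarks(100, 5, 150))
```

Higher FKGL means harder text, while higher FRES means easier text, which is why the two benchmarks point in opposite directions.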

Conclusions:

In this controlled evaluation of researcher-generated questions, the tested search-enabled LLM interfaces produced broadly accurate responses for secondary stroke prevention medication counseling, but weaknesses in readability, source transparency, and safety indicate that readability optimization, source-attribution prompting, and clinical review are needed before patient-facing use.


Citation

Please cite as:

Wang Z, Zhu Y, Xu L, Wang J

Evaluating Search-Enabled Large Language Model Interfaces for Medication Counseling in Secondary Stroke Prevention: A Multi-Metric Comparative Study

JMIR Preprints. 02/06/2026:103308

DOI: 10.2196/preprints.103308

URL: https://preprints.jmir.org/preprint/103308

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.