
Currently submitted to: JMIR Medical Informatics

Date Submitted: Jan 7, 2026

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Mapping the Reliability–Readability Gap in AMD Patient Education Across Six Large Language Models

  • Zhili Lu; 
  • Haixing Cao; 
  • Cong Ma; 
  • Jin Zheng; 
  • Xiang Ma

ABSTRACT

Background:

Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss globally, requiring patients to understand complex, long-term management plans. While large language models (LLMs) offer a scalable solution for patient education, their direct clinical use is hindered by two critical gaps: excessive reading difficulty (output often at high-school or college level, far above the recommended 6th-grade standard) and an apparent trade-off whereby more reliable, comprehensive outputs tend to be less readable. Current evaluations lack head-to-head comparisons of state-of-the-art models in specialized domains like AMD under realistic “zero-shot” conditions that mimic patient queries. This study systematically benchmarks six leading LLMs to quantify this reliability–readability gap, providing an evidence base for the safe, informed integration of AI into ophthalmic patient communication.

Objective:

To address the critical gap between scalable AI communication and clinically safe patient education, this study aimed to benchmark state-of-the-art LLMs for AMD by jointly quantifying informational reliability and linguistic readability under a realistic zero-shot (naïve user) query scenario—a key but under-evaluated setting for clinical deployment.

Methods:

Thirty AMD-related patient questions were curated from Google Trends (Oct 10, 2020–Oct 10, 2025), the 2023 Chinese AMD guideline, and the 2024 AAO recommendations. Each question was entered verbatim into six publicly available LLMs (ChatGPT-5.1-auto, DeepSeek-v3.2, Gemini-2.5-Flash-Thinking, Grok 4, Claude-Sonnet 4.5, Qwen3-Max) during Oct 10–Nov 25, 2025. Two senior ophthalmologists, blinded to model identity, independently scored all responses using DISCERN, EQIP, GQS, and JAMA criteria, with adjudication for disagreements. Readability was assessed using six standard formulas against a ≤6th-grade benchmark. Between-model differences were analyzed using Friedman tests with Holm-adjusted pairwise comparisons.
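The readability benchmark and the Holm-adjusted comparisons described above can be sketched as follows. This is a minimal illustration, not the study's analysis pipeline: the syllable counter is a crude heuristic (published work typically uses validated tools such as the textstat library), only two of the six readability formulas are shown, and the omnibus Friedman test would in practice be run with scipy.stats.friedmanchisquare on the per-question scores of all six models.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; real analyses use validated tools."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # drop a typical silent final 'e'
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) from the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # words per sentence
    spw = syllables / len(words)               # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease Score
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fres, fkgl

def holm_adjust(pvals: list[float]) -> list[float]:
    """Holm step-down adjustment for a family of pairwise p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min((m - rank) * pvals[i], 1.0))
        adjusted[i] = running
    return adjusted

# A response meets the study's target only if its grade level is <= 6.
fres, fkgl = readability("AMD damages the central part of the retina. "
                         "It blurs the middle of what you see.")
print(f"FRES={fres:.1f}, FKGL={fkgl:.1f}, meets target: {fkgl <= 6}")
```

The same ≤6th-grade cutoff would then be applied to each model's mean grade-level score, and `holm_adjust` applied to the p-values of all pairwise model comparisons following a significant Friedman test.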

Results:

Analysis of 180 responses revealed substantial to near-perfect inter-rater agreement (κ=0.72–0.97). Critically, no model met the recommended ≤6th-grade readability target, and a clear reliability–readability trade-off was observed: Grok 4 achieved the highest reliability (DISCERN 46.40±7.43; EQIP 74.33±9.07) while DeepSeek-v3.2 generated the most readable text (FRES 48.23±9.16; FKGL 9.95±1.87). Between-model differences were significant across all metrics (all P<.001), underscoring performance as model-dependent and clinically variable.

Conclusions:

Under zero-shot conditions, current LLMs cannot simultaneously meet the dual standards of high reliability and guideline-level readability required for direct AMD patient education. These findings mandate clinician-supervised model selection, deliberate readability optimization, and the development of integrated human–AI workflows prior to any patient-facing use.


 Citation

Please cite as:

Lu Z, Cao H, Ma C, Zheng J, Ma X

Mapping the Reliability–Readability Gap in AMD Patient Education Across Six Large Language Models

JMIR Preprints. 07/01/2026:91016

DOI: 10.2196/preprints.91016

URL: https://preprints.jmir.org/preprint/91016




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.