
Currently submitted to: JMIR Medical Informatics

Date Submitted: Jan 7, 2026

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Mapping the Reliability–Readability Gap in AMD Patient Education Across Six Large Language Models

  • Zhili Lu; 
  • Haixing Cao; 
  • Cong Ma; 
  • Jin Zheng; 
  • Xiang Ma

ABSTRACT

Background:

Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss globally, requiring patients to understand complex, long-term management plans. While large language models (LLMs) offer a scalable solution for patient education, their direct clinical use is hindered by two critical gaps: excessive reading difficulty (output often at high-school or college level, far above the recommended 6th-grade standard) and an apparent trade-off whereby more reliable, comprehensive outputs tend to be less readable. Current evaluations lack head-to-head comparisons of state-of-the-art models in specialized domains like AMD under realistic “zero-shot” conditions that mimic patient queries. This study systematically benchmarks six leading LLMs to quantify this reliability–readability gap, providing an evidence base for the safe, informed integration of AI into ophthalmic patient communication.

Objective:

To address the critical gap between scalable AI communication and clinically safe patient education, this study aimed to benchmark state-of-the-art LLMs for AMD by jointly quantifying informational reliability and linguistic readability under a realistic zero-shot (naïve user) query scenario—a key but under-evaluated setting for clinical deployment.

Methods:

Thirty AMD-related patient questions were curated from Google Trends (Oct 10, 2020–Oct 10, 2025), the 2023 Chinese AMD guideline, and the 2024 AAO recommendations. Each question was entered verbatim into six publicly available LLMs (ChatGPT-5.1-auto, DeepSeek-v3.2, Gemini-2.5-Flash-Thinking, Grok 4, Claude-Sonnet 4.5, Qwen3-Max) during Oct 10–Nov 25, 2025. Two senior ophthalmologists, blinded to model identity, independently scored all responses using DISCERN, EQIP, GQS, and JAMA criteria, with adjudication for disagreements. Readability was assessed using six standard formulas against a ≤6th-grade benchmark. Between-model differences were analyzed using Friedman tests with Holm-adjusted pairwise comparisons.
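The readability benchmark and the Holm-adjusted comparisons described above can be sketched as follows. This is a minimal illustration, not the study's analysis pipeline: the syllable counter is a crude heuristic (published work typically uses validated tools such as the textstat library), only two of the six readability formulas are shown, and the omnibus Friedman test would in practice be run with scipy.stats.friedmanchisquare on the per-question scores of all six models.

```python
import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; real analyses use validated tools."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # drop a typical silent final 'e'
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) from the standard Flesch formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)          # words per sentence
    spw = syllables / len(words)               # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease Score
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fres, fkgl

def holm_adjust(pvals: list[float]) -> list[float]:
    """Holm step-down adjustment for a family of pairwise p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min((m - rank) * pvals[i], 1.0))
        adjusted[i] = running
    return adjusted

# A response meets the study's target only if its grade level is <= 6.
fres, fkgl = readability("AMD damages the central part of the retina. "
                         "It blurs the middle of what you see.")
print(f"FRES={fres:.1f}, FKGL={fkgl:.1f}, meets target: {fkgl <= 6}")
```

The same ≤6th-grade cutoff would then be applied to each model's mean grade-level score, and `holm_adjust` applied to the p-values of all pairwise model comparisons following a significant Friedman test.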

Results:

Analysis of 180 responses revealed substantial to near-perfect inter-rater agreement (κ=0.72–0.97). Critically, no model met the recommended ≤6th-grade readability target, and a clear reliability–readability trade-off was observed: Grok 4 achieved the highest reliability (DISCERN 46.40±7.43; EQIP 74.33±9.07) while DeepSeek-v3.2 generated the most readable text (FRES 48.23±9.16; FKGL 9.95±1.87). Between-model differences were significant across all metrics (all P<.001), underscoring performance as model-dependent and clinically variable.

Conclusions:

Under zero-shot conditions, current LLMs cannot simultaneously meet the dual standards of high reliability and guideline-level readability required for direct AMD patient education. These findings mandate clinician-supervised model selection, deliberate readability optimization, and the development of integrated human–AI workflows prior to any patient-facing use.


 Citation

Please cite as:

Lu Z, Cao H, Ma C, Zheng J, Ma X

Mapping the Reliability–Readability Gap in AMD Patient Education Across Six Large Language Models

JMIR Preprints. 07/01/2026:91016

DOI: 10.2196/preprints.91016

URL: https://preprints.jmir.org/preprint/91016




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.