JMIR Preprints #93054: Evaluation of Five Large Language Models for Parental Education in Pediatric Anesthesia: Reliability and Readability Study

Fulin Pu; Jishuang Hong; Xiaoying Wei; Yanling Chen

ABSTRACT

Background:

Large language models (LLMs) are increasingly used in healthcare to generate detailed medical responses. However, it remains unclear whether they provide reliable and readable information about pediatric anesthesia.

Objective:

To evaluate the reliability and readability of LLM responses to parental inquiries regarding pediatric anesthesia.

Methods:

On December 14, 2025, five LLMs (DeepSeek-V3.2, ChatGPT-5, Gemini 2.5 Flash, Copilot, and Perplexity) were evaluated through their official web-based interfaces. Thirty-three parental inquiries drawn from multiple authoritative sources were submitted as zero-shot prompts to generate responses. Two blinded senior anesthesiologists independently assessed reliability using the DISCERN instrument, the Ensuring Quality Information for Patients (EQIP) tool, the Journal of the American Medical Association (JAMA) benchmark, and the Global Quality Score (GQS). Readability was evaluated using six automated indices.
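The abstract does not name the six readability indices. As an illustration only, the sketch below computes one widely used index, the Flesch-Kincaid Grade Level, and compares it with the sixth-grade target mentioned in the Results. The choice of index, the function names, and the vowel-run syllable counter are assumptions made for this example and are not taken from the study; published tools use more careful syllable counting, so absolute values will differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption of this sketch): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def exceeds_sixth_grade(text: str, threshold: float = 6.0) -> bool:
    """True when the estimated grade level is above the sixth-grade target."""
    return flesch_kincaid_grade(text) > threshold


if __name__ == "__main__":
    # Hypothetical response text, not taken from any evaluated model.
    sample = "Preoperative fasting recommendations minimize pulmonary aspiration during anesthesia."
    print(round(flesch_kincaid_grade(sample), 2), exceeds_sixth_grade(sample))
```

In practice each model response would be scored with several such indices, since a single formula is sensitive to sentence splitting and syllable estimation.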

Results:

Perplexity showed superior reliability on DISCERN (median 41; P<.05), yet no model achieved a “good” rating. Crucially, qualitative analysis revealed safety hazards, such as Perplexity’s misleading binary summary regarding breastfeeding, which contradicted preoperative fasting protocols. Gemini exhibited structural-quality dissociation, achieving the highest EQIP score (median 90; P<.001) despite a lower GQS (median 3). Transparency was universally poor (JAMA median ≤1), with DeepSeek and ChatGPT showing a “floor effect”. ChatGPT had the best readability, but all models exceeded the recommended sixth-grade reading level.

Conclusions:

Current LLMs are insufficient as standalone resources for parental education. Structural-quality dissociation, poor transparency, and poor readability pose safety risks. Consequently, strict review by clinical professionals is mandatory until future models both ensure clinical reliability and optimize patient-centered readability.

Citation

Please cite as:

Pu F, Hong J, Wei X, Chen Y

Evaluation of Five Large Language Models for Parental Education in Pediatric Anesthesia: Reliability and Readability Study

JMIR Med Inform 2026;14:e93054

DOI: 10.2196/93054

PMID: 42314147

PMCID: 13278617

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Feb 7, 2026

Date Accepted: May 18, 2026
