Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Feb 7, 2026
Date Accepted: May 18, 2026

The final, peer-reviewed published version of this preprint can be found here:

Pu F, Hong J, Wei X, Chen Y

Evaluation of Five Large Language Models for Parental Education in Pediatric Anesthesia: Reliability and Readability Study

JMIR Med Inform 2026;14:e93054

DOI: 10.2196/93054

PMID: 42314147

Evaluation of Five Large Language Models for Parental Education in Pediatric Anesthesia: Reliability and Readability Study

  • Fulin Pu
  • Jishuang Hong
  • Xiaoying Wei
  • Yanling Chen

ABSTRACT

Background:

Large language models (LLMs) are increasingly used in healthcare to generate detailed medical responses. However, their ability to provide reliable and readable information about pediatric anesthesia remains unclear.

Objective:

To evaluate the reliability and readability of LLM responses to parental inquiries regarding pediatric anesthesia.

Methods:

On December 14, 2025, five LLMs (DeepSeek-V3.2, ChatGPT-5, Gemini 2.5 Flash, Copilot, and Perplexity) accessed via official web-based interfaces were evaluated. Thirty-three parental inquiries from multiple authoritative sources were used for zero-shot prompting to generate responses. Two blinded senior anesthesiologists independently assessed the reliability using the DISCERN instrument, Ensuring Quality Information for Patients (EQIP) tool, Journal of the American Medical Association (JAMA) benchmark, and Global Quality Score (GQS). Readability was evaluated using six automated indices.
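As an illustration of what an automated readability index computes, the sketch below implements the Flesch-Kincaid Grade Level and checks a response against the sixth-grade threshold mentioned in the Results. The abstract does not name the six indices used, so the choice of this index is an assumption, as are the function names and the crude vowel-group syllable counter; dedicated readability tools use more careful tokenization and may give different values.

```python
import re

SIXTH_GRADE = 6.0  # recommended reading level cited in the abstract


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups and
    discount a trailing silent 'e'. Not a dictionary-based count."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def exceeds_recommended_level(text: str, threshold: float = SIXTH_GRADE) -> bool:
    """True when the estimated grade level is above the threshold."""
    return flesch_kincaid_grade(text) > threshold
```

Short sentences built from short words score low, while a single long sentence of polysyllabic clinical terms scores far above grade 6, which is the pattern the readability comparison is designed to detect.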

Results:

Perplexity showed superior reliability on DISCERN (median 41; P<.05), yet no model achieved a “good” rating. Crucially, qualitative analysis revealed safety hazards, such as Perplexity’s misleading binary summary regarding breastfeeding, which contradicted preoperative fasting protocols. Gemini exhibited structural-quality dissociation, achieving the highest EQIP (median 90; P<.001) despite lower GQS (median 3). Transparency was universally poor (JAMA median ≤1), with DeepSeek and ChatGPT showing a “floor effect”. ChatGPT had superior readability, but all models exceeded the recommended sixth-grade complexity level.

Conclusions:

Current LLMs are insufficient as standalone resources. Structural-quality dissociation, poor transparency, and poor readability pose safety risks. Consequently, strict review by clinical professionals is mandatory until future models can both ensure clinical reliability and optimize patient-centered readability.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.