Currently submitted to: JMIR AI
Date Submitted: Apr 21, 2026
Open Peer Review Period: Apr 28, 2026 - Jun 23, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Clinical Accuracy and Safety of a Locally Hosted Large Language Model for Pediatric E-Consults: A Blinded Multi-Subspecialty Evaluation
ABSTRACT
Background:
Electronic consultations (e-consults) improve access to pediatric subspecialty care, particularly in rural settings, but rising consult volume contributes to subspecialist documentation burden, creating interest in whether large language models can safely assist with draft response generation.
Objective:
To evaluate the clinical utility, safety, and accuracy of a locally hosted, open-source large language model (LLM) in drafting pediatric subspecialty e-consult responses.
Methods:
We compared AI-generated consult drafts (Qwen3-30B, hospital-hosted) with human subspecialist-written e-consults for 50 real pediatric cases. Blinded pediatric subspecialists (n=50 case ratings) and generalists (n=20 case ratings) assessed accuracy, appropriateness, communication quality, and safety using structured rating instruments. Reviewer free-text comments underwent thematic analysis.
Results:
Among 50 cases, 60% of AI-generated drafts were rated as providing reasonable medical advice, compared with 98% of physician-authored consults. False statements were identified in 39% of AI drafts, incorrect details in 58%, and potentially harmful omissions in 30%. Despite these errors, 70% of AI drafts were considered safe and potentially useful as initial drafts under specialist oversight. Performance varied by subspecialty: neurology drafts were most frequently rated reasonable (90%), whereas infectious disease and endocrinology drafts were rated lower (40%-60%). Generalists found AI drafts understandable and comfortable to act upon in 80% of cases.
Conclusions:
While locally hosted LLMs show promise as drafting assistants to improve efficiency, high rates of clinical inaccuracies preclude their autonomous use. Specialty-specific guardrails and rigorous human oversight remain essential for safe implementation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.