Currently submitted to: JMIR AI

Date Submitted: Apr 21, 2026
Open Peer Review Period: Apr 28, 2026 - Jun 23, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Clinical Accuracy and Safety of a Locally Hosted Large Language Model for Pediatric E-Consults: A Blinded Multi-Subspecialty Evaluation

  • Marleah Knights; 
  • Amna Umer; 
  • David Rich; 
  • Juggy Jaganathan; 
  • Lee Pyles; 
  • Michael Sweetman; 
  • Audra Rouster; 
  • David Huss; 
  • Lawrence Morton; 
  • Matthew Thomas; 
  • Rouba Sayegh; 
  • Brian Ely; 
  • Evan Jones; 
  • Jai Udassi; 
  • Rafka Chaiban; 
  • Collin John; 
  • Bryce Harvey; 
  • Charles Jacob Mullett

ABSTRACT

Background:

Electronic consultations (e-consults) improve access to pediatric subspecialty care, particularly in rural settings, but rising consult volume contributes to subspecialist documentation burden, creating interest in whether large language models can safely assist with draft response generation.

Objective:

To evaluate the clinical utility, safety, and accuracy of a locally hosted, open-source large language model (LLM) in drafting pediatric subspecialty e-consult responses.

Methods:

We compared AI-generated consult drafts (Qwen3-30B, hospital-hosted) with human subspecialist-written e-consults for 50 real pediatric cases. Blinded pediatric subspecialists (n=50 case ratings) and generalists (n=20 case ratings) assessed accuracy, appropriateness, communication quality, and safety using structured rating instruments. Reviewer free-text comments underwent thematic analysis.

Results:

Among 50 cases, 60% of AI-generated drafts were rated as reasonable medical advice compared with 98% of physician-authored consults. False statements were identified in 39% of AI drafts, incorrect details in 58%, and potentially harmful omissions in 30%. Despite these errors, 70% of AI drafts were considered safe and potentially useful as initial drafts under specialist oversight. Performance varied by subspecialty, with neurology drafts most frequently rated reasonable (90%) and infectious disease and endocrinology drafts rated lower (40%-60%). Generalists found AI drafts understandable and comfortable to act upon in 80% of cases.

Conclusions:

While locally hosted LLMs show promise as drafting assistants to improve efficiency, high rates of clinical inaccuracies preclude their autonomous use. Specialty-specific guardrails and rigorous human oversight remain essential for safe implementation.


Citation

Please cite as:

Knights M, Umer A, Rich D, Jaganathan J, Pyles L, Sweetman M, Rouster A, Huss D, Morton L, Thomas M, Sayegh R, Ely B, Jones E, Udassi J, Chaiban R, John C, Harvey B, Mullett CJ

Clinical Accuracy and Safety of a Locally Hosted Large Language Model for Pediatric E-Consults: A Blinded Multi-Subspecialty Evaluation

JMIR Preprints. 21/04/2026:98385

DOI: 10.2196/preprints.98385

URL: https://preprints.jmir.org/preprint/98385


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.