JMIR Preprints #82026: Multi-Evidence Clinical Reasoning With Retrieval-Augmented Generation for Emergency Triage: Retrospective Evaluation Study

Multi-Evidence Clinical Reasoning With Retrieval-Augmented Generation for Emergency Triage: Retrospective Evaluation Study

Hang Sheung Wong;
Tsz Kwan Wong

Background:

Emergency triage accuracy is critical but varies with clinician experience, cognitive load, and case complexity. Mis-triage can delay care for high-risk patients and exacerbate crowding through unnecessary prioritization. Large language models (LLMs) show promise as triage decision-support tools but are vulnerable to hallucinations. Retrieval-augmented generation (RAG) may improve reliability by grounding LLM reasoning in authoritative guidelines and real clinical cases.

Objective:

This study aimed to evaluate whether a dual-source RAG system that integrates guideline- and case-based evidence improves emergency triage performance versus a baseline LLM and to assess how closely its urgency assignments align with expert consensus and outcome-defined clinical severity.

Methods:

We developed a dual-source RAG system—Multi-Evidence Clinical Reasoning RAG (MECR-RAG)—that retrieves sections from the Hong Kong Accident and Emergency Triage Guidelines (HKAETG) and cases from a database of 3000 emergency department triage encounters. In a retrospective single‑center evaluation, MECR‑RAG and a prompt‑only baseline LLM (both DeepSeek‑V3) were tested on 236 routine triage encounters to predict 5‑level triage categories. Expert consensus reference labels were assigned by blinded senior triage nurses. Primary outcomes were quadratic weighted kappa (QWK) and accuracy versus consensus labels. Secondary analyses examined performance within 3 operationally and clinically relevant triage bands—immediate (categories 1 and 2), urgent (category 3), and nonurgent (categories 4 and 5). In 226 encounters with follow‑up, we also assigned outcome‑based severity tiers (R1-R3) using a published 3‑level urgency reference standard and defined a disposition‑safety composite.
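The agreement metric and the 3-band grouping described above can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the QWK is the standard Cohen kappa with quadratic disagreement weights over the 5 ordinal categories, and the definition of overtriage (predicted band more urgent than the reference band) and undertriage (predicted band less urgent) is an assumption based on common usage.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, n_categories=5):
    """Cohen's kappa with quadratic weights for ordinal labels 1..n_categories."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(1, n_categories + 1):
        for j in range(1, n_categories + 1):
            # Disagreement weight grows with the squared distance between categories.
            w = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += w * observed[(i, j)] / n
            weighted_expected += w * hist_a[i] * hist_b[j] / (n * n)
    if weighted_expected == 0:
        # Both raters used one identical category throughout; treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Band grouping taken from the Methods text; lower rank means more urgent.
BAND_RANK = {"immediate": 0, "urgent": 1, "nonurgent": 2}


def triage_band(category):
    """Map a 5-level triage category to one of the 3 bands."""
    if category in (1, 2):
        return "immediate"
    if category == 3:
        return "urgent"
    if category in (4, 5):
        return "nonurgent"
    raise ValueError(f"unknown triage category: {category}")


def band_error(predicted, reference):
    """Classify a prediction against the reference at band level (assumed definition)."""
    p = BAND_RANK[triage_band(predicted)]
    r = BAND_RANK[triage_band(reference)]
    if p < r:
        return "overtriage"
    if p > r:
        return "undertriage"
    return "concordant"


if __name__ == "__main__":
    reference = [1, 2, 3, 4, 5]   # toy labels, not study data
    predicted = [1, 2, 3, 4, 4]
    print(quadratic_weighted_kappa(predicted, reference))
    print([band_error(p, r) for p, r in zip(predicted, reference)])
```

In this sketch, a category 4 prediction for a category 5 reference lowers the QWK slightly but counts as concordant at band level, which is why the two analyses can diverge.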

Results:

MECR‑RAG achieved a mean QWK of 0.902 (SD 0.0021; 95% CI 0.901-0.904) and accuracy of 0.802 (SD 0.0082; 95% CI 0.795-0.808), outperforming the baseline LLM (QWK 0.801, SD 0.004; accuracy 0.542, SD 0.0073; both P<.001) and demonstrating expert‑comparable agreement with triage nurses (interrater QWK 0.887). In 3‑group analysis, MECR‑RAG reduced overtriage from 68/236 (28.8%) with the baseline LLM to 30/236 (12.7%) and kept undertriage low, at 3/236 (1.3%) versus 4/236 (1.7%) with the baseline LLM; the largest gains were in the diagnostically ambiguous yet operationally important categories 3 and 4. In a secondary outcome‑based analysis defining high‑severity courses as R1+R2, MECR‑RAG detected high-risk patients more sensitively than initial nurse triage (124/130, 95.4% vs 117/130, 90.0%; P=.02) while maintaining nurse‑level specificity. MECR‑RAG also yielded the lowest weighted harm index (13.7, 19.5, and 20.3 per 100 patients for MECR‑RAG, nurses, and the baseline LLM, respectively).
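The proportions reported above can be recomputed from the stated counts. The short sketch below is only an arithmetic check under the assumption that each percentage is the count divided by its denominator, rounded to 1 decimal place; the helper name `pct` and the dictionary keys are hypothetical labels, not terms from the study.

```python
def pct(numerator, denominator):
    """Percentage rounded to 1 decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts copied from the Results text: (numerator, denominator).
counts = {
    "overtriage_baseline": (68, 236),
    "overtriage_mecr_rag": (30, 236),
    "undertriage_higher": (4, 236),
    "undertriage_lower": (3, 236),
    "sensitivity_mecr_rag": (124, 130),
    "sensitivity_nurse": (117, 130),
}

percentages = {name: pct(num, den) for name, (num, den) in counts.items()}

# Absolute reduction in overtriage, in percentage points.
overtriage_reduction = pct(68 - 30, 236)

if __name__ == "__main__":
    for name, value in percentages.items():
        print(f"{name}: {value}%")
    print(f"overtriage reduction: {overtriage_reduction} percentage points")
```

Each recomputed value matches the percentage printed in the Results, and the overtriage reduction works out to 16.1 percentage points.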

Conclusions:

A dual‑source RAG triage system that combines guideline‑based rules with case‑based reasoning achieved expert‑comparable agreement, reduced overtriage, and better aligned urgency assignments than a prompt‑only baseline LLM. Secondary outcome–based analyses in this cohort suggested more favorable triage patterns than initial nurse triage, supporting MECR‑RAG as a concurrent decision‑support layer that flags discordant or high‑risk assignments; prospective multicenter implementation studies are needed to determine effects on emergency department crowding, delays, and patient outcomes.

Citation

Please cite as:

Wong HS, Wong TK

Multi-Evidence Clinical Reasoning With Retrieval-Augmented Generation for Emergency Triage: Retrospective Evaluation Study

JMIR Med Inform 2026;14:e82026

DOI: 10.2196/82026

PMID: 41587455

PMCID: 12887567

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 7, 2025

Open Peer Review Period: Aug 11, 2025 - Oct 6, 2025

Date Accepted: Dec 29, 2025
