Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Aug 7, 2025
Open Peer Review Period: Aug 11, 2025 - Oct 6, 2025
Date Accepted: Dec 29, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Multi-Evidence Clinical Reasoning with Retrieval-Augmented Generation (MECR-RAG) for Emergency Triage: Retrospective Evaluation Study
ABSTRACT
Background:
Emergency triage accuracy is vital yet varies significantly due to factors such as clinical experience, cognitive load, and symptom complexity. Inaccuracies can lead to critical consequences, including preventable morbidity, mortality, or resource misallocation. Large language models (LLMs) have shown potential in clinical decision-making but risk generating inaccurate outputs. Retrieval-augmented generation (RAG) systems dynamically retrieve and incorporate external authoritative information to enhance LLM reliability. Previous studies of LLMs in emergency triage have typically relied on structured datasets or textbook-derived inputs, or have lacked independently adjudicated ground truth, limiting their external validity.
Objective:
To evaluate whether a dual-source RAG system integrating procedural and experiential clinical knowledge improves the accuracy and consistency of emergency triage classification compared to baseline LLMs.
Methods:
We developed and evaluated a novel dual-source RAG architecture—Multi-Evidence Clinical Reasoning RAG (MECR-RAG)—that combines the Hong Kong Accident and Emergency Triage Guidelines (HKAETG) with a structured database of 3,000 real-world triage cases from 2024. The system, implemented using DeepSeek-V3, was retrospectively assessed on 236 real clinical triage records sampled across a calendar year. Gold-standard labels were assigned through blinded consensus by senior triage nurses. Model performance was benchmarked against a prompt-only LLM baseline and evaluated using quadratic weighted kappa (QWK), accuracy, and triage group–specific classification metrics including precision, recall, and F1 score.
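The abstract does not specify the retrieval mechanics, but the dual-source design it describes — pulling evidence from both the guideline corpus and the historical case database and merging it into the model's context — can be sketched in minimal form. Everything below is an illustrative assumption: the toy lexical similarity function, the `build_context` name, and the section labels are not from the paper, which would in practice use embedding-based retrieval.

```python
def score(query, doc):
    """Toy lexical similarity: fraction of query tokens that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def build_context(query, guideline_passages, past_cases, k=2):
    """Retrieve top-k entries from each evidence source and merge into one prompt context."""
    top_guidelines = sorted(guideline_passages, key=lambda d: score(query, d), reverse=True)[:k]
    top_cases = sorted(past_cases, key=lambda d: score(query, d), reverse=True)[:k]
    return ("GUIDELINE EVIDENCE:\n" + "\n".join(top_guidelines)
            + "\n\nSIMILAR PAST CASES:\n" + "\n".join(top_cases))
```

The key design point mirrored here is that procedural knowledge (guidelines) and experiential knowledge (past cases) are retrieved independently, so a strong match in one source cannot crowd the other out of the context window.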
Results:
MECR-RAG achieved a mean QWK of 0.902 (95% CI: 0.901–0.904) and mean accuracy of 0.802 (95% CI: 0.795–0.808), significantly outperforming the baseline LLM (QWK = 0.801; accuracy = 0.542; P<.001). Its agreement was non-inferior to professional raters (QWK = 0.887). The MECR-RAG system achieved an overall F1 score of 0.860, reduced overtriage from 28.8% to 12.7%, and slightly lowered undertriage from 1.7% to 1.3%. The greatest performance gains were observed in Categories 3 and 4, which are the most diagnostically ambiguous and operationally impactful tiers.
Conclusions:
MECR-RAG demonstrates expert-comparable triage accuracy by integrating guideline knowledge with case-based reasoning. This study is the first to evaluate a dual-source RAG-enhanced LLM on real triage documentation with expert consensus labels, offering a methodologically rigorous and clinically grounded approach to decision support in emergency medicine.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.