Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 2, 2023
Open Peer Review Period: Oct 2, 2023 - Nov 27, 2023
Date Accepted: May 14, 2024
Triage Performance Across Large Language Models, ChatGPT and Untrained Doctors: A Comparative Study in Emergency Medicine
ABSTRACT
Background:
Large language models have demonstrated impressive performance across various medical domains, prompting an exploration of their potential utility in the high-demand setting of emergency department triage. This study evaluates the triage proficiency of ChatGPT, a large language model, compared with professionally trained emergency department staff and untrained personnel. We further explore whether large language model responses could guide untrained staff in effective triage.
Objective:
To assess the efficacy of ChatGPT in emergency department triage compared to professionally trained emergency department staff and untrained personnel, and to investigate if the model's responses can enhance the triage proficiency of untrained personnel.
Methods:
Using a design resembling a cohort study, we assessed anonymized case vignettes based on a day's worth of emergency cases. These were triaged according to the Manchester Triage System by untrained staff, different versions of ChatGPT, and professionally trained raters, who subsequently agreed on a consensus set. The vignettes were adapted from cases at a tertiary emergency department in Germany. A total of 124 prototypical patient vignettes were used, with demographic characteristics altered where irrelevant to the diagnosis or treatment. The main outcome was the level of agreement between raters' Manchester Triage System level assignments, measured via quadratic weighted Cohen's kappa (κ). The extent of over- and undertriage was also determined.
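As an aside for readers unfamiliar with the metric, quadratic weighted Cohen's kappa penalizes disagreements by the square of their distance on the ordinal scale, so a one-level triage discrepancy counts far less than a four-level one. A minimal sketch of how such agreement could be computed (not the authors' analysis code; the rater data below are invented for illustration) using scikit-learn:

```python
# Illustrative only: quadratic weighted Cohen's kappa between two raters'
# Manchester Triage System levels (1 = immediate ... 5 = non-urgent).
# The level assignments below are hypothetical, not study data.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 2, 2, 3, 4, 5, 3, 2]  # e.g., consensus triage
rater_b = [1, 2, 3, 3, 4, 4, 3, 2]  # e.g., model or untrained-staff triage

# weights="quadratic" applies the (i - j)^2 disagreement penalty
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.2f}")
```

With this weighting, identical ratings yield κ = 1, while chance-level agreement yields κ ≈ 0, which is why values around 0.6-0.8 are conventionally read as "substantial" agreement.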
Results:
ChatGPT 4 and untrained staff showed substantial agreement with consensus triage (κ = 0.67 ± 0.037 and κ = 0.68 ± 0.056, respectively), significantly exceeding the performance of ChatGPT 3.5 (κ = 0.54 ± 0.024) but falling short of the professional raters. When untrained staff used the large language model for second-opinion triage, there was a slight but statistically insignificant performance increase (κ = 0.70 ± 0.047). ChatGPT models tended towards overtriage, while untrained staff undertriaged.
Conclusions:
While ChatGPT does not yet match professionally trained raters, its triage proficiency equals that of untrained emergency department staff. Notable performance enhancements in newer large language model versions hint at future improvements with further technological development and specific training. Further studies are needed to determine optimal large language model utilization within the emergency department, particularly regarding its potential as a second-opinion tool for experienced and inexperienced raters.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.