Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 2, 2023
Open Peer Review Period: Oct 2, 2023 - Nov 27, 2023
Date Accepted: May 14, 2024
Triage Performance Across Large Language Models, ChatGPT and Untrained Doctors: A Comparative Study in Emergency Medicine
ABSTRACT
Background:
Large language models have demonstrated impressive performance across various medical domains, prompting an exploration of their potential utility in the high-demand setting of emergency department triage. This study evaluates the triage proficiency of ChatGPT, a large language model, compared with professionally trained emergency department staff and untrained personnel. We further explore whether large language model responses could guide untrained staff in effective triage.
Objective:
To assess the efficacy of ChatGPT in emergency department triage compared to professionally trained emergency department staff and untrained personnel, and to investigate if the model's responses can enhance the triage proficiency of untrained personnel.
Methods:
Using a design resembling a cohort study, we assessed anonymized case vignettes based on a day's worth of emergency cases. These were triaged according to the Manchester Triage System by untrained staff, different versions of ChatGPT, and professionally trained raters, who subsequently agreed on a consensus set. The vignettes were adapted from cases at a tertiary emergency department in Germany. A total of 124 prototypical patient vignettes were used, with demographic characteristics altered where irrelevant to the diagnosis or treatment. The main outcome was the level of agreement between raters' Manchester Triage System level assignments, measured via quadratic weighted Cohen's kappa (κ). The extent of over- and undertriage was also determined.
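As an aside for readers unfamiliar with the metric, quadratic weighted Cohen's kappa penalizes disagreements by the square of their distance on the ordinal scale, so a one-level triage discrepancy counts far less than a four-level one. A minimal sketch of how such agreement could be computed (not the authors' analysis code; the rater data below are invented for illustration) using scikit-learn:

```python
# Illustrative only: quadratic weighted Cohen's kappa between two raters'
# Manchester Triage System levels (1 = immediate ... 5 = non-urgent).
# The level assignments below are hypothetical, not study data.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 2, 2, 3, 4, 5, 3, 2]  # e.g., consensus triage
rater_b = [1, 2, 3, 3, 4, 4, 3, 2]  # e.g., model or untrained-staff triage

# weights="quadratic" applies the (i - j)^2 disagreement penalty
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"quadratic weighted kappa = {kappa:.2f}")
```

With this weighting, identical ratings yield κ = 1, while chance-level agreement yields κ ≈ 0, which is why values around 0.6-0.8 are conventionally read as "substantial" agreement.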
Results:
ChatGPT 4 and untrained staff showed substantial agreement with consensus triage (κ = 0.67 ± 0.037 and κ = 0.68 ± 0.056, respectively), significantly exceeding the performance of ChatGPT 3.5 (κ = 0.54 ± 0.024) but falling short of the professional raters. When untrained staff used the large language model for second-opinion triage, there was a slight but statistically insignificant performance increase (κ = 0.70 ± 0.047). ChatGPT models tended towards overtriage, while untrained staff undertriaged.
Conclusions:
While ChatGPT does not yet match professionally trained raters, its triage proficiency equals that of untrained emergency department staff. Notable performance enhancements in newer large language model versions hint at future improvements with further technological development and specific training. Further studies are needed to determine optimal large language model utilization within the emergency department, particularly regarding its potential as a second-opinion tool for experienced and inexperienced raters.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.