Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jan 6, 2024
Date Accepted: May 8, 2024

The final, peer-reviewed published version of this preprint can be found here:

ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis

Hoppe JM, Auer MK, Strüven A, Massberg S, Stremmel C

J Med Internet Res 2024;26:e56110

DOI: 10.2196/56110

PMID: 38976865

PMCID: 11263899

ChatGPT-4 outperforms emergency department physicians in diagnostic accuracy: retrospective analysis

  • John Michael Hoppe; 
  • Matthias K. Auer; 
  • Anna Strüven; 
  • Steffen Massberg; 
  • Christopher Stremmel

ABSTRACT

Background:

OpenAI's Chat Generative Pretrained Transformer (ChatGPT) is a pioneering artificial intelligence (AI) system for natural language processing and offers significant potential in medicine, for example for treatment advice. Recent studies have also shown promising results using ChatGPT for emergency medicine triage; however, its diagnostic accuracy in the emergency department has not been evaluated.

Objective:

This study compares the diagnostic accuracy of ChatGPT versions 3.5 and 4 against primary treating resident physicians in an emergency room (ER) setting.

Methods:

Diagnostic accuracy was assessed in 100 adults admitted to our ER in January 2023 for internal medicine issues by comparing the diagnoses of ER resident physicians and of ChatGPT versions 3.5 and 4 against the final hospital discharge diagnosis, using a point system to grade accuracy.

Results:

The 100 enrolled patients (median age 72 years) were admitted to our internal medicine emergency department, primarily for cardiovascular, endocrine or gastrointestinal, and infectious diseases. ChatGPT-4 outperformed both ChatGPT-3.5 (p < 0.001) and ER resident physicians (p = 0.012) in diagnostic accuracy for internal medicine emergencies. Across the disease subgroups, ChatGPT-4 also consistently outperformed ChatGPT-3.5 and resident physicians, with statistically significant superiority in cardiovascular diseases (ChatGPT-4 vs. ER physicians: p = 0.029) and endocrine or gastrointestinal diseases (ChatGPT-4 vs. ChatGPT-3.5: p = 0.014); in the other categories, the differences were not statistically significant.

Conclusions:

In this study comparing the diagnostic accuracy of ChatGPT-3.5, ChatGPT-4, and ER resident physicians against a discharge diagnosis gold standard, ChatGPT-4 outperformed both the resident physicians and its predecessor, ChatGPT-3.5. Despite the study's retrospective design and limited sample size, its results underscore AI's potential as a supportive diagnostic tool in ER settings.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.