Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jan 6, 2024
Date Accepted: May 8, 2024

The final, peer-reviewed published version of this preprint can be found here:

ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis

Hoppe JM, Auer MK, Strüven A, Massberg S, Stremmel C

J Med Internet Res 2024;26:e56110

DOI: 10.2196/56110

PMID: 38976865

PMCID: 11263899

ChatGPT-4 outperforms emergency department physicians in diagnostic accuracy: retrospective analysis

  • John Michael Hoppe; 
  • Matthias K. Auer; 
  • Anna Strüven; 
  • Steffen Massberg; 
  • Christopher Stremmel

ABSTRACT

Background:

OpenAI's Chat Generative Pretrained Transformer (ChatGPT) is a pioneering artificial intelligence (AI) system for natural language processing and offers significant potential in medicine, for example for treatment advice. Recent studies have also shown promising results using ChatGPT for emergency medicine triage; however, its diagnostic accuracy in the emergency department has not been evaluated.

Objective:

This study compares the diagnostic accuracy of ChatGPT versions 3.5 and 4 against primary treating resident physicians in an emergency room (ER) setting.

Methods:

Diagnostic accuracy was assessed in 100 adults admitted to our ER in January 2023 for internal medicine issues by comparing the diagnoses of ER resident physicians and of ChatGPT versions 3.5 and 4 against the final hospital discharge diagnosis, using a point system to grade accuracy.

Results:

The 100 enrolled patients (median age 72 years) were admitted to our internal medicine emergency department, primarily for cardiovascular, endocrine or gastrointestinal, and infectious diseases. ChatGPT-4 outperformed both ChatGPT-3.5 (p < 0.001) and ER resident physicians (p = 0.012) in diagnostic accuracy for internal medicine emergencies. Across the disease subgroups, ChatGPT-4 also consistently outperformed ChatGPT-3.5 and resident physicians, with statistically significant superiority in cardiovascular diseases (ChatGPT-4 vs. ER physicians: p = 0.029) and endocrine or gastrointestinal diseases (ChatGPT-4 vs. ChatGPT-3.5: p = 0.014); in the other categories, the differences were not statistically significant.

Conclusions:

In this study comparing the diagnostic accuracy of ChatGPT-3.5, ChatGPT-4, and ER resident physicians against a discharge diagnosis gold standard, ChatGPT-4 outperformed both the resident physicians and its predecessor, ChatGPT-3.5. Despite the study's retrospective design and limited sample size, its results underscore AI's potential as a supportive diagnostic tool in ER settings.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.