
Accepted for/Published in: JMIR Formative Research

Date Submitted: Apr 8, 2024
Open Peer Review Period: Apr 9, 2024 - Jun 4, 2024
Date Accepted: May 4, 2024

The final, peer-reviewed published version of this preprint can be found here:

Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases

Hirosawa T, Harada Y, Mizuta K, Sakamoto T, Tokumasu K, Shimizu T

JMIR Form Res 2024;8:e59267

DOI: 10.2196/59267

PMID: 38924784

PMCID: 11237772


ABSTRACT

Background:

The potential of artificial intelligence (AI) chatbots, particularly the fourth-generation chat generative pretrained transformer (ChatGPT-4), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential-diagnosis lists.

Objective:

This study aimed to assess ChatGPT-4's capability to identify the final diagnosis within differential-diagnosis lists and to compare its performance with that of physicians, using a case report series.

Methods:

We used a database of differential-diagnosis lists generated from case reports in the American Journal of Case Reports, each paired with its final diagnosis. The lists were generated by three AI systems: ChatGPT-4, Google Bard (currently Google Gemini), and the Large Language Models by Meta AI 2 chatbot. None of these AIs received additional medical training or reinforcement. The primary outcome was whether ChatGPT-4's evaluations identified the final diagnosis within these lists. For comparison, two independent physicians also evaluated the lists, with any inconsistencies resolved by another physician.

Results:

The three AIs generated a total of 1,176 differential-diagnosis lists from 392 case descriptions. ChatGPT-4's evaluations concurred with the physicians' evaluations for 966 of the 1,176 lists (82.1%). The Cohen kappa coefficient was 0.63 (95% CI 0.56-0.69), indicating fair to good agreement between ChatGPT-4 and the physicians.
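The agreement statistic reported above, Cohen's kappa, corrects the raw agreement rate for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the chance agreement implied by each rater's label frequencies. As a minimal sketch (the judgment vectors below are hypothetical, not the study's data):

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over all labels.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary judgments for 10 lists:
# 1 = final diagnosis judged present in the differential-diagnosis list.
gpt4_eval = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
phys_eval = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(cohen_kappa(gpt4_eval, phys_eval), 2))  # → 0.52
```

Here 8 of 10 judgments agree (p_o = 0.8), but both raters mark "present" 70% of the time, so p_e = 0.58 and kappa drops to about 0.52, illustrating why kappa is a stricter measure than percent agreement.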

Conclusions:

ChatGPT-4 demonstrated fair to good agreement with physicians in identifying the final diagnosis within differential-diagnosis lists for a case report series. Its ability to compare differential-diagnosis lists with final diagnoses suggests its potential for supporting clinical decision-making through diagnostic feedback. Although ChatGPT-4 showed fair to good agreement on this evaluation task, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.

Clinical Trial: Not applicable



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.