Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 15, 2023
Date Accepted: Apr 29, 2024

The final, peer-reviewed published version of this preprint can be found here:

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

Lahat A, Sharif K, Zoabi N, Shneor Patt Y, Sharif Y, Fisher F, Shani U, Arow M, Levin R, Klang E

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

J Med Internet Res 2024;26:e54571

DOI: 10.2196/54571

PMID: 38935937

PMCID: 11240076

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: A Comparative Analysis of GPT-3.5 and GPT-4

  • Adi Lahat; 
  • Kassem Sharif; 
  • Narmin Zoabi; 
  • Yonatan Shneor Patt; 
  • Yusra Sharif; 
  • F. Fisher; 
  • Uria Shani; 
  • Mohamad Arow; 
  • Roni Levin; 
  • Eyal Klang

ABSTRACT

Background:

Artificial intelligence (AI), particularly chatbot systems, is becoming an instrumental tool in healthcare, aiding clinical decision-making and patient engagement.

Objective:

To analyze the performance of GPT-3.5 and GPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in healthcare decision-making.

Methods:

Four specialized physicians formulated 176 real-world clinical questions. Both senior physicians and residents evaluated the answers generated by GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, benefit, and completeness.

Results:

Both GPT models received high scores (4.4 ± 0.8 for GPT-4, 4.1 ± 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (4.6 vs 4.0 and 4.6 vs 4.1, respectively; p<0.001), and rated GPT-3.5 similarly (4.1 vs 3.7 and 3.9 vs 3.5; p<0.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across the accuracy and completeness criteria. Distinctions among question types were significant, particularly for GPT-4's completeness across emergency, internal medicine, and ethical questions (4.2 ± 1.0, 4.3 ± 0.8, and 4.5 ± 0.7; p<0.001), and for GPT-3.5's accuracy, benefit, and completeness dimensions.
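The summary statistics above are per-category means and standard deviations of 1-5 ratings. A minimal sketch of that aggregation, using made-up ratings rather than the study's data, might look like:

```python
from statistics import mean, stdev

# Hypothetical 1-5 ratings for one model across the five evaluation
# categories; these numbers are illustrative only, not the study's data.
ratings = {
    "accuracy":     [5, 4, 4, 5, 3],
    "relevance":    [4, 4, 5, 4, 4],
    "clarity":      [5, 5, 4, 4, 4],
    "benefit":      [4, 3, 4, 5, 4],
    "completeness": [4, 4, 4, 3, 5],
}

def summarize(scores):
    """Return (mean, sample standard deviation), rounded to 2 decimals."""
    return round(mean(scores), 2), round(stdev(scores), 2)

for category, values in ratings.items():
    m, s = summarize(values)
    print(f"{category}: {m} ± {s}")
```

In the study itself, rater groups (seniors vs residents) and question types (emergency, internal medicine, ethical) would each get their own such summaries before significance testing.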

Conclusions:

ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatment, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments. Clinical Trial: N/A


 Citation

Please cite as:

Lahat A, Sharif K, Zoabi N, Shneor Patt Y, Sharif Y, Fisher F, Shani U, Arow M, Levin R, Klang E

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

J Med Internet Res 2024;26:e54571

DOI: 10.2196/54571

PMID: 38935937

PMCID: 11240076


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.