Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 15, 2023
Date Accepted: Apr 29, 2024

The final, peer-reviewed published version of this preprint can be found here:

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

Lahat A, Sharif K, Zoabi N, Shneor Patt Y, Sharif Y, Fisher F, Shani U, Arow M, Levin R, Klang E

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

J Med Internet Res 2024;26:e54571

DOI: 10.2196/54571

PMID: 38935937

PMCID: 11240076

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: A Comparative Analysis of GPT-3.5 and GPT-4

  • Adi Lahat; 
  • Kassem Sharif; 
  • Narmin Zoabi; 
  • Yonatan Shneor Patt; 
  • Yusra Sharif; 
  • F. Fisher; 
  • Uria Shani; 
  • Mohamad Arow; 
  • Roni Levin; 
  • Eyal Klang

ABSTRACT

Background:

Artificial intelligence (AI), particularly chatbot systems, is becoming an instrumental tool in healthcare, aiding clinical decision-making and patient engagement.

Objective:

To analyze the performance of GPT-3.5 and GPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in healthcare decision-making.

Methods:

Four specialized physicians formulated 176 real-world clinical questions. Both senior physicians and residents evaluated the answers generated by GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, benefit, and completeness.

Results:

Both GPT models received high scores (4.4 ± 0.8 for GPT-4, 4.1 ± 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (4.6 vs 4.0 and 4.6 vs 4.1, respectively; p<0.001), and rated GPT-3.5 similarly (4.1 vs 3.7 and 3.9 vs 3.5; p<0.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across the accuracy and completeness criteria. Distinctions among question types were significant, particularly for GPT-4's completeness across emergency, internal medicine, and ethical questions (4.2 ± 1.0, 4.3 ± 0.8, and 4.5 ± 0.7; p<0.001), and for GPT-3.5's accuracy, benefit, and completeness dimensions.
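The summary statistics above are per-category means and standard deviations of 1-5 ratings. A minimal sketch of that aggregation, using made-up ratings rather than the study's data, might look like:

```python
from statistics import mean, stdev

# Hypothetical 1-5 ratings for one model across the five evaluation
# categories; these numbers are illustrative only, not the study's data.
ratings = {
    "accuracy":     [5, 4, 4, 5, 3],
    "relevance":    [4, 4, 5, 4, 4],
    "clarity":      [5, 5, 4, 4, 4],
    "benefit":      [4, 3, 4, 5, 4],
    "completeness": [4, 4, 4, 3, 5],
}

def summarize(scores):
    """Return (mean, sample standard deviation), rounded to 2 decimals."""
    return round(mean(scores), 2), round(stdev(scores), 2)

for category, values in ratings.items():
    m, s = summarize(values)
    print(f"{category}: {m} ± {s}")
```

In the study itself, rater groups (seniors vs residents) and question types (emergency, internal medicine, ethical) would each get their own such summaries before significance testing.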

Conclusions:

ChatGPT's potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatment, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments. Clinical Trial: N/A


 Citation

Please cite as:

Lahat A, Sharif K, Zoabi N, Shneor Patt Y, Sharif Y, Fisher F, Shani U, Arow M, Levin R, Klang E

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4

J Med Internet Res 2024;26:e54571

DOI: 10.2196/54571

PMID: 38935937

PMCID: 11240076


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.