
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 2, 2023
Date Accepted: Jul 27, 2023

The final, peer-reviewed published version of this preprint can be found here:

Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study

Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad A, Landman A, Dreyer K, Succi M

Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study

J Med Internet Res 2023;25:e48659

DOI: 10.2196/48659

PMID: 37606976

PMCID: 10481210

Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study

  • Arya Rao; 
  • Michael Pang; 
  • John Kim; 
  • Meghana Kamineni; 
  • Winston Lie; 
  • Anoop Prasad; 
  • Adam Landman; 
  • Keith Dreyer; 
  • Marc Succi

ABSTRACT

Background:

Large language model (LLM) artificial intelligence (AI) chatbots direct the power of large training datasets toward successive, related tasks rather than the single-ask tasks for which AI already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as virtual physicians, has not yet been evaluated.

Objective:

To evaluate ChatGPT’s capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.

Methods:

We entered all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnosis, diagnostic testing, final diagnosis, and management questions across patient age, gender, and case acuity. We measured the proportion of correct responses to the questions posed within the clinical vignettes tested.

Results:

ChatGPT achieved 71.7% (95% CI, 69.3% to 74.1%) accuracy overall across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI, 67.8% to 86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI, 54.2% to 66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%, p<0.001) and clinical management (β=-7.4%, p=0.02) questions.

Conclusions:

ChatGPT achieves impressive accuracy in clinical decision making, with particular strengths emerging as it has more clinical information at its disposal.






© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.