
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 2, 2023
Date Accepted: Jul 27, 2023

The final, peer-reviewed published version of this preprint can be found here:

Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study

Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad A, Landman A, Dreyer K, Succi M

Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study

J Med Internet Res 2023;25:e48659

DOI: 10.2196/48659

PMID: 37606976

PMCID: 10481210

Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study

  • Arya Rao; 
  • Michael Pang; 
  • John Kim; 
  • Meghana Kamineni; 
  • Winston Lie; 
  • Anoop Prasad; 
  • Adam Landman; 
  • Keith Dreyer; 
  • Marc Succi

ABSTRACT

Background:

Large language model (LLM) artificial intelligence (AI) chatbots direct the power of large training datasets toward successive, related tasks rather than the single-ask tasks for which AI already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as virtual physicians, has not yet been evaluated.

Objective:

To evaluate ChatGPT’s capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.

Methods:

We entered all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnosis, diagnostic testing, final diagnosis, and management questions across patient age, gender, and case acuity. We measured the proportion of correct responses to the questions posed within the clinical vignettes tested.

Results:

ChatGPT achieved 71.7% (95% CI, 69.3% to 74.1%) accuracy overall across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis, with an accuracy of 76.9% (95% CI, 67.8% to 86.1%), and the lowest performance in generating an initial differential diagnosis, with an accuracy of 60.3% (95% CI, 54.2% to 66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%, p<0.001) and clinical management (β=-7.4%, p=0.02) questions.

Conclusions:

ChatGPT achieves impressive accuracy in clinical decision making, with particular strengths emerging as it has more clinical information at its disposal.






© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.