Accepted for/Published in: JMIR Medical Education
Date Submitted: Mar 23, 2023
Date Accepted: Sep 5, 2023
Accuracy and racial bias of Generative Pre-trained Transformer-4 (GPT-4) for diagnosis and triage of health conditions
ABSTRACT
Background:
Whether Generative Pre-trained Transformer-4 (GPT-4), a conversational artificial intelligence, can accurately diagnose and triage health conditions, and whether it exhibits racial bias in its decisions, remains unclear.
Objective:
To assess the accuracy of GPT-4 in the diagnosis and triage of health conditions, and whether its performance varies by patient race.
Methods:
In February and March 2023, we compared the performance of GPT-4 and physicians using 45 typical clinical vignettes, each with a correct diagnosis and triage level. For each of the 45 clinical vignettes, GPT-4 and three board-certified physicians provided the most likely primary diagnosis and triage level (emergency, non-emergency, or self-care). Independent reviewers evaluated the diagnoses as "correct" or "incorrect." The physician diagnosis was defined as the consensus of the three physicians. To evaluate whether the performance of GPT-4 differed by patient race, we added information on patient race to the clinical vignettes.
Results:
The accuracy of diagnosis was comparable between GPT-4 and physicians (percentage of correct diagnoses, 97.8% [95%CI 88.2%-99.9%] for GPT-4 vs. 91.1% [95%CI 78.8%-97.5%] for physicians; P=0.38). GPT-4 provided appropriate reasoning for 97.8% of vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (appropriate triage, 66.7% [95%CI 51.0%-80.0%] for GPT-4 vs. 66.7% [95%CI 51.0%-80.0%] for physicians; P=0.99). The performance of GPT-4 in diagnosing health conditions did not differ between Black and White patients when information on patient race was added to the clinical vignettes. The accuracy of triage also did not differ substantially when patient race information was added (appropriate triage, 62.2% [95%CI 46.5%-76.2%] for Black vs. 66.7% [95%CI 51.0%-80.0%] for White patients; P=0.63).
Conclusions:
GPT-4's ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not differ by patient race. These findings should be informative for health systems considering introducing conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
Clinical Trial: None
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.