Accepted for/Published in: JMIR Medical Education
Date Submitted: Mar 23, 2023
Date Accepted: Sep 5, 2023
Accuracy and racial bias of Generative Pre-trained Transformer-4 (GPT-4) for diagnosis and triage of health conditions
ABSTRACT
Background:
Whether Generative Pre-trained Transformer-4 (GPT-4), a conversational artificial intelligence, can accurately diagnose and triage health conditions, and whether it exhibits racial bias in its decisions, remains unclear.
Objective:
To assess the accuracy of GPT-4 in the diagnosis and triage of health conditions, and whether its performance varies by patient race.
Methods:
In February and March 2023, we compared the performance of GPT-4 and physicians using 45 typical clinical vignettes, each with a correct diagnosis and triage level. For each of the 45 clinical vignettes, GPT-4 and three board-certified physicians provided the most likely primary diagnosis and triage level (emergency, non-emergency, or self-care). Independent reviewers evaluated the diagnoses as "correct" or "incorrect." The physician diagnosis was defined as the consensus of the three physicians. To evaluate whether the performance of GPT-4 differed by patient race, we added information on patient race to the clinical vignettes.
Results:
The accuracy of diagnosis was comparable between GPT-4 and physicians (percentage of correct diagnoses, 97.8% [95%CI 88.2%-99.9%] for GPT-4 vs. 91.1% [95%CI 78.8%-97.5%] for physicians; P=0.38). GPT-4 provided appropriate reasoning for 97.8% of vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (appropriate triage, 66.7% [95%CI 51.0%-80.0%] for GPT-4 vs. 66.7% [95%CI 51.0%-80.0%] for physicians; P=0.99). The performance of GPT-4 in diagnosing health conditions did not differ between Black and White patients when information on patient race was added to the clinical vignettes. The accuracy of triage also did not differ substantially when patient race information was added (appropriate triage, 62.2% [95%CI 46.5%-76.2%] for Black vs. 66.7% [95%CI 51.0%-80.0%] for White patients; P=0.63).
Conclusions:
GPT-4's ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not differ by patient race. These findings should be informative for health systems considering introducing conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
Clinical Trial: None
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.