Accepted for/Published in: JMIR mHealth and uHealth
Date Submitted: Jun 16, 2023
Open Peer Review Period: Jun 15, 2023 - Jul 3, 2023
Date Accepted: Aug 25, 2023
Comparison of diagnostic and triage accuracy of Ada and WebMD symptom checkers, ChatGPT and physicians for patients in an emergency department: clinical data analysis study
ABSTRACT
Background:
Diagnosis is a core component of effective healthcare, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other healthcare workers. Systems termed symptom checkers (SCs) have been designed to improve diagnosis and triage (deciding which level of care to seek) by patients. We studied the performance of the new large language model ChatGPT and the widely used WebMD SC in the diagnosis and triage of patients with urgent or emergent clinical problems.
Objective:
To evaluate the diagnosis and triage performance of ChatGPT 3.5 and 4.0, and the WebMD and Ada SCs with data entered by patients in an Emergency Department (ED), compared to final ED diagnoses and physician reviews.
Methods:
In an earlier study in the ED at Rhode Island Hospital, USA, 40 patients were recruited to use the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data collected by Ada were entered into ChatGPT versions 3.5 and 4.0 and into WebMD by a research assistant blinded to the ED diagnoses and triage. Diagnoses were compared with the final diagnoses from the patient’s ED physician. Three independent ED physicians blindly reviewed the clinical data from Ada and gave their own diagnoses and triage recommendations. We calculated the proportion of the diagnoses from ChatGPT, Ada, WebMD, and the independent physicians that matched at least one ED diagnosis, stratified as Top 1 (M1) or Top 3 (M3) matches. Triage recommendations from ChatGPT and WebMD were classified as agreeing with at least 2 of the independent physicians, unsafe, or too cautious.
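To make the M1 and M3 calculations concrete, the sketch below shows one way such match rates could be computed. It is illustrative only and is not the authors’ analysis code: the case structure and the string-based matches() stand-in are assumptions, since in the study whether a suggested diagnosis matched an ED diagnosis was judged clinically.

```python
# Illustrative sketch (not the authors' code): Top-1 (M1) and Top-3 (M3)
# match rates for one tool against the final ED diagnoses.

def matches(candidate, ed_diagnoses):
    """Crude stand-in for the clinical judgement that a suggested
    diagnosis corresponds to at least one final ED diagnosis."""
    return candidate.strip().lower() in {d.strip().lower() for d in ed_diagnoses}

def match_rates(cases):
    """cases: list of {'suggested': ranked diagnoses, 'ed': final ED diagnoses}."""
    n = len(cases)
    m1 = sum(matches(c["suggested"][0], c["ed"]) for c in cases if c["suggested"])
    m3 = sum(any(matches(s, c["ed"]) for s in c["suggested"][:3]) for c in cases)
    return m1 / n, m3 / n

demo = [
    {"suggested": ["appendicitis", "gastroenteritis", "ovarian cyst"], "ed": ["appendicitis"]},
    {"suggested": ["tension headache", "migraine", "sinusitis"], "ed": ["migraine"]},
]
m1, m3 = match_rates(demo)
print(f"M1 = {m1:.0%}, M3 = {m3:.0%}")  # M1 = 50%, M3 = 100%
```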
Results:
Thirty cases had sufficient data for diagnostic analysis and 37 for triage analysis. M1 diagnostic matches were: Ada 9 (30%), ChatGPT-3.5 12 (40%), ChatGPT-4.0 10 (33%), WebMD 12 (40%), and physicians’ mean rate 47%. M3 diagnostic matches were: Ada 19 (63%), ChatGPT-3.5 19 (63%), ChatGPT-4.0 15 (50%), WebMD 17 (57%), and physicians’ mean rate 69%. Triage accuracy results were: Ada agree 23 (62%), unsafe 5 (14%), too cautious 9 (24%); ChatGPT-3.5 agree 22 (59%), unsafe 15 (41%), too cautious 0 (0%); ChatGPT-4.0 agree 28 (76%), unsafe 8 (22%), too cautious 1 (3%); WebMD agree 26 (70%), unsafe 7 (19%), too cautious 4 (11%). The ChatGPT-3.5 unsafe triage rate of 41% was significantly higher than Ada’s rate of 14% (p=.0088).
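The abstract does not state which statistical test produced p=.0088. As a hedged illustration only (not the authors’ analysis code), the sketch below compares the two unsafe-triage proportions reported above (ChatGPT-3.5: 15/37 vs Ada: 5/37) using two common tests for proportions.

```python
# Illustrative sketch: two-sided comparison of the unsafe-triage proportions
# reported in the Results (ChatGPT-3.5: 15/37 vs Ada: 5/37). The choice of
# test is an assumption; the abstract does not name the test used.
from scipy.stats import fisher_exact
from statsmodels.stats.proportion import proportions_ztest

n = 37                            # cases in the triage analysis
unsafe_gpt35, unsafe_ada = 15, 5  # unsafe recommendations from the Results

# Pooled two-proportion z-test
z_stat, p_z = proportions_ztest([unsafe_gpt35, unsafe_ada], [n, n])

# Fisher's exact test on the 2x2 table (unsafe vs not unsafe)
_, p_fisher = fisher_exact([[unsafe_gpt35, n - unsafe_gpt35],
                            [unsafe_ada, n - unsafe_ada]])
print(f"two-proportion z-test p = {p_z:.4f}; Fisher exact p = {p_fisher:.4f}")
```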
Conclusions:
ChatGPT-3.5 had high diagnostic accuracy but a very high unsafe triage rate. ChatGPT-4.0 had the poorest diagnostic accuracy but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.