Accepted for/Published in: JMIR mHealth and uHealth

Date Submitted: Jun 16, 2023
Open Peer Review Period: Jun 15, 2023 - Jul 3, 2023
Date Accepted: Aug 25, 2023

The final, peer-reviewed published version of this preprint can be found here:

Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study

Fraser H, Crossland D, Bacher I, Ranney M, Madsen T, Hilliard R

JMIR Mhealth Uhealth 2023;11:e49995

DOI: 10.2196/49995

PMID: 37788063

PMCID: 10582809

Comparison of diagnostic and triage accuracy of Ada and WebMD symptom checkers, ChatGPT and physicians for patients in an emergency department: clinical data analysis study

  • Hamish Fraser; 
  • Daven Crossland; 
  • Ian Bacher; 
  • Megan Ranney; 
  • Tracy Madsen; 
  • Ross Hilliard

ABSTRACT

Background:

Diagnosis is a core component of effective healthcare, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other healthcare workers. Systems termed symptom checkers (SCs) have been designed to improve diagnosis and triage (deciding which level of care to seek) by patients. We studied the performance of the new large language model ChatGPT and the widely used WebMD SC in the diagnosis and triage of patients with urgent or emergent clinical problems.

Objective:

To evaluate the diagnostic and triage performance of ChatGPT 3.5 and 4.0 and the WebMD and Ada SCs, using data entered by patients in an emergency department (ED), compared with final ED diagnoses and physician reviews.

Methods:

In an earlier study in the ED at Rhode Island Hospital, USA, 40 patients were recruited to use the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data collected by Ada were entered into ChatGPT versions 3.5 and 4.0 and into WebMD by a research assistant blinded to diagnoses and triage. Diagnoses were compared with the final diagnoses from each patient's ED physician. Three independent ED physicians blindly reviewed the clinical data from Ada and gave their own diagnoses and triage recommendations. We calculated the proportion of diagnoses from ChatGPT, Ada, WebMD, and the independent physicians that matched at least one ED diagnosis, stratified as Top 1 (M1) or Top 3 (M3) matches. Triage accuracy was calculated as the number of recommendations from ChatGPT or WebMD that agreed with at least 2 of the independent physicians, or that were rated unsafe or too cautious.
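The M1/M3 matching described above can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the function name is hypothetical, and exact string matching stands in for the clinical judgment the study used to decide whether two diagnoses matched.

```python
# Illustrative sketch (not the study's code): Top-1 (M1) and Top-3 (M3)
# diagnostic match rates against the final ED diagnoses.
def match_rates(ranked_lists, ed_diagnoses):
    """ranked_lists: per-case ranked diagnoses from one tool;
    ed_diagnoses: per-case set of final ED diagnoses.
    Returns (M1 rate, M3 rate)."""
    n = len(ranked_lists)
    # M1: the tool's top diagnosis matches at least one ED diagnosis.
    m1 = sum(1 for ranked, truth in zip(ranked_lists, ed_diagnoses)
             if ranked and ranked[0] in truth)
    # M3: any of the tool's top 3 diagnoses matches an ED diagnosis.
    m3 = sum(1 for ranked, truth in zip(ranked_lists, ed_diagnoses)
             if any(d in truth for d in ranked[:3]))
    return m1 / n, m3 / n
```

For example, with three toy cases whose top suggestions are ["MI", "PE"], ["flu"], and ["appendicitis", "UTI", "gastritis"] and ED diagnoses {"PE"}, {"flu"}, and {"gastritis"}, the M1 rate is 1/3 and the M3 rate is 3/3.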

Results:

Thirty cases had sufficient data for diagnostic analysis and 37 for triage analysis. M1 diagnostic matches were: Ada, 9 (30%); ChatGPT-3.5, 12 (40%); ChatGPT-4.0, 10 (33%); WebMD, 12 (40%); physicians' mean rate, 47%. M3 diagnostic matches were: Ada, 19 (63%); ChatGPT-3.5, 19 (63%); ChatGPT-4.0, 15 (50%); WebMD, 17 (57%); physicians' mean rate, 69%. Triage accuracy results were: Ada, agree 23 (62%), unsafe 5 (14%), too cautious 9 (24%); ChatGPT-3.5, agree 22 (59%), unsafe 15 (41%), too cautious 0 (0%); ChatGPT-4.0, agree 28 (76%), unsafe 8 (22%), too cautious 1 (3%); WebMD, agree 26 (70%), unsafe 7 (19%), too cautious 4 (11%). The ChatGPT-3.5 unsafe triage rate of 41% was significantly higher than Ada's at 15% (p=.0088).
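The abstract does not name the statistical test behind the reported p value. A Pearson chi-square comparison of the two unsafe-triage proportions yields a p value close to the reported .0088, so the sketch below implements that as one plausible reconstruction; the counts (15 of 37 unsafe for ChatGPT-3.5, 5 of 37 for Ada) and the shared denominator of 37 are assumptions taken from the triage results above.

```python
from math import erfc, sqrt

def chi2_proportions(a, n1, c, n2):
    """Pearson chi-square test (1 df, no continuity correction) comparing
    proportions a/n1 and c/n2; returns (chi-square statistic, two-sided p)."""
    b, d = n1 - a, n2 - c          # non-events in each group
    n = n1 + n2
    chi2 = n * (a * d - b * c) ** 2 / (n1 * n2 * (a + c) * (b + d))
    # For 1 df, the chi-square survival function equals the two-sided
    # normal tail: p = erfc(sqrt(chi2) / sqrt(2)).
    p = erfc(sqrt(chi2) / sqrt(2))
    return chi2, p

# Unsafe triage: ChatGPT-3.5 15/37 vs Ada 5/37 (assumed denominators).
chi2, p = chi2_proportions(15, 37, 5, 37)  # p is close to .009
```

With a continuity correction or slightly different denominators the p value shifts a little, which may explain small differences from the published figure.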

Conclusions:

ChatGPT-3.5 had high diagnostic accuracy but a very high unsafe triage rate. ChatGPT-4.0 had the poorest diagnostic accuracy but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.