Accepted for/Published in: JMIR Medical Education

Date Submitted: May 20, 2023
Date Accepted: Oct 20, 2023

The final, peer-reviewed published version of this preprint can be found here:

ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case–Based Questions

Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Blaikie A, Kelsey T, Kuhn S, Eckrich J

JMIR Med Educ 2023;9:e49183

DOI: 10.2196/49183

PMID: 38051578

PMCID: 10731554

ChatGPT vs. Consultants: A Pilot Study on Answering Otorhinolaryngology Case-Based Questions

  • Christoph Raphael Buhr; 
  • Harry Smith; 
  • Tilman Huppertz; 
  • Katharina Bahr-Hamm; 
  • Christoph Matthias; 
  • Andrew Blaikie; 
  • Tom Kelsey; 
  • Sebastian Kuhn; 
  • Jonas Eckrich

ABSTRACT

Background:

Large language models (LLMs) such as ChatGPT are increasingly used in medicine and supplement standard search engines as sources of information. As a result, people increasingly "consult" LLMs about personal medical symptoms.

Objective:

This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers.

Methods:

We used 41 case-based questions from established ORL textbooks and past German state examinations for physicians. Both ORL consultants and ChatGPT 3 answered the questions. The ORL consultants rated all responses except their own for medical adequacy, conciseness, coherence, and comprehensibility on a 6-point Likert scale, and indicated whether they believed each answer was written by an ORL consultant or by ChatGPT. In addition, the character counts of the answers were compared.

Results:

Ratings in all categories were significantly higher for the ORL consultants. Although ChatGPT scored below the consultants throughout, its scores were relatively higher in the semantic categories (conciseness, coherence, and comprehensibility) than in medical adequacy. The ORL consultants correctly identified ChatGPT as the source in over 95% of cases. ChatGPT's answers also had a significantly higher character count than the consultants' answers.
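To illustrate the kind of comparison reported above, the sketch below contrasts two groups of 6-point Likert ratings with a Mann-Whitney U statistic, a common choice for ordinal rating data. This is not the authors' analysis code; the rating values and the choice of test are purely illustrative assumptions.

```python
# Illustrative sketch (hypothetical data): comparing 6-point Likert ratings
# for consultant vs. ChatGPT answers. All values below are invented and do
# not come from the study.
from statistics import mean

consultant = [6, 5, 6, 5, 6, 4, 5, 6]  # hypothetical "medical adequacy" ratings
chatgpt = [4, 3, 5, 4, 3, 4, 2, 4]

def mann_whitney_u(a, b):
    """U statistic: count of (a_i, b_j) pairs with a_i > b_j; ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

u = mann_whitney_u(consultant, chatgpt)
print(f"mean consultant = {mean(consultant):.2f}, mean ChatGPT = {mean(chatgpt):.2f}")
print(f"U = {u} out of {len(consultant) * len(chatgpt)} possible pairs")
```

A U close to the maximum number of pairs indicates that one group's ratings almost uniformly exceed the other's; in practice one would obtain a p value from the U distribution (e.g., via a statistics package) rather than from this bare statistic.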

Conclusions:

While ChatGPT provided longer answers to the medical problems, its medical adequacy and conciseness were rated significantly lower than those of the ORL consultants' answers. LLMs have potential as augmentative tools in medical care, but "consulting" them about medical problems carries a high risk of misinformation, as their high semantic quality may mask contextual deficits. Clinical Trial: In written correspondence of March 3, 2023, the ethics committee of the regional medical association Rhineland-Palatinate determined that no specific ethical approval was needed because anonymous, text-based questions were used.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.