
Accepted for/Published in: JMIR Medical Education

Date Submitted: Jul 30, 2023
Date Accepted: Dec 11, 2023

The final, peer-reviewed published version of this preprint can be found here:

Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models

Abdullahi T, Singh R, Eickhoff C

JMIR Med Educ 2024;10:e51391

DOI: 10.2196/51391

PMID: 38349725

PMCID: 10900078

Learning to Make Rare and Complex Diagnoses with Generative AI Assistance

  • Tassallah Abdullahi; 
  • Ritambhara Singh; 
  • Carsten Eickhoff

ABSTRACT

Background:

Patients with rare and complex conditions often experience delayed diagnoses and misdiagnoses, as comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains.

Objective:

This study explores the potential of 3 popular LLMs (Bard, GPT-3.5, and GPT-4) in medical education to enhance the diagnosis of rare and complex diseases, while investigating the impact of prompt engineering on their performance.

Methods:

To achieve these objectives, we conducted experiments on publicly available complex and rare cases. We implemented various prompt strategies, evaluating the performance of models with both open-ended and multiple-choice prompts. Additionally, we employed a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability.

Results:

All LLMs outperformed the average human consensus by a margin of at least 5% across all 30 DC3 complex case challenges. Bard outperformed the human consensus by 14% on frequently misdiagnosed cases, while GPT-4 and GPT-3.5 surpassed human respondents on moderately misdiagnosed cases by a margin of at least 11%. On the MIMIC-III dataset, Bard and GPT-4 each achieved a diagnostic accuracy of 93%, while GPT-3.5 scored 73%. Furthermore, the majority voting strategy, particularly with GPT-4, yielded the highest overall score across all DC3 cases. Our results also indicate that there is no one-size-fits-all prompting approach: a strategy that improves one LLM's performance does not necessarily transfer to another.

Conclusions:

Our findings shed light on the diagnostic capabilities of LLMs and the challenge of identifying a prompting strategy that aligns with each model's characteristics and the specific task requirements. They underscore the importance of prompt engineering and provide valuable insights for researchers and practitioners who use these models in medical training. Furthermore, this research represents a step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for effective educational tools and accurate diagnostic aids that improve patient care and outcomes.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.