Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 30, 2023
Date Accepted: Dec 11, 2023
Learning to Make Rare and Complex Diagnoses with Generative AI Assistance
ABSTRACT
Background:
Patients with rare and complex conditions often experience delayed diagnoses and misdiagnoses, as comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains.
Objective:
This study explores the potential of three popular LLMs (Bard, GPT-3.5, and GPT-4) in medical education to enhance the diagnosis of rare and complex diseases, while investigating the impact of prompt engineering on their performance.
Methods:
To achieve these objectives, we conducted experiments on publicly available complex and rare cases. We implemented various prompt strategies, evaluating the performance of models with both open-ended and multiple-choice prompts. Additionally, we employed a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability.
Results:
Remarkably, all LLMs outperformed the average human consensus by a margin of at least 5% across all 30 DC3 complex case challenges. In addition, Bard outperformed the average human consensus by 14% on the frequently misdiagnosed cases, while GPT-4 and GPT-3.5 surpassed human respondents on the moderately misdiagnosed cases by a margin of at least 11%. On the MIMIC-III dataset, Bard and GPT-4 achieved a diagnostic accuracy of 93%, while GPT-3.5 scored 73%. Furthermore, the majority voting strategy, particularly with GPT-4, achieved the highest overall score across all DC3 cases, surpassing the other LLMs. Our results also demonstrate that there is no one-size-fits-all prompting approach: no single strategy improves performance universally across all LLMs.
Conclusions:
Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners utilizing these language models for medical training. Furthermore, this research represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.