Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 12, 2024
Date Accepted: Sep 12, 2024
Enhancing Diagnostic Accuracy through Multi-Agent Conversations: Using Large Language Models to Mitigate Cognitive Bias
ABSTRACT
Background:
Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes. Addressing these biases presents a formidable challenge in the medical field.
Objective:
This study explores the role of large language models (LLMs) in mitigating these biases through the use of a multi-agent framework. We simulated clinical decision-making processes through multi-agent conversations and evaluated their efficacy in improving diagnostic accuracy compared with humans.
Methods:
A total of 16 published and unpublished case reports in which cognitive biases resulted in misdiagnosis were identified from the literature. In the multi-agent framework, we leveraged GPT-4 to facilitate interactions among different simulated agents to replicate clinical team dynamics. Each agent was assigned a distinct role: 1) making the final diagnosis after considering the discussions, 2) acting as a devil’s advocate to correct confirmation and anchoring biases, 3) serving as a field expert in the required medical subspecialty, 4) facilitating discussions to mitigate premature closure bias, and 5) recording and summarizing findings. We tested varying combinations of these agents within the framework to determine which configuration yielded the highest rate of correct final diagnoses. Each scenario was repeated 5 times for consistency. The accuracies of the initial diagnoses and the final differential diagnoses were evaluated, and comparisons with human-generated answers were made using Fisher’s exact test.
Results:
A total of 240 responses were evaluated across 3 different multi-agent frameworks. The initial diagnosis had an accuracy of 0% (0/80). However, following multi-agent discussions, the accuracy for the top two differential diagnoses increased to 76.3% for the best-performing multi-agent framework (Framework 4-C), significantly higher than the accuracy achieved by human evaluators (OR=3.49, p=0.002).
Conclusions:
The multi-agent framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. Additionally, the LLM-driven multi-agent conversation framework shows promise in enhancing diagnostic accuracy in diagnostically challenging medical scenarios.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.