Accepted for/Published in: JMIRx Med
Date Submitted: Oct 17, 2024
Open Peer Review Period: Oct 17, 2024 - Dec 12, 2024
Date Accepted: Jul 23, 2025
Rapidly benchmarking Large Language Models for diagnosing comorbid patients: A comparative study leveraging the LLM-as-a-judge method
ABSTRACT
Background:
Diagnostic errors contribute to approximately one in ten patient deaths, and medical errors are the third leading cause of death in the US.
Objective:
While large language models (LLMs) have been proposed to assist physicians with diagnosis, no study has yet compared the diagnostic ability of a broad range of popular LLMs on an openly accessible real-patient cohort.
Methods:
In this study, we compare LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic using a previously established evaluation methodology and explore improving their accuracy with retrieval-augmented generation (RAG).
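To make the LLM-as-a-judge method named in the title concrete, the following is a minimal Python sketch of the general idea: a judge model decides, for each ground-truth condition, whether a candidate model's diagnostic output covers it. The prompt wording, judge model choice, and case fields here are illustrative assumptions, not the paper's exact protocol.

    # Minimal LLM-as-a-judge sketch (assumptions: openai>=1.0 client,
    # OPENAI_API_KEY set in the environment, GPT-4o as the judge model).
    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are a medical evaluation judge.
    Ground-truth condition: {condition}
    Candidate model's diagnostic assessment:
    {assessment}

    Answer with exactly one word, MATCH or MISS, indicating whether the
    assessment identifies the ground-truth condition (synonyms count)."""

    def judge_condition(condition: str, assessment: str) -> bool:
        """Return True if the judge model deems the condition covered."""
        response = client.chat.completions.create(
            model="gpt-4o",  # judge model; an assumption, not the paper's choice
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(condition=condition,
                                                      assessment=assessment)}],
            temperature=0,
        )
        return response.choices[0].message.content.strip().upper() == "MATCH"

    def miss_rate(cases: list[dict]) -> float:
        """Fraction of ground-truth conditions the candidate model missed."""
        total = hits = 0
        for case in cases:
            for condition in case["ground_truth_conditions"]:
                total += 1
                hits += judge_condition(condition, case["model_assessment"])
        return 1 - hits / total if total else 0.0

The miss rate computed this way corresponds to the error-rate figures reported in the Results below.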
Results:
We found that OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet were the top performers, missing only 0.5% of ground-truth conditions that were clearly inferable from the available data; RAG further reduced this error rate to 0.2%.
Conclusions:
While these results are promising, more diverse datasets, hospital pilots, and close collaboration with physicians are needed to better understand the diagnostic ability of these models.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.