Accepted for/Published in: JMIRx Med
Date Submitted: Oct 17, 2024
Open Peer Review Period: Oct 17, 2024 - Dec 12, 2024
Date Accepted: Jul 23, 2025
Rapidly benchmarking Large Language Models for diagnosing comorbid patients: A comparative study leveraging the LLM-as-a-judge method
ABSTRACT
Background:
Diagnostic errors contribute to approximately one in ten patient deaths, and medical errors are the third leading cause of death in the US.
Objective:
While large language models (LLMs) have been proposed to assist physicians with diagnosis, no study has yet compared the diagnostic ability of a broad range of popular LLMs on an openly accessible real-patient cohort.
Methods:
In this study, we compare LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic using a previously established evaluation methodology and explore improving their accuracy with retrieval-augmented generation (RAG).
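To make the LLM-as-a-judge method named in the title concrete, the following is a minimal Python sketch of the general idea: a judge model decides, for each ground-truth condition, whether a candidate model's diagnostic output covers it. The prompt wording, judge model choice, and case fields here are illustrative assumptions, not the paper's exact protocol.

    # Minimal LLM-as-a-judge sketch (assumptions: openai>=1.0 client,
    # OPENAI_API_KEY set in the environment, GPT-4o as the judge model).
    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are a medical evaluation judge.
    Ground-truth condition: {condition}
    Candidate model's diagnostic assessment:
    {assessment}

    Answer with exactly one word, MATCH or MISS, indicating whether the
    assessment identifies the ground-truth condition (synonyms count)."""

    def judge_condition(condition: str, assessment: str) -> bool:
        """Return True if the judge model deems the condition covered."""
        response = client.chat.completions.create(
            model="gpt-4o",  # judge model; an assumption, not the paper's choice
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(condition=condition,
                                                      assessment=assessment)}],
            temperature=0,
        )
        return response.choices[0].message.content.strip().upper() == "MATCH"

    def miss_rate(cases: list[dict]) -> float:
        """Fraction of ground-truth conditions the candidate model missed."""
        total = hits = 0
        for case in cases:
            for condition in case["ground_truth_conditions"]:
                total += 1
                hits += judge_condition(condition, case["model_assessment"])
        return 1 - hits / total if total else 0.0

The miss rate computed this way corresponds to the error-rate figures reported in the Results below.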
Results:
We found that OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet were the top performers, missing only 0.5% of ground-truth conditions that were clearly inferable from the available data; RAG further reduced this error rate to 0.2%.
Conclusions:
While these results are promising, more diverse datasets, hospital pilots, and close collaboration with physicians are needed to better understand the diagnostic ability of these models.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.