
Accepted for/Published in: JMIRx Med

Date Submitted: Oct 17, 2024
Open Peer Review Period: Oct 17, 2024 - Dec 12, 2024
Date Accepted: Jul 23, 2025

The final, peer-reviewed published version of this preprint can be found here:

Sarvari P, Al-fagih Z

Rapidly Benchmarking Large Language Models for Diagnosing Comorbid Patients: Comparative Study Leveraging the LLM-as-a-Judge Method

JMIRx Med 2025;6:e67661

DOI: 10.2196/67661

PMID: 40880236

PMCID: 12396308

Rapidly benchmarking Large Language Models for diagnosing comorbid patients: A comparative study leveraging the LLM-as-a-judge method

  • Peter Sarvari; 
  • Zaid Al-fagih

ABSTRACT

Background:

On average, one in ten patients dies because of a diagnostic error, and medical errors are the third leading cause of death in the US.

Objective:

While large language models (LLMs) have been proposed to assist physicians with diagnosis, no study has been published comparing the diagnostic ability of a broad range of popular LLMs on an openly accessible real-patient cohort.

Methods:

In this study, we compare LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic using a previously established evaluation methodology, and we explore improving their accuracy with retrieval-augmented generation (RAG).
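To make the evaluation methodology concrete, the sketch below shows the general shape of an LLM-as-a-judge scoring loop: for each patient, a judge decides whether any model-proposed diagnosis covers each ground-truth condition, and the missed-condition rate is reported. All names and data here are illustrative assumptions, not the authors' implementation, and the judge is stubbed with a trivial string check in place of a real judge-model call.

```python
def judge_matches(candidate: str, truth: str) -> bool:
    """Stand-in for an LLM judge call (hypothetical). In practice this would
    prompt a judge model to decide whether the candidate diagnosis covers
    the ground-truth condition; here it is a trivial substring check."""
    return truth.lower() in candidate.lower()


def missed_condition_rate(cases: list) -> float:
    """Fraction of ground-truth conditions not matched by any model diagnosis."""
    total = missed = 0
    for case in cases:
        for condition in case["ground_truth"]:
            total += 1
            if not any(judge_matches(d, condition) for d in case["diagnoses"]):
                missed += 1
    return missed / total if total else 0.0


# Illustrative toy data, not from the study's cohort.
cases = [
    {"ground_truth": ["sepsis", "acute kidney injury"],
     "diagnoses": ["Sepsis secondary to pneumonia", "Acute kidney injury"]},
    {"ground_truth": ["heart failure"],
     "diagnoses": ["COPD exacerbation"]},
]
print(missed_condition_rate(cases))  # 1 of 3 conditions missed
```

Swapping the substring check for a judge-model API call, with a prompt asking whether the candidate clinically covers the condition, yields the kind of rapid benchmark the study describes.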

Results:

We found that OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet were the top performers, each missing only 0.5% of ground-truth conditions that were clearly inferable from the available data; RAG further reduced this error rate to 0.2%.

Conclusions:

While these results are promising, more diverse datasets, hospital pilots, and close collaboration with physicians are needed to better understand the diagnostic ability of these models.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.