Currently submitted to: Journal of Medical Internet Research
Date Submitted: Apr 1, 2026
Open Peer Review Period: Apr 1, 2026 - May 27, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
MedEvalArena: A Self-Generated, Peer-Judged Benchmark for Medical Reasoning
ABSTRACT
Background:
Large Language Models (LLMs) demonstrate strong performance on medical specialty board multiple-choice question (MCQ) answering; however, they underperform in more complex medical reasoning scenarios. This gap indicates a need to improve both LLM medical reasoning and the paradigms used to evaluate it.
Objective:
To develop an automated framework to evaluate LLM capabilities in medical reasoning.
Methods:
MedEvalArena is an automated framework in which LLMs engage in a symmetric round-robin format: each model generates challenging board-style medical MCQs, then serves on an ensemble LLM-as-judge panel to adjudicate the validity of generated questions, and finally completes the validated exam as an examinee. We compared the performance of leading LLMs from the OpenAI, Grok, Gemini, Claude, Kimi, and DeepSeek families on both question-generation validity and exam-taking accuracy.
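To make the protocol concrete, the following minimal Python sketch outlines the generate-judge-answer loop described above. The Model interface, the simple-majority validity vote, and all names are illustrative assumptions for exposition, not the authors' implementation.

```python
# Sketch of the MedEvalArena round-robin protocol (assumptions: Model interface,
# simple-majority validity vote, and whether examinees see self-authored items
# are not specified in the abstract and are chosen here for illustration).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    generate: Callable[[], dict]   # returns {"stem": ..., "options": ..., "answer": ...}
    judge: Callable[[dict], bool]  # True if the MCQ is judged valid
    answer: Callable[[dict], str]  # returns the chosen option label

def run_arena(models: list[Model], questions_per_model: int = 10) -> dict:
    """Each model authors MCQs, the remaining models adjudicate validity by
    majority vote, and every model then answers the validated exam."""
    validated: list[dict] = []
    validity = {m.name: [0, 0] for m in models}  # [valid, generated]
    for author in models:
        judges = [m for m in models if m is not author]
        for _ in range(questions_per_model):
            q = author.generate()
            validity[author.name][1] += 1
            votes = sum(j.judge(q) for j in judges)
            if votes > len(judges) / 2:  # simple majority (assumption)
                validity[author.name][0] += 1
                validated.append(q)
    accuracy = {}
    for examinee in models:
        correct = sum(examinee.answer(q) == q["answer"] for q in validated)
        accuracy[examinee.name] = correct / len(validated) if validated else float("nan")
    return {"validity_rate": {k: v[0] / v[1] for k, v in validity.items()},
            "accuracy": accuracy}
```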
Results:
Across frontier models, we observed no statistically significant differences in exam-taking performance, with mean accuracies of 85.7%-91.7%, suggesting convergence in medical reasoning ability across frontier LLMs for question-answering tasks. LLM accuracy was comparable to the mean human physician accuracy of 85.6% (95% CI: 79.4%-91.7%), and the differences were not statistically significant. We found significant differences between models in question validity rate, with higher validity rates for questions generated by OpenAI, Gemini, and Claude frontier models (83.3%-94.8%) than by Kimi, Grok, and DeepSeek models (46.0%-63.8%). When jointly considering accuracy and inference cost, multiple frontier models lie on the Pareto frontier, with no single model dominating on both dimensions.
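As a worked illustration of the accuracy-cost trade-off analysis, the sketch below identifies Pareto-optimal models under strict dominance. The function and the example values are hypothetical placeholders, not the reported results.

```python
# Illustrative Pareto-frontier check over (accuracy, inference cost).
# A model is dominated if some other model has strictly higher accuracy
# and strictly lower cost; non-dominated models lie on the frontier.
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """points maps model name -> (accuracy, cost_per_exam)."""
    frontier = []
    for name, (acc, cost) in points.items():
        dominated = any(a > acc and c < cost
                        for other, (a, c) in points.items() if other != name)
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical values for demonstration only:
example = {"model_a": (0.90, 2.0), "model_b": (0.88, 0.5), "model_c": (0.86, 1.5)}
print(pareto_frontier(example))  # ['model_a', 'model_b']; model_c is dominated
```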
Conclusions:
MedEvalArena provides an automated framework for benchmarking LLM medical reasoning and identifies valid question generation as a more discriminative task than question answering.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.