Currently submitted to: JMIR Formative Research
Date Submitted: Apr 26, 2026
Open Peer Review Period: May 4, 2026 - Jun 29, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Comparative performance of ChatGPT 5.1 thinking and Gemini Pro 3 thinking on the 2025 Israeli Orthopaedic In-Training Examination: a paired question-level analysis
ABSTRACT
Background:
Large language models (LLMs) are increasingly used in orthopaedic education, but their performance on the 2025 Israeli Orthopaedic In-Training Examination (OITE) remains poorly defined.
Objective:
To compare the performance of two contemporary reasoning-oriented LLMs on the 2025 Israeli OITE and to identify item-level factors associated with correctness.
Methods:
All 95 multiple-choice questions (MCQs) from the 2025 OITE were analyzed. ChatGPT 5.1 thinking and Gemini Pro 3 thinking each answered all questions once using a standardized prompt. The primary outcome was question-level correctness according to the official answer key. Questions were annotated for question type, question stem word count, image presence, and model-reported certainty. Model-specific factors associated with correctness were assessed using logistic regression, and head-to-head comparisons were performed using the exact McNemar test and paired generalized estimating equations (GEE).
Results:
Both models achieved identical overall accuracy (68/95, 71.6%), with no measurable difference on paired comparison (odds ratio [OR], 1.00; 95% CI, 0.58-1.73; P=1.000). Accuracy was substantially higher on questions without images than on image-containing questions (90.5% vs 56.6%). In model-specific multivariable analyses, image presence was independently associated with lower correctness for both ChatGPT (adjusted OR [aOR], 0.12; 95% CI, 0.04-0.40; P<.001) and Gemini (aOR, 0.08; 95% CI, 0.02-0.28; P<.001). For Gemini, longer question stem word count was independently associated with greater odds of correctness (aOR per 10 words, 1.47; 95% CI, 1.01-2.13; P=.042), and Application questions were associated with lower odds of correctness than Knowledge questions (aOR, 0.26; 95% CI, 0.08-0.85; P=.026). In the primary paired GEE model, no measurable difference in correctness was observed between Gemini and ChatGPT (OR, 1.00; 95% CI, 0.58-1.73; P=1.000), whereas Application questions (OR, 0.34; 95% CI, 0.14-0.84; P=.019) and image-containing questions (OR, 0.09; 95% CI, 0.03-0.24; P<.001) were associated with lower odds of correctness.
Conclusions:
ChatGPT 5.1 thinking and Gemini Pro 3 thinking demonstrated similar overall performance on the 2025 Israeli OITE, with identical crude accuracy and no measurable difference in paired comparative analyses. Both models were substantially less accurate on image-containing questions, and Application questions were also associated with lower correctness, particularly in Gemini-specific and paired analyses. These findings suggest that current LLMs may have value as adjunctive educational tools in orthopaedic knowledge domains, but they remain insufficiently reliable for unsupervised use, particularly in image-dependent and application-based settings.
Clinical Trial:
Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.