Currently submitted to: JMIR Formative Research
Date Submitted: Apr 26, 2026
Open Peer Review Period: May 4, 2026 - Jun 29, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Comparative performance of ChatGPT 5.1 thinking and Gemini Pro 3 thinking on the 2025 Israeli Orthopaedic In-Training Examination: a paired question-level analysis
ABSTRACT
Background:
Large language models (LLMs) are increasingly used in orthopaedic education, but their performance on the 2025 Israeli Orthopaedic In-Training Examination (OITE) remains poorly defined.
Objective:
To compare the performance of two contemporary reasoning-oriented LLMs on the 2025 Israeli OITE and to identify item-level factors associated with correctness.
Methods:
All 95 multiple-choice questions (MCQs) from the 2025 OITE were analyzed. ChatGPT 5.1 thinking and Gemini Pro 3 thinking each answered all questions once using a standardized prompt. The primary outcome was question-level correctness according to the official answer key. Questions were annotated for question type, question stem word count, image presence, and model-reported certainty. Model-specific factors associated with correctness were assessed using logistic regression, and head-to-head comparisons were performed using the exact McNemar test and paired generalized estimating equations (GEE).
Results:
Both models achieved identical overall accuracy (68/95, 71.6%), with no measurable difference on paired comparison (odds ratio [OR], 1.00; 95% CI, 0.58-1.73; P=1.000). Accuracy was substantially higher on questions without images than on image-containing questions (90.5% vs 56.6%). In model-specific multivariable analyses, image presence was independently associated with lower correctness for both ChatGPT (adjusted OR [aOR], 0.12; 95% CI, 0.04-0.40; P<.001) and Gemini (aOR, 0.08; 95% CI, 0.02-0.28; P<.001). For Gemini, longer question stem word count was independently associated with greater odds of correctness (aOR per 10 words, 1.47; 95% CI, 1.01-2.13; P=.042), and Application questions were associated with lower odds of correctness than Knowledge questions (aOR, 0.26; 95% CI, 0.08-0.85; P=.026). In the primary paired GEE model, no measurable difference in correctness was observed between Gemini and ChatGPT (OR, 1.00; 95% CI, 0.58-1.73; P=1.000), whereas Application questions (OR, 0.34; 95% CI, 0.14-0.84; P=.019) and image-containing questions (OR, 0.09; 95% CI, 0.03-0.24; P<.001) were associated with lower odds of correctness.
Conclusions:
ChatGPT 5.1 thinking and Gemini Pro 3 thinking demonstrated similar overall performance on the 2025 Israeli OITE, with identical crude accuracy and no measurable difference in paired comparative analyses. Both models were substantially less accurate on image-containing questions, and Application questions were also associated with lower correctness, particularly in Gemini-specific and paired analyses. These findings suggest that current LLMs may have value as adjunctive educational tools in orthopaedic knowledge domains, but they remain insufficiently reliable for unsupervised use, particularly in image-dependent and application-based settings.
Clinical Trial:
Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.