Accepted for/Published in: JMIR AI
Date Submitted: Jun 1, 2025
Date Accepted: Jan 30, 2026
Date Submitted to PubMed: Feb 9, 2026
Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study
ABSTRACT
Background:
Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.
Objective:
To evaluate and compare the performance of five publicly available AI models—Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek—on the USMLE Step 1 Free 120-question set, assessing their accuracy, consistency, and performance across question types and medical subjects.
Methods:
This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of 119 USMLE-style questions (the single audio-based item was excluded) was presented to each AI model using a standardized prompt cycle. Each model answered every question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was performed using chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.
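Although the abstract does not include the analysis code, a minimal sketch of the pairwise comparison step might look like the following. The correct/incorrect counts are back-calculated from the reported overall accuracies (n = 119), and the functions are SciPy's; treat the block as illustrative, not the study's actual pipeline.

```python
# Illustrative sketch of pairwise model comparison with Bonferroni adjustment.
# Counts are back-calculated from the reported overall accuracies (n = 119);
# this is not the study's actual analysis code.
from itertools import combinations
from scipy.stats import chi2_contingency, fisher_exact

results = {  # model -> (correct, incorrect) out of 119 questions
    "Grok": (109, 10), "Copilot": (101, 18), "Gemini": (100, 19),
    "ChatGPT-4": (95, 24), "DeepSeek": (86, 33),
}

pairs = list(combinations(results, 2))
alpha_adj = 0.05 / len(pairs)  # Bonferroni: divide alpha by the 10 pairwise tests

for a, b in pairs:
    table = [results[a], results[b]]
    chi2, p, dof, expected = chi2_contingency(table)
    if (expected < 5).any():  # small expected cell counts: use Fisher's exact test
        _, p = fisher_exact(table)
    print(f"{a} vs {b}: p = {p:.4f}, significant = {p < alpha_adj}")
```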
Results:
Grok achieved the highest overall accuracy (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower overall score reflected its inability to process visual media, which resulted in 0% accuracy on image-based items. When limited to text-only questions (n = 96), DeepSeek's accuracy rose to 89.6%, matching Copilot's. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with a statistically significant difference between Grok and DeepSeek on case-based items (p = .011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency across repeated responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% by the third attempt.
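As a hedged illustration of how the consistency and self-correction figures above could be derived from three attempts per question (the record format and sample answers are hypothetical, not study data):

```python
# Hypothetical example: per-model consistency and third-attempt accuracy
# from three answers per question. Values are made up for illustration.
attempts = {   # question id -> the model's three answers
    "q1": ["B", "B", "B"],   # fully consistent
    "q2": ["A", "C", "C"],   # self-corrected on later attempts
}
answer_key = {"q1": "B", "q2": "C"}

n = len(attempts)
consistency = sum(len(set(a)) == 1 for a in attempts.values()) / n
third_accuracy = sum(ans[2] == answer_key[q] for q, ans in attempts.items()) / n

print(f"consistency = {consistency:.1%}, third-attempt accuracy = {third_accuracy:.1%}")
```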
Conclusions:
AI models showed varying strengths across domains, with Grok emerging as the most accurate and consistent performer, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot show greater promise for integration into medical education. Continuous evaluation is essential as AI tools rapidly evolve.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.