
Accepted for/Published in: JMIR AI

Date Submitted: Jun 1, 2025
Date Accepted: Jan 30, 2026
Date Submitted to PubMed: Feb 9, 2026

The final, peer-reviewed published version of this preprint can be found here:

Performance of 5 AI Models on United States Medical Licensing Examination Step 1 Questions: Comparative Observational Study


Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study

  • Dania El Natour
  • Mohamad Abou Alfa
  • Ahmad Chaaban
  • Reda Assi
  • Toufic Dally
  • Bahaa Bou Dargham

ABSTRACT

Background:

Artificial intelligence (AI) models are increasingly being used in medical education. Although models such as ChatGPT have previously demonstrated strong performance on United States Medical Licensing Examination (USMLE)-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across medical domains and question formats.

Objective:

To evaluate and compare the performance of five publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 Free 120 question set, assessing their accuracy, consistency, and performance across question types and medical subjects.

Methods:

This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (the original 120-question set minus one audio-based item) was presented to each AI model using a standardized prompt cycle, and each model answered every question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was performed using chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.
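As a rough illustration of the pairwise testing described above, the sketch below applies Fisher's exact test to each pair of models with a Bonferroni-adjusted significance threshold. The correct/incorrect counts are reconstructed from the overall percentages reported in the Results and are illustrative only; the authors' actual analysis code is not part of this abstract.

```python
# Hypothetical sketch: pairwise Fisher's exact tests with Bonferroni
# correction, mirroring the analysis described in Methods. Counts are
# back-calculated from the reported accuracies on 119 questions.
from itertools import combinations
from scipy.stats import fisher_exact

# (correct, incorrect) counts per model on the 119-question set
counts = {
    "Grok":      (109, 10),   # 91.6%
    "Copilot":   (101, 18),   # 84.9%
    "Gemini":    (100, 19),   # 84.0%
    "ChatGPT-4": (95, 24),    # 79.8%
    "DeepSeek":  (86, 33),    # 72.3%
}

pairs = list(combinations(counts, 2))   # 10 pairwise comparisons
alpha = 0.05 / len(pairs)               # Bonferroni-adjusted threshold

for a, b in pairs:
    table = [counts[a], counts[b]]      # 2x2 contingency table
    _, p = fisher_exact(table)
    flag = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: p = {p:.4f} ({flag})")
```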

Results:

Grok achieved the highest overall accuracy (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower overall score reflected its inability to process visual media, which resulted in 0% accuracy on image-based items; when limited to text-only questions (n = 96), its accuracy rose to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based (89.7%) questions, with a statistically significant difference between Grok and DeepSeek on case-based items (p = .011). Across subjects, the models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency across repeated responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% by the third attempt.
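For the consistency figures above, a minimal way to score per-model agreement across the three attempts described in the Methods could look like the following sketch (data layout and names are assumptions, not the study's code):

```python
# Hypothetical sketch: per-question consistency across three attempts.
# A question counts as "consistent" when all three responses agree.
def consistency_rate(attempts: list[tuple[str, str, str]]) -> float:
    """attempts: one (answer1, answer2, answer3) tuple per question."""
    consistent = sum(1 for trio in attempts if len(set(trio)) == 1)
    return consistent / len(attempts)

# e.g. a model that changes its answer on 2 of 119 questions:
demo = [("A", "A", "A")] * 117 + [("B", "C", "B"), ("D", "D", "E")]
print(f"{consistency_rate(demo):.1%}")  # 98.3%
```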

Conclusions:

AI models showed varying strengths across domains, with Grok emerging as the most accurate and consistent performer, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot show greater promise for integration into medical education. Continuous evaluation is essential as AI tools rapidly evolve.


Citation

Please cite as:

El Natour D, Abou Alfa M, Chaaban A, Assi R, Dally T, Bou Dargham B

Performance of 5 AI Models on United States Medical Licensing Examination Step 1 Questions: Comparative Observational Study

JMIR AI 2026;5:e76928

DOI: 10.2196/76928

PMID: 41662695


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.