Accepted for/Published in: JMIR Medical Education
Date Submitted: Apr 3, 2025
Date Accepted: Nov 25, 2025
GPT-4o and OpenAI o1 Performance on the 2024 Spanish Competitive Medical Specialty Access Exam: Cross-Sectional Quantitative Evaluation
ABSTRACT
Background:
In recent years, generative artificial intelligence and large language models (LLMs) have rapidly advanced, offering significant potential to transform medical education. Several studies have evaluated the performance of chatbots on multiple-choice medical exams.
Objective:
The study aims to assess the performance of two LLMs, GPT-4o and OpenAI o1, on the Médico Interno Residente (MIR) 2024 exam, the Spanish national medical test that determines eligibility for competitive medical specialist training positions.
Methods:
A total of 176 questions from the MIR 2024 exam were analyzed. Each question was presented individually to the chatbots to ensure independence and prevent memory retention bias. No additional prompts were introduced, to minimize potential bias. For each LLM, response consistency under verification prompting was assessed by systematically asking, "Are you sure?" after each response. Accuracy was defined as the percentage of responses matching the official answers provided by the Spanish Ministry of Health. It was assessed for GPT-4o, OpenAI o1 and, as benchmarks, for a consensus of medical specialists and for the average MIR candidate. Sub-analyses included performance across different medical subjects, question difficulty (quintiles based on the percentage of examinees correctly answering each question), and question types (clinical cases versus theoretical questions; positive versus negative questions).
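The scoring procedure described above can be sketched in a few lines of code. The sketch below is illustrative only: the answer data are invented, and a fixed-width 20-point binning is assumed for the difficulty quintiles (the study may instead have used empirical quintiles of the examinee-accuracy distribution).

```python
# Hypothetical sketch of the scoring pipeline: compare model answers
# to the official answer key, re-score after a verification pass,
# and bin questions into difficulty quintiles. All answer data here
# are illustrative, not the study's actual responses.

def accuracy(model_answers, official_key):
    """Percentage of answers matching the official key."""
    correct = sum(m == k for m, k in zip(model_answers, official_key))
    return 100 * correct / len(official_key)

def difficulty_quintile(pct_examinees_correct):
    """Quintile 1 = hardest (fewest examinees correct), 5 = easiest.
    Fixed 20-point bins are assumed here for simplicity."""
    return min(int(pct_examinees_correct // 20) + 1, 5)

key      = ["A", "C", "B", "D", "A"]          # official answers
initial  = ["A", "C", "B", "B", "A"]          # first responses
verified = ["A", "C", "B", "D", "A"]          # after "Are you sure?"

print(accuracy(initial, key))     # 80.0
print(accuracy(verified, key))    # 100.0
print(difficulty_quintile(56.6))  # 3
```

Scoring each question independently in this way mirrors the study's design of presenting questions one at a time, so a verification pass can change the per-question result without any carry-over between items.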
Results:
Overall accuracy was 158/176 (89.8%) for GPT-4o (160/176, 90.9%, after verification prompting), 163/176 (92.6%) for OpenAI o1 (164/176, 93.2%, after verification prompting), 166/176 (94.3%) for the consensus of medical specialists, and 100/176 (56.8%) for the average MIR candidate. Both LLMs and the consensus of medical specialists outperformed the average MIR candidate across all 20 medical subjects analyzed, with LLM accuracy of ≥80% in most domains. A performance gradient was observed: LLM accuracy declined gradually as question difficulty increased. Accuracy was slightly higher for clinical cases than for theoretical questions, and for positive questions than for negative ones. Both models demonstrated high response consistency, with near-perfect agreement between initial responses and those given after verification prompting.
Conclusions:
These findings highlight the excellent performance of GPT-4o and OpenAI o1 on the MIR 2024 exam, demonstrating consistent accuracy across medical subjects and question types. The integration of LLMs into medical education presents promising opportunities and is likely to reshape how students prepare for licensing exams and change our understanding of medical education. Further research should explore how wording, language, prompting techniques, and image-based questions influence LLM accuracy, as well as evaluate the performance of emerging artificial intelligence (AI) models in similar assessments.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.