Accepted for/Published in: JMIR Medical Education
Date Submitted: Dec 20, 2025
Date Accepted: May 8, 2026
Performance of Large Language Models on the Brazilian National Medical Education Examination (ENAMED): Comparative Benchmark Study
ABSTRACT
Background:
Large language models (LLMs) show potential for clinical decision support, but current evaluations rely heavily on Anglophone benchmarks, which limits how well their findings transfer to specific healthcare contexts such as Brazil's.
Objective:
To compare the performance of frontier generalist LLMs and a specialized model (Charcot) on the 2026 Brazilian National Examination for Medical Education (ENAMED), assessing accuracy, response times, and collective error patterns.
Methods:
This observational study evaluated ten LLMs, including GPT-5, Gemini 2.5 Pro, and the Brazilian-specialized model Charcot. Each model completed the ENAMED 2026 examination (99 valid items) in five independent runs, with the order of questions and answer alternatives randomized. The primary outcome was mean accuracy against the official answer key. Secondary outcomes were Normalized Mean Response Time (NMRT) and Convergence Error (CE), defined as a collective bias in which at least three generalist models consistently selected the same incorrect alternative. Rationales were analyzed qualitatively for questions exhibiting high convergence or clinical relevance.
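As an illustration only, the two outcome definitions above could be computed along the following lines. This is a minimal sketch, not the authors' pipeline: the function names, the data layout, and the reading of "consistently" as "the same wrong alternative in every run" are assumptions. It also assumes that answers have already been mapped back from their shuffled positions to the original alternative labels.

```python
from collections import Counter


def mean_accuracy(runs, answer_key):
    """Average per-run accuracy for one model.

    runs: list of dicts mapping question id -> chosen alternative
          (labels already un-shuffled; hypothetical layout).
    answer_key: dict mapping question id -> official alternative.
    """
    per_run = [
        sum(run[q] == answer_key[q] for q in answer_key) / len(answer_key)
        for run in runs
    ]
    return sum(per_run) / len(per_run)


def convergence_errors(model_runs, answer_key, min_models=3):
    """Flag questions where several models agree on the same wrong answer.

    model_runs: dict mapping model name -> list of run dicts.
    Assumption: a model "consistently" selects an alternative only if it
    picks that alternative in every one of its runs.
    Returns a dict mapping question id -> the shared incorrect alternative.
    """
    flagged = {}
    for q, correct in answer_key.items():
        votes = Counter()
        for runs in model_runs.values():
            choices = {run[q] for run in runs}
            if len(choices) == 1:  # same choice in all runs
                (choice,) = choices
                if choice != correct:
                    votes[choice] += 1
        for alternative, n_models in votes.items():
            if n_models >= min_models:
                flagged[q] = alternative
    return flagged
```

In this sketch, only the generalist models would be passed in `model_runs`, with `min_models=3` matching the threshold stated above. NMRT is not sketched because its normalization is not specified here.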
Results:
Nine models exceeded 85% accuracy. The specialized model, Charcot, achieved the highest mean accuracy (96.96%), significantly outperforming the top generalist models, GPT-5 (94.34%) and Gemini 2.5 Pro (93.94%) (P < .001). Charcot demonstrated superior performance in items requiring knowledge of specific Brazilian guidelines. The CE analysis revealed that generalist models often converged on incorrect answers in domains such as tuberculosis and prenatal care, whereas the specialized model aligned with local protocols. Conversely, model consensus correctly identified an inconsistency in the official answer key regarding indigenous health. No significant correlation was found between response time and global accuracy.
Conclusions:
Domain specialization in the Portuguese language and Brazilian medical context confers a measurable advantage in complex medical tasks, reducing errors derived from training biases present in generalist models. While frontier models demonstrate near-human or superhuman performance on multiple-choice questions, the persistence of collective errors highlights the need for continuous expert supervision. Furthermore, the consensus among models suggests their potential utility as auditing tools for validating high-stakes medical examinations.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.