Accepted for/Published in: JMIR Medical Education

Date Submitted: Dec 20, 2025
Date Accepted: May 8, 2026

The final, peer-reviewed published version of this preprint can be found here:

Performance of Large Language Models on the Brazilian National Medical Education Examination: Comparative Benchmark Study

Fernandes da Silva FdL, Roeder EA, Bruneti Severino JV, Nespolo Berger M, Basei de Paula PA, Ferreira D, Han Veiga M, de Moraes TP, Lenci Marques G

JMIR Med Educ 2026;12:e89839

DOI: 10.2196/89839

PMID: 42213480

Performance of Large Language Models on the Brazilian National Medical Education Examination (ENAMED): Comparative Benchmark Study

  • Francys de Luca Fernandes da Silva
  • Eduardo Augusto Roeder
  • João Victor Bruneti Severino
  • Matheus Nespolo Berger
  • Pedro Angelo Basei de Paula
  • Davi Ferreira
  • Maria Han Veiga
  • Thyago Proença de Moraes
  • Gustavo Lenci Marques

ABSTRACT

Background:

Large language models (LLMs) show potential for clinical decision support, but current evaluations rely heavily on Anglophone benchmarks, limiting their applicability in specific healthcare contexts like Brazil.

Objective:

To compare the performance of frontier generalist LLMs and a specialized model (Charcot) on the 2026 Brazilian National Examination for Medical Education (ENAMED), assessing accuracy, response times, and collective error patterns.

Methods:

This observational study evaluated ten LLMs, including GPT-5, Gemini 2.5 Pro, and the Brazilian-specialized model Charcot. The models completed the ENAMED 2026 examination (99 valid items) across five independent runs with randomized question and alternative ordering. The primary outcome was mean accuracy compared to the official answer key. Secondary outcomes included Normalized Mean Response Time (NMRT) and Convergence Error (CE), defined as a collective bias in which at least three generalist models consistently selected the same incorrect alternative. Qualitative analysis of rationales was performed for questions exhibiting high convergence or clinical relevance.
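To make the Convergence Error definition concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each model's answers have already been reduced to one alternative per question (for example, the modal choice across the five runs) and mapped back to the original alternative labels after shuffling. The abstract does not specify how "consistently" was operationalized, so that reduction step is an assumption, and the function name, model names, and data are hypothetical.

```python
from collections import Counter


def convergence_errors(answers, answer_key, generalists, min_models=3):
    """Flag questions where several generalist models share one wrong answer.

    answers:     {model_name: {question_id: chosen_alternative}} (hypothetical layout)
    answer_key:  {question_id: correct_alternative}
    generalists: names of the generalist models to include
    min_models:  threshold of agreeing models (three, per the abstract)
    """
    flagged = {}
    for qid, correct in answer_key.items():
        # Count only the incorrect alternatives chosen by generalist models.
        wrong = Counter(
            answers[m][qid]
            for m in generalists
            if qid in answers[m] and answers[m][qid] != correct
        )
        if not wrong:
            continue
        alternative, count = wrong.most_common(1)[0]
        if count >= min_models:
            flagged[qid] = (alternative, count)
    return flagged


# Toy data, invented for illustration only.
answer_key = {"q1": "A", "q2": "C"}
answers = {
    "model_a": {"q1": "B", "q2": "C"},
    "model_b": {"q1": "B", "q2": "C"},
    "model_c": {"q1": "B", "q2": "D"},
    "model_d": {"q1": "A", "q2": "C"},
}
result = convergence_errors(answers, answer_key, list(answers))
print(result)
```

In this toy case only `q1` is flagged, because three models agree on the same incorrect alternative, while the single wrong answer on `q2` falls below the threshold.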

Results:

Nine models exceeded 85% accuracy. The specialized model, Charcot, achieved the highest mean accuracy (96.96%), significantly outperforming the top generalist models, GPT-5 (94.34%) and Gemini 2.5 Pro (93.94%) (P < .001). Charcot demonstrated superior performance in items requiring knowledge of specific Brazilian guidelines. The CE analysis revealed that generalist models often converged on incorrect answers in domains such as tuberculosis and prenatal care, whereas the specialized model aligned with local protocols. Conversely, model consensus correctly identified an inconsistency in the official answer key regarding indigenous health. No significant correlation was found between response time and global accuracy.

Conclusions:

Domain specialization in the Portuguese language and Brazilian medical context confers a measurable advantage in complex medical tasks, reducing errors derived from training biases present in generalist models. While frontier models demonstrate near-human or superhuman performance on multiple-choice questions, the persistence of collective errors highlights the need for continuous expert supervision. Furthermore, the consensus among models suggests their potential utility as auditing tools for validating high-stakes medical examinations.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.