
Accepted for/Published in: JMIR Medical Education

Date Submitted: Aug 5, 2024
Date Accepted: Dec 12, 2024

The final, peer-reviewed published version of this preprint can be found here:

Prazeres F

ChatGPT’s Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini

JMIR Med Educ 2025;11:e65108

DOI: 10.2196/65108

PMID: 40043219

PMCID: 11902880

Evaluating ChatGPT's Performance on Portuguese Medical Exam Questions: From Chatbot to Doctor?

  • Filipe Prazeres

ABSTRACT

Background:

Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness.

Objective:

This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical exam questions (the 2023 National Exam for Access to Specialized Training [Prova Nacional de Acesso, PNA]) and compares their performance to that of human candidates.

Methods:

ChatGPT-3.5 Turbo was tested on the first part of the exam (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, “Are you sure?” after providing an answer. Differences between the first and second responses of each model were analyzed using McNemar's test with continuity correction. A single-parameter t-test compared the models' performance to that of human candidates. Frequencies and percentages were used for categorical variables, and means and confidence intervals for numerical variables. Statistical significance was set at P<.05.
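The consistency analysis described above can be sketched as follows. McNemar's test with continuity correction depends only on the discordant pairs: answers that flipped from correct to incorrect (b) or from incorrect to correct (c) after the “Are you sure?” prompt. The counts in this sketch are hypothetical, not the study's data, and the helper name `mcnemar_cc` is an illustrative assumption rather than part of any library API.

```python
# Sketch of McNemar's test with Yates continuity correction, using only the
# discordant-pair counts from a paired before/after comparison.
from scipy.stats import chi2


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """Return (chi-square statistic, p-value) for McNemar's test.

    b: items correct on the first response but incorrect on the second
    c: items incorrect on the first response but correct on the second
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of change
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    p_value = chi2.sf(stat, df=1)  # upper tail, 1 degree of freedom
    return stat, p_value


# Hypothetical counts: 5 answers changed correct->incorrect, 2 incorrect->correct.
stat, p_value = mcnemar_cc(b=5, c=2)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

A nonsignificant p-value here would indicate that re-prompting did not systematically change the model's accuracy, which is how consistency is operationalized in the Methods.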

Results:

ChatGPT-4o mini achieved an accuracy rate of 64.9% on the 2023 PNA exam, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed the human medical candidates, whereas ChatGPT-3.5 Turbo performed at a more moderate level.

Conclusions:

This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research.



Per the author's request the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.