
Accepted for/Published in: JMIR Medical Education

Date Submitted: Aug 5, 2024
Date Accepted: Dec 12, 2024

The final, peer-reviewed published version of this preprint can be found here:

Prazeres F

ChatGPT’s Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini

JMIR Med Educ 2025;11:e65108

DOI: 10.2196/65108

PMID: 40043219

PMCID: 11902880

Evaluating ChatGPT's Performance on Portuguese Medical Exam Questions: From Chatbot to Doctor?

  • Filipe Prazeres

ABSTRACT

Background:

Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness.

Objective:

This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical exam questions (the 2023 National Exam for Access to Specialized Training [Prova Nacional de Acesso, PNA]) and compares their performance to that of human candidates.

Methods:

ChatGPT-3.5 Turbo was tested on the first part of the exam (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, “Are you sure?” after providing an answer. Differences between the first and second responses of each model were analyzed using McNemar's test with continuity correction. A single-parameter t-test compared the models' performance to that of human candidates. Frequencies and percentages were used for categorical variables, and means and confidence intervals for numerical variables. Statistical significance was set at P<.05.
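The consistency analysis described above can be sketched as follows. McNemar's test with continuity correction depends only on the discordant pairs: answers that flipped from correct to incorrect (b) or from incorrect to correct (c) after the “Are you sure?” prompt. The counts in this sketch are hypothetical, not the study's data, and the helper name `mcnemar_cc` is an illustrative assumption rather than part of any library API.

```python
# Sketch of McNemar's test with Yates continuity correction, using only the
# discordant-pair counts from a paired before/after comparison.
from scipy.stats import chi2


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """Return (chi-square statistic, p-value) for McNemar's test.

    b: items correct on the first response but incorrect on the second
    c: items incorrect on the first response but correct on the second
    """
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of change
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    p_value = chi2.sf(stat, df=1)  # upper tail, 1 degree of freedom
    return stat, p_value


# Hypothetical counts: 5 answers changed correct->incorrect, 2 incorrect->correct.
stat, p_value = mcnemar_cc(b=5, c=2)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

A nonsignificant p-value here would indicate that re-prompting did not systematically change the model's accuracy, which is how consistency is operationalized in the Methods.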

Results:

ChatGPT-4o mini achieved an accuracy rate of 64.9% on the 2023 PNA exam, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed the human medical candidates, whereas ChatGPT-3.5 Turbo performed at a more moderate level.

Conclusions:

This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research.



Per the author's request the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.