Accepted for/Published in: JMIR Medical Education
Date Submitted: Aug 5, 2024
Date Accepted: Dec 12, 2024
Evaluating ChatGPT's Performance on Portuguese Medical Exam Questions: From Chatbot to Doctor?
ABSTRACT
Background:
Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness.
Objective:
This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical exam questions (2023 National Exam for Access to Specialized Training - PNA) and compares their performance to human candidates.
Methods:
ChatGPT-3.5 Turbo was tested on the first part of the exam (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, “Are you sure?” after providing an answer. Differences between the first and second responses of each model were analyzed using McNemar's test with continuity correction. A one-sample t-test compared the models' performance to that of human candidates. Categorical variables were summarized as frequencies and percentages, and numerical variables as means and confidence intervals. Statistical significance was set at p < 0.05.
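The two tests named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names `mcnemar_continuity` and `one_sample_t` are hypothetical, and the inputs in the usage note are made-up placeholders, not study data. For McNemar's test, `b` and `c` are assumed to be the discordant counts (answers that were correct before "Are you sure?" and incorrect after, and the reverse). For the t-test, the sample is assumed to be the human candidates' scores and the reference value the model's score.

```python
import math
from statistics import mean, stdev


def mcnemar_continuity(b: int, c: int) -> tuple[float, float]:
    """McNemar's chi-squared test with continuity correction.

    b: questions correct on the first response, incorrect on the second.
    c: questions incorrect on the first response, correct on the second.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    if b + c == 0:
        # No discordant pairs: nothing to test. Returning p = 1.0 is a
        # convention chosen for this sketch.
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-squared variable with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def one_sample_t(sample: list[float], reference: float) -> tuple[float, int]:
    """One-sample t statistic of `sample` against a fixed `reference` value.

    Returns (t, degrees_of_freedom). Converting t to a p-value needs the
    Student t distribution, which the standard library does not provide.
    """
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    t = (mean(sample) - reference) / standard_error
    return t, n - 1
```

A call such as `mcnemar_continuity(10, 2)` returns the corrected statistic and its p-value, which would then be compared with the 0.05 threshold. For the t-test, a statistics package such as SciPy (`scipy.stats.ttest_1samp`) is one way to obtain the p-value directly.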
Results:
ChatGPT-4o mini achieved an accuracy rate of 64.9% on the 2023 PNA exam, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had more moderate performance.
Conclusions:
This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research.
Citation
Per the author's request the PDF is not available.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.