Accepted for/Published in: JMIR Medical Education
Date Submitted: Apr 9, 2023
Open Peer Review Period: Apr 9, 2023 - Apr 24, 2023
Date Accepted: Sep 5, 2023
Performance of ChatGPT on the Peruvian National Licensing Medical Examination: A Cross-sectional Study
ABSTRACT
Background:
ChatGPT (Chat Generative Pre-trained Transformer) has shown impressive performance on national licensing medical examinations such as the United States Medical Licensing Examination (USMLE), even passing with expert-level performance. However, there is a lack of research on its performance on the national licensing medical examinations (NLMEs) of low-income countries. In Peru, where almost one in three examinees fails the NLME, ChatGPT has the potential to enhance medical education.
Objective:
We aimed to assess the accuracy of ChatGPT, using GPT-3.5 and GPT-4, on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina; ENAM). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT.
Methods:
We used the ENAM 2022 dataset, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Several prompts were employed, and accuracy was evaluated. ChatGPT's performance was compared with that of a sample of 1025 examinees. Question type, Peru-specific knowledge, discrimination, difficulty, question quality, and subject were analyzed to determine their association with incorrect answers. To improve ChatGPT's performance, questions that received incorrect answers were reinput through a three-step process with different prompts, exploring whether adding a role and context to the prompt improves ChatGPT's accuracy.
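For readers who want to reproduce this kind of evaluation programmatically, a minimal sketch follows. It assumes access to the models through the OpenAI Python client rather than the ChatGPT interface described in the study; the model identifiers, system prompt, and helper function are illustrative assumptions, not the study's exact protocol.

# Minimal sketch of a programmatic MCQ evaluation (assumption: the study
# used the ChatGPT interface; this uses the OpenAI Python client instead).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical role-and-context prompt, in the spirit of the paper's
# reinput step that adds a role and context for ChatGPT.
SYSTEM_PROMPT = (
    "You are a Peruvian physician taking the ENAM, the Peruvian National "
    "Licensing Medical Examination. Answer with the single best option."
)

def ask(model: str, question: str, options: dict[str, str]) -> str:
    """Send one multiple-choice question and return the raw model reply."""
    formatted = question + "\n" + "\n".join(
        f"{letter}) {text}" for letter, text in options.items()
    )
    response = client.chat.completions.create(
        model=model,  # e.g., "gpt-3.5-turbo" or "gpt-4" (assumed model IDs)
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": formatted},
        ],
        temperature=0,  # favor reproducible answers for grading
    )
    return response.choices[0].message.content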
Results:
GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%; the 1025 examinees achieved 55%. There was fair agreement (kappa=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in both the crude and the adjusted models for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After incorrect answers were reinput, GPT-3.5's incorrect answers fell from 41 (100%) to 12 (29%), and GPT-4's from 25 (100%) to 4 (16%).
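As an illustration of the agreement statistic, a short sketch follows that computes Cohen's kappa between two answer sheets with scikit-learn; the answer letters below are placeholders, not the study's responses.

# Agreement between the two models' answer sheets (placeholder data;
# the paper reports kappa = 0.38, conventionally "fair" agreement).
from sklearn.metrics import cohen_kappa_score

gpt35_answers = ["A", "C", "B", "D", "A"]  # one letter per question
gpt4_answers  = ["A", "C", "D", "D", "B"]

kappa = cohen_kappa_score(gpt35_answers, gpt4_answers)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.21-0.40 is interpreted as fair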
Conclusions:
Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between GPT-3.5 and GPT-4. Question difficulty was associated with incorrect answers, which may resemble human performance. Furthermore, reinputting incorrectly answered questions with different prompts that added a role and context for ChatGPT improved its accuracy.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.