
Accepted for/Published in: JMIR Medical Education

Date Submitted: Jun 13, 2024
Date Accepted: Nov 23, 2024

The final, peer-reviewed published version of this preprint can be found here:

Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions

Miyazaki Y, Hata M, Omori H, Hirashima A, Nakagawa Y, Eto M, Takahashi S, Ikeda M

Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions

JMIR Med Educ 2024;10:e63129

DOI: 10.2196/63129

PMID: 39718557

PMCID: 11687171

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Performance and Errors of ChatGPT-4o on the Japanese Medical Licensing Examination: Solving All Questions Including Images with Over 90% Accuracy

  • Yuki Miyazaki; 
  • Masahiro Hata; 
  • Hisaki Omori; 
  • Atsuya Hirashima; 
  • Yuta Nakagawa; 
  • Mitsuhiro Eto; 
  • Shun Takahashi; 
  • Manabu Ikeda

ABSTRACT

Background:

Recent advancements in AI technology have begun to play a crucial role in medical education. AI models, such as ChatGPT, have shown promise in various applications, including answering medical questions and assisting in clinical decision-making. However, there is limited research on the performance of these models on comprehensive medical licensing exams.

Objective:

This study aims to evaluate the performance of ChatGPT-4o on the 118th Japanese Medical Licensing Examination (JMLE), specifically assessing its ability to handle both text and image-based questions, and to analyze the types of errors it makes.

Methods:

ChatGPT-4o was used to complete all 400 questions of the 118th JMLE, held in February 2024. The model, updated with data up to May 13, 2023, was assessed on its ability to answer both text-only and image-based questions. Questions were input directly into the chat interface without prompt engineering or memory functions. Because of the daily response limit of ChatGPT-4o, the study was conducted from May 13 to May 19, 2024. An independent samples t-test compared the correct response rates between image-based and text-only questions. Statistical significance was set at P<.05 for all two-tailed tests.
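The comparison described above can be sketched as follows. This is a minimal illustration only: the per-question correctness data and the exact split of the 400 questions into image-based and text-only items are not given in the abstract, so the counts and outcomes below are simulated assumptions, not the study data.

```python
# Illustrative sketch of an independent samples t-test comparing
# correct-response rates between image-based and text-only questions.
# The question counts (92 vs 308) and binary outcomes are simulated
# placeholders, not the actual study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Binary outcomes per question: 1 = correct, 0 = incorrect.
image_based = rng.binomial(1, 0.93, size=92)   # assumed number of image questions
text_only = rng.binomial(1, 0.93, size=308)    # assumed number of text questions

t_stat, p_value = stats.ttest_ind(image_based, text_only)
print(f"t = {t_stat:.3f}, P = {p_value:.3f}")
```

With rates this close, the test would typically fail to reject the null hypothesis of equal correct-response rates, consistent in direction with the nonsignificant result the study reports.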

Results:

ChatGPT-4o achieved an overall correct response rate of 93.25%: 93.48% for image-based questions and 93.18% for text-only questions. The difference in correct response rates between text-only and image-based questions was not statistically significant (t=-0.074, P=.941). Errors were classified into four categories: diagnostic errors, logical errors, medical knowledge errors, and reading comprehension errors.

Discussion:

ChatGPT-4o demonstrated high proficiency in both text-only and image-based questions, marking a significant improvement over previous iterations of GPT models. This performance meets the passing criteria set by the Ministry of Health, Labour and Welfare for the JMLE: a total score of at least 160/200 points on compulsory questions, at least 230/300 points on non-compulsory questions, and no more than 3 incorrect choices among critical exclusion questions. Although ChatGPT-4o met the overall passing criteria, some responses indicated potentially problematic clinical judgments, such as incorrect triage decisions and prioritization errors in clinical scenarios. These findings underscore the need for improved clinical judgment capabilities in AI models.

Conclusions:

ChatGPT-4o successfully met the passing criteria for the 118th JMLE, demonstrating high proficiency in handling both text-only and image-based questions. This marks a significant improvement over previous iterations of GPT models, particularly in managing multimodal tasks. The model excelled in answering specific medical knowledge questions, indicating a strong grasp of medical facts and concepts. However, it struggled with clinical judgment and prioritization, as evidenced by errors in triage decisions and the selection of appropriate diagnostic procedures. These findings highlight the need for continued enhancement of AI models to ensure their reliability and accuracy in clinical decision-making. While generative AI like ChatGPT-4o shows great potential, understanding and addressing its limitations will be critical for its effective integration into medical education and practice.


 Citation

Please cite as:

Miyazaki Y, Hata M, Omori H, Hirashima A, Nakagawa Y, Eto M, Takahashi S, Ikeda M

Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions

JMIR Med Educ 2024;10:e63129

DOI: 10.2196/63129

PMID: 39718557

PMCID: 11687171


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.