
Accepted for/Published in: JMIR Medical Education

Date Submitted: Jun 13, 2024
Date Accepted: Nov 23, 2024

The final, peer-reviewed published version of this preprint can be found here:

Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions

Miyazaki Y, Hata M, Omori H, Hirashima A, Nakagawa Y, Eto M, Takahashi S, Ikeda M

Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions

JMIR Med Educ 2024;10:e63129

DOI: 10.2196/63129

PMID: 39718557

PMCID: 11687171

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Performance and Errors of ChatGPT-4o on the Japanese Medical Licensing Examination: Solving All Questions Including Images with Over 90% Accuracy

  • Yuki Miyazaki; 
  • Masahiro Hata; 
  • Hisaki Omori; 
  • Atsuya Hirashima; 
  • Yuta Nakagawa; 
  • Mitsuhiro Eto; 
  • Shun Takahashi; 
  • Manabu Ikeda

ABSTRACT

Background:

Recent advancements in AI technology have begun to play a crucial role in medical education. AI models, such as ChatGPT, have shown promise in various applications, including answering medical questions and assisting in clinical decision-making. However, there is limited research on the performance of these models on comprehensive medical licensing exams.

Objective:

This study aims to evaluate the performance of ChatGPT-4o on the 118th Japanese Medical Licensing Examination (JMLE), specifically assessing its ability to handle both text and image-based questions, and to analyze the types of errors it makes.

Methods:

ChatGPT-4o was used to complete all 400 questions of the 118th JMLE, held in February 2024. The model, updated with data up to May 13, 2023, was assessed on its ability to answer both text-only and image-based questions. Questions were input directly into the chat interface without prompt engineering or memory functions. Because of the daily response limit of ChatGPT-4o, the study was conducted from May 13 to May 19, 2024. An independent samples t-test compared the correct response rates between image-based and text-only questions. Statistical significance was set at P<.05 for all two-tailed tests.
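The comparison described above can be sketched as follows. This is a minimal illustration only: the per-question correctness data and the exact split of the 400 questions into image-based and text-only items are not given in the abstract, so the counts and outcomes below are simulated assumptions, not the study data.

```python
# Illustrative sketch of an independent samples t-test comparing
# correct-response rates between image-based and text-only questions.
# The question counts (92 vs 308) and binary outcomes are simulated
# placeholders, not the actual study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Binary outcomes per question: 1 = correct, 0 = incorrect.
image_based = rng.binomial(1, 0.93, size=92)   # assumed number of image questions
text_only = rng.binomial(1, 0.93, size=308)    # assumed number of text questions

t_stat, p_value = stats.ttest_ind(image_based, text_only)
print(f"t = {t_stat:.3f}, P = {p_value:.3f}")
```

With rates this close, the test would typically fail to reject the null hypothesis of equal correct-response rates, consistent in direction with the nonsignificant result the study reports.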

Results:

ChatGPT-4o achieved an overall correct response rate of 93.25%: 93.48% for image-based questions and 93.18% for text-only questions. The difference in correct response rates between text-only and image-based questions was not statistically significant (t=-0.074, P=.941). Errors were classified into four categories: diagnostic errors, logical errors, medical knowledge errors, and reading comprehension errors.

Discussion:

ChatGPT-4o demonstrated high proficiency in both text-only and image-based questions, marking a significant improvement over previous iterations of GPT models. This performance meets the passing criteria set by the Ministry of Health, Labour and Welfare for the JMLE: a total score of at least 160/200 points on compulsory questions, at least 230/300 points on non-compulsory questions, and no more than 3 incorrect choices among critical exclusion questions. Although ChatGPT-4o met the overall passing criteria, some responses indicated potentially problematic clinical judgments, such as incorrect triage decisions and prioritization errors in clinical scenarios. These findings underscore the need for improved clinical judgment capabilities in AI models.

Conclusions:

ChatGPT-4o successfully met the passing criteria for the 118th JMLE, demonstrating high proficiency in handling both text-only and image-based questions. This marks a significant improvement over previous iterations of GPT models, particularly in managing multimodal tasks. The model excelled in answering specific medical knowledge questions, indicating a strong grasp of medical facts and concepts. However, it struggled with clinical judgment and prioritization, as evidenced by errors in triage decisions and the selection of appropriate diagnostic procedures. These findings highlight the need for continued enhancement of AI models to ensure their reliability and accuracy in clinical decision-making. While generative AI like ChatGPT-4o shows great potential, understanding and addressing its limitations will be critical for its effective integration into medical education and practice.


 Citation

Please cite as:

Miyazaki Y, Hata M, Omori H, Hirashima A, Nakagawa Y, Eto M, Takahashi S, Ikeda M

Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions

JMIR Med Educ 2024;10:e63129

DOI: 10.2196/63129

PMID: 39718557

PMCID: 11687171


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.