
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 20, 2024
Open Peer Review Period: Aug 6, 2024 - Oct 1, 2024
Date Accepted: Nov 26, 2024

The final, peer-reviewed published version of this preprint can be found here:

Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study

Yang Z, Yao Z, Tasmin M, Vashisht P, Jang WS, Ouyang F, Wang B, McManus D, Berlowitz D, Yu H


J Med Internet Res 2025;27:e65146

DOI: 10.2196/65146

PMID: 39919278

PMCID: 11845889

Beyond GPT-4V's High Accuracy on USMLE Questions: Observational Study Exposing Hidden Flaws in Clinical Image Interpretation

  • Zhichao Yang
  • Zonghai Yao
  • Mahbuba Tasmin
  • Parth Vashisht
  • Won Seok Jang
  • Feiyun Ouyang
  • Beining Wang
  • David McManus
  • Dan Berlowitz
  • Hong Yu

ABSTRACT

Background:

Recent advancements in artificial intelligence (AI), such as ChatGPT, have demonstrated significant potential, achieving strong scores on text-only United States Medical Licensing Examination (USMLE) questions and effectively answering questions from physicians. However, the ability of these models to interpret medical images is not well studied.

Objective:

This study aims to evaluate the accuracy and explanation quality of GPT-4V, a large language model with vision capabilities, on USMLE questions that include images, and to characterize the errors it makes when interpreting clinical images.

Methods:

We used multiple-choice questions with images from the USMLE to test GPT-4V's accuracy and explanation quality. GPT-4V's accuracy was compared to that of two state-of-the-art LLMs, ChatGPT and GPT-4. Explanation quality was evaluated across 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V's explanation ability, we modified a patient case report to resemble a typical "curbside consultation" between physicians.

Results:

GPT-4V achieved an overall accuracy of 90.7%, outperforming ChatGPT (58.4%) and GPT-4 (83.6%). On questions with images, GPT-4V achieved an accuracy of 62.0%, equivalent to the 70th-80th percentile among medical students. When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts. However, incorrect answers often had poor explanation quality: 18.2% contained fabricated text, 45.5% contained inferencing errors, and 76.3% demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40.5%. Nevertheless, in the curbside consultation setting, GPT-4V required continuous specialized guidance to reach partially correct diagnoses and subsequent examination recommendations.

Conclusions:

GPT-4V achieved high accuracy on multiple-choice questions with images. However, its explanation quality was poor when it answered incorrectly, and this issue could not be efficiently resolved through expert interaction in clinical practice. Our findings highlight the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.