Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Aug 20, 2024
Open Peer Review Period: Aug 6, 2024 - Oct 1, 2024
Date Accepted: Nov 26, 2024
Beyond GPT-4V's High Accuracy on USMLE Questions: Observational Study Exposing Hidden Flaws in Clinical Image Interpretation
ABSTRACT
Background:
Recent advancements in artificial intelligence (AI), such as ChatGPT, have demonstrated significant potential, achieving strong scores on text-only United States Medical Licensing Examination (USMLE) questions and effectively answering questions from physicians. However, the ability of these models to interpret medical images has not been well studied.
Objective:
This study aimed to evaluate GPT-4V's accuracy and explanation quality on image-based USMLE multiple-choice questions and to assess whether its image interpretation errors can be mitigated through interaction with human experts.
Methods:
We used multiple-choice questions with images from the USMLE to test GPT-4V’s accuracy and explanation quality. GPT-4V's accuracy was compared with that of two state-of-the-art large language models (LLMs), ChatGPT and GPT-4. Explanation quality was evaluated across 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V’s explanation ability, we modified a patient case report to resemble a typical “curbside consultation” between physicians.
Results:
GPT-4V outperformed ChatGPT (58.4%) and GPT-4 (83.6%), with an overall accuracy of 90.7%. For questions with images, GPT-4V achieved an accuracy of 62.0%, equivalent to the 70th-80th percentile among medical students. When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts. When it answered incorrectly, however, explanation quality was often poor: 18.2% of incorrect responses contained fabricated text, 45.5% contained inferencing errors, and 76.3% demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40.5%. Nevertheless, in the curbside consultation setting, GPT-4V required continuous specialized guidance to make partially correct diagnoses and subsequent examination recommendations.
Conclusions:
GPT-4V achieved high accuracy on multiple-choice questions with images. However, its explanation quality was poor when it answered incorrectly, and this issue could not be efficiently resolved through expert interaction in clinical practice. Our findings highlight the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.