Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Dec 18, 2023
Open Peer Review Period: Dec 25, 2023 - Feb 19, 2024
Date Accepted: Mar 13, 2024
Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration
ABSTRACT
Background:
Several multimodal generative artificial intelligence (AI) systems, including ChatGPT-4 with vision (also known as ChatGPT-4V or ChatGPT-4Vision), accept image data alongside text data. However, how adding image data changes the diagnostic accuracy of ChatGPT-4 is unknown.
Objective:
We compared the diagnostic accuracy of ChatGPT-4 with vision, given text and images (intervention), against that of ChatGPT-4 without vision, given text only (control), for case descriptions derived from case reports.
Methods:
We used a dataset of case descriptions and final diagnoses derived from the American Journal of Case Reports, published from January 2022 to March 2023. We also extracted the figures and tables mentioned in the case descriptions as image data. We excluded nondiagnostic case reports, pediatric case reports, and case reports without figures or tables in their case descriptions. From the case descriptions and images, ChatGPT-4 with vision generated differential-diagnosis lists. We compared its diagnostic accuracy with that of ChatGPT-4 without vision, which received the same case descriptions without images. Two physicians independently evaluated whether the final diagnosis was included in each list; discrepancies were resolved by a third physician.
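The outcome measure above is whether the final diagnosis appears in a model-generated differential-diagnosis list. In the study this judgment was made by physicians, not by software; the sketch below is only a naive string-matching illustration of the inclusion check, with made-up diagnoses.

```python
# Hypothetical sketch of the evaluation step: does the final diagnosis
# appear in a generated differential-diagnosis list? The study used two
# independent physician raters for this judgment; naive case-insensitive
# substring matching here is for illustration only.

def diagnosis_in_list(final_diagnosis: str, differential_list: list[str]) -> bool:
    """Return True if the final diagnosis matches any list entry."""
    target = final_diagnosis.strip().lower()
    return any(target in candidate.lower() for candidate in differential_list)

# Example with invented data (not from the study):
differentials = ["Acute appendicitis", "Mesenteric adenitis", "Ovarian torsion"]
print(diagnosis_in_list("acute appendicitis", differentials))  # True
```

Real clinical adjudication must handle synonyms and differing granularity (e.g., "MI" vs. "myocardial infarction"), which is why the study relied on physician evaluation rather than automated matching.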
Results:
A total of 363 case descriptions were included. The rate of final diagnoses appearing within the top 10 differential-diagnosis lists generated by ChatGPT-4 with vision was 85.1% (309/363), not significantly different from the 87.9% (319/363) achieved by ChatGPT-4 without vision (P=.33). The rate of final diagnoses ranked as the top diagnosis by ChatGPT-4 with vision was 44.4% (161/363), inferior to the 55.9% (203/363) achieved by ChatGPT-4 without vision (P=.002).
Conclusions:
The rates of final diagnoses within the differential-diagnosis lists generated by ChatGPT-4 with vision did not improve compared with those generated without vision. The rate of final diagnoses as the top diagnosis generated by ChatGPT-4 with vision was inferior to that without vision. These results suggest that ChatGPT-4 with vision, a multimodal generative AI system, relies mainly on text data when generating differentials, even though it accepts image data. Multimodal generative AI systems should be further developed to better integrate clinical data and improve diagnostic performance before being used in medicine. Clinical Trial: Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.