Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jul 14, 2023
Date Accepted: Nov 27, 2023
Evaluation of GPT-4 for chest X-ray impression generation: A reader study on performance and perception
ABSTRACT
Background:
The remarkable generative capabilities of multimodal foundation models are currently being explored for a variety of applications. Generating radiological impressions is a challenging task that could significantly reduce the workload of radiologists.
Objective:
To explore and analyze the generative capabilities of a multimodal foundation model for chest X-ray impression generation.
Methods:
To generate and evaluate impressions of chest X-rays based on different input modalities (image, text, text and image), a blinded radiological report was written for 25 cases of the publicly available NIH dataset. GPT-4 (generative pre-trained transformer 4) was given the image, the findings section, or both sequentially to generate a modality-dependent impression. In a blinded, randomized reading, 4 radiologists rated the impressions on “coherence”, “factual consistency”, “comprehensiveness”, and “medical harmfulness” and were asked to classify each impression’s origin (human vs AI), providing justification for their decision. Lastly, automated text evaluation metrics and their correlation with the radiological score (the sum of the 4 dimensions) were assessed.
Results:
According to the radiological score, the human-written impressions were rated highest, although not significantly different from the text-based impressions. The automated evaluation metrics showed moderate to substantial correlations with the radiological score for the image-based impressions; however, individual scores diverged widely across modalities, indicating that they insufficiently represent radiological quality. Detection of AI-generated impressions varied by input modality and was 61% for text-based impressions. The leading reasons for classification were modality dependent: for image input, the main reason given was factual consistency (85%), whereas for text input the distribution of reasons was more homogeneous, similar to that of radiologist-written impressions classified as AI-generated. Impressions classified as AI-generated received significantly worse radiological scores even when written by a radiologist, indicating a potential bias.
Conclusions:
Our study revealed significant discrepancies between radiological assessment and common automatic evaluation metrics, depending on the model input. The detection of AI-generated impressions is subject to a bias whereby highly rated impressions are perceived as human-written.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.