Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jul 14, 2023
Date Accepted: Nov 27, 2023
Evaluation of GPT-4 for chest X-ray impression generation: A reader study on performance and perception
ABSTRACT
Background:
The remarkable generative capabilities of multimodal foundation models are currently being explored for a variety of applications. Generating radiological impressions is a challenging task that could significantly reduce the workload of radiologists.
Objective:
To explore and analyze the generative capabilities of a multimodal foundation model for chest X-ray impression generation.
Methods:
To generate and evaluate impressions of chest X-rays based on different input modalities (image, text, text and image), a blinded radiological report was written for 25 cases of the publicly available NIH dataset. GPT-4 (generative pre-trained transformer 4) was given the image, the findings section, or both sequentially to generate a modality-dependent impression. In a blinded, randomized reading, 4 radiologists rated the impressions on “coherence”, “factual consistency”, “comprehensiveness”, and “medical harmfulness” and were asked to classify each impression’s origin (human vs AI), providing justification for their decision. Lastly, automated text evaluation metrics and their correlation with the radiological score (the sum of the 4 dimensions) were assessed.
Results:
According to the radiological score, the human-written impressions were rated highest, although not significantly different from the text-based impressions. The automated evaluation metrics showed moderate to substantial correlations with the radiological score for the image-based impressions; however, individual scores diverged widely across modalities, indicating that they insufficiently represent radiological quality. Detection of AI-generated impressions varied by input modality and was 61% for text-based impressions. The leading reasons for classification were modality dependent: for image input, the main reason given was factual consistency (85%), whereas for text input the distribution of reasons was more homogeneous, similar to that of radiologist-written impressions classified as AI-generated. Impressions classified as AI-generated received significantly worse radiological scores even when written by a radiologist, indicating a potential bias.
Conclusions:
Our study revealed significant discrepancies between radiological assessment and common automatic evaluation metrics, depending on the model input. The detection of AI-generated impressions is subject to a bias whereby highly rated impressions are perceived as human-written.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.