Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jul 14, 2023
Date Accepted: Nov 27, 2023

The final, peer-reviewed published version of this preprint can be found here:

Evaluation of GPT-4’s Chest X-Ray Impression Generation: A Reader Study on Performance and Perception

Ziegelmayer S, Marka AW, Lenhart N, Nehls N, Reischl S, Harder F, Sauter A, Makowski M, Graf M, Gawlitza J

J Med Internet Res 2023;25:e50865

DOI: 10.2196/50865

PMID: 38133918

PMCID: 10770784


Multimodal foundation model (GPT-4) for chest X-ray impression generation: Performance, perception, and evaluation.

  • Sebastian Ziegelmayer; 
  • Alexander W. Marka; 
  • Nicolas Lenhart; 
  • Nadja Nehls; 
  • Stefan Reischl; 
  • Felix Harder; 
  • Andreas Sauter; 
  • Marcus Makowski; 
  • Markus Graf; 
  • Joshua Gawlitza

ABSTRACT

Background:

The remarkable generative capabilities of multimodal foundation models are currently being explored for a variety of applications. Generating radiological impressions is a challenging task that could significantly reduce the workload of radiologists.

Objective:

To explore and analyze the generative abilities of a multimodal foundation model for chest X-ray impression generation.

Methods:

To generate and evaluate impressions of chest X-rays based on different input modalities (image, text, text and image), a blinded radiological report was written for 25 cases of the publicly available NIH dataset. GPT-4 was given the image, the findings section, or both sequentially to generate a modality-dependent impression. In a blinded, randomized reading, 4 radiologists rated the impressions on “coherence”, “factual consistency”, “comprehensiveness”, and “medical harmfulness” and were asked to classify the origin of each impression (human or AI), providing justification for their decision. Lastly, text model evaluation metrics and their correlation with the radiological score (the sum of the 4 dimensions) were assessed.
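The scoring and correlation step above can be sketched as follows. This is a minimal illustration, not the authors' actual analysis: the ratings and metric values are hypothetical, and the rank-based Spearman computation assumes no tied values.

```python
# Minimal sketch of the scoring step described above; ratings and metric
# values are hypothetical, not data from the study.

def rank(values):
    """Return 1-based ranks; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def spearman(x, y):
    """Spearman rank correlation via the classic d^2 formula (no ties)."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# One dict per impression: ratings on the 4 dimensions (illustrative 1-5 scale)
ratings = [
    {"coherence": 5, "factual_consistency": 4, "comprehensiveness": 4, "medical_harmfulness": 5},
    {"coherence": 3, "factual_consistency": 2, "comprehensiveness": 3, "medical_harmfulness": 4},
    {"coherence": 4, "factual_consistency": 4, "comprehensiveness": 3, "medical_harmfulness": 5},
]

# Radiological score: summation of the 4 dimensions per impression
radiological_scores = [sum(r.values()) for r in ratings]  # [18, 12, 16]

# Hypothetical automated text-metric values for the same impressions
automated_metric = [0.62, 0.35, 0.51]

rho = spearman(radiological_scores, automated_metric)
print(f"Spearman rho = {rho:.2f}")  # rho = 1.00 for this toy data
```

In the study, this correlation was computed between each automated metric and the radiological score across impressions; a library implementation (e.g., `scipy.stats.spearmanr`, which also handles ties) would typically be used in practice.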

Results:

According to the radiological score, the human-written impression was rated highest, although not significantly different from the text-based impressions. The automated evaluation metrics showed moderate to substantial correlations with the radiological score for the image-based impressions; however, individual scores were highly divergent among modalities, indicating insufficient representation of radiological quality. Detection of AI-generated impressions varied by input and was 61% for text-based impressions. The leading reasons were modality dependent: for image input, the main reason was factual consistency (85%), whereas for text input a more homogeneous distribution was found, similar to that for radiological impressions classified as AI-generated. Impressions classified as AI-generated had significantly worse radiological scores, even when written by a radiologist, indicating a potential bias.

Conclusions:

Our study revealed significant discrepancies between the radiological assessment and common automated evaluation metrics, depending on the model input. The detection of AI-generated impressions is subject to a bias whereby highly rated impressions are perceived as human-written.


Citation

Please cite as:

Ziegelmayer S, Marka AW, Lenhart N, Nehls N, Reischl S, Harder F, Sauter A, Makowski M, Graf M, Gawlitza J

Evaluation of GPT-4’s Chest X-Ray Impression Generation: A Reader Study on Performance and Perception

J Med Internet Res 2023;25:e50865

DOI: 10.2196/50865

PMID: 38133918

PMCID: 10770784


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.