Accepted for/Published in: JMIR Formative Research

Date Submitted: Oct 17, 2023
Date Accepted: Jan 10, 2024

The final, peer-reviewed published version of this preprint can be found here:

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

Ji J, Chen X, Hou Y, Pan Y

JMIR Form Res 2024;8:e32690

DOI: 10.2196/32690

PMID: 38329788

PMCID: 10884898

A Vision-Language Model for Generating Textual Descriptions from Clinical Images: Model Development and Validation

  • Jia Ji
  • Xinyu Chen
  • Yongshuai Hou
  • Youcheng Pan

ABSTRACT

Background:

Automatic generation of radiology reports, which seeks to create a free-text description from a clinical radiograph, is emerging as a pivotal intersection between clinical medicine and artificial intelligence. Leveraging natural language processing technologies can accelerate report creation, enhancing healthcare quality and standardization. However, most existing studies have not yet fully tapped into the combined potential of advanced language and vision models.

Objective:

The purpose of this study was to explore the integration of pretrained vision-language models (VLMs) into radiology report generation, enabling a VLM to automatically convert clinical images into high-quality textual reports.

Methods:

In our research, we introduced a radiology report generation model named ClinicalBLIP, built upon the foundational InstructBLIP model and refined using clinical image-to-text datasets. A multistage fine-tuning approach based on low-rank adaptation (LoRA) was proposed to deepen the semantic comprehension of the visual encoder and the large language model for clinical imagery. Furthermore, prior knowledge was integrated through prompt learning to enhance the precision of the generated reports. Experiments were conducted on both the IU X-RAY and MIMIC-CXR datasets, and ClinicalBLIP was compared with several leading methods.
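
To make the fine-tuning setup concrete, the sketch below shows one way to attach low-rank adapters to an InstructBLIP-style model with the Hugging Face transformers and peft libraries. This is a minimal illustration, not the authors' implementation: the checkpoint name, LoRA hyperparameters, target modules, and prompt wording are assumptions, and the paper's multistage schedule (adapting the visual encoder and the language model in separate stages) is not reproduced here.

```python
# Minimal sketch: LoRA fine-tuning of an InstructBLIP-style VLM for report
# generation. Illustrative only -- the checkpoint, hyperparameters, target
# modules, and prompt are assumptions, not values reported in the paper.
import torch
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_name = "Salesforce/instructblip-vicuna-7b"  # assumed base checkpoint
processor = InstructBlipProcessor.from_pretrained(model_name)
model = InstructBlipForConditionalGeneration.from_pretrained(model_name)

# Attach low-rank adapters to the attention projections; only the small
# adapter matrices are trained while the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LLaMA/Vicuna-style module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(image, report):
    """One step: condition on the image plus a textual prompt and supervise
    against the reference report. The prompt wording is hypothetical."""
    prompt = "Describe the findings in this chest radiograph."
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    labels = processor.tokenizer(report, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```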

Results:

Experimental results show that ClinicalBLIP achieves METEOR scores of 0.570/0.365 and ROUGE scores of 0.534/0.313 on the IU X-RAY/MIMIC-CXR test sets, respectively, notably surpassing existing state-of-the-art methods. Further evaluations confirm that both the multistage fine-tuning and the integration of prior information contribute substantial improvements.
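
As a point of reference for the metrics quoted above, the snippet below shows one common way to compute METEOR and ROUGE for generated reports with the Hugging Face evaluate library. It is an illustrative sketch, not the paper's evaluation pipeline, and the choice of ROUGE-L as the reported variant is an assumption.

```python
# Illustrative METEOR/ROUGE scoring of generated reports with the Hugging
# Face `evaluate` library. Not the paper's evaluation code; the example
# prediction/reference pair is made up, and ROUGE-L is an assumed variant.
import evaluate

meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

predictions = ["no acute cardiopulmonary abnormality ."]   # model outputs
references = ["the heart and lungs are normal in appearance ."]  # ground truth

print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
```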

Conclusions:

The proposed ClinicalBLIP demonstrated robustness and effectiveness in enhancing clinical radiology report generation, suggesting significant promise for real-world clinical applications.


Citation

Please cite as:

Ji J, Chen X, Hou Y, Pan Y

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

JMIR Form Res 2024;8:e32690

DOI: 10.2196/32690

PMID: 38329788

PMCID: 10884898


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.