Accepted for/Published in: JMIR Formative Research

Date Submitted: Oct 17, 2023
Date Accepted: Jan 10, 2024

The final, peer-reviewed published version of this preprint can be found here:

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

Ji J, Chen X, Hou Y, Pan Y

JMIR Form Res 2024;8:e32690

DOI: 10.2196/32690

PMID: 38329788

PMCID: 10884898

A Vision-Language Model for Generating Textual Descriptions from Clinical Images: Model Development and Validation

  • Jia Ji
  • Xinyu Chen
  • Yongshuai Hou
  • Youcheng Pan

ABSTRACT

Background:

Automatic generation of radiology reports, which seeks to create a free-text description from a clinical radiograph, is emerging as a pivotal intersection between clinical medicine and artificial intelligence. Leveraging natural language processing technologies can accelerate report creation, enhancing healthcare quality and standardization. However, most existing studies have not yet fully tapped into the combined potential of advanced language and vision models.

Objective:

The purpose of this study was to explore the integration of pretrained vision-language models (VLMs) into radiology report generation, enabling a VLM to automatically convert clinical images into high-quality textual reports.

Methods:

In our research, we introduced a radiology report generation model named ClinicalBLIP, built upon the foundational InstructBLIP model and refined using clinical image-to-text datasets. A multistage fine-tuning approach based on low-rank adaptation (LoRA) was proposed to deepen the semantic comprehension of the visual encoder and the large language model for clinical imagery. Furthermore, prior knowledge was integrated through prompt learning to enhance the precision of the generated reports. Experiments were conducted on both the IU X-RAY and MIMIC-CXR datasets, and ClinicalBLIP was compared with several leading methods.
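
To make the fine-tuning setup concrete, the sketch below shows one way to attach low-rank adapters to an InstructBLIP-style model with the Hugging Face transformers and peft libraries. This is a minimal illustration, not the authors' implementation: the checkpoint name, LoRA hyperparameters, target modules, and prompt wording are assumptions, and the paper's multistage schedule (adapting the visual encoder and the language model in separate stages) is not reproduced here.

```python
# Minimal sketch: LoRA fine-tuning of an InstructBLIP-style VLM for report
# generation. Illustrative only -- the checkpoint, hyperparameters, target
# modules, and prompt are assumptions, not values reported in the paper.
import torch
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_name = "Salesforce/instructblip-vicuna-7b"  # assumed base checkpoint
processor = InstructBlipProcessor.from_pretrained(model_name)
model = InstructBlipForConditionalGeneration.from_pretrained(model_name)

# Attach low-rank adapters to the attention projections; only the small
# adapter matrices are trained while the base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # LLaMA/Vicuna-style module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(image, report):
    """One step: condition on the image plus a textual prompt and supervise
    against the reference report. The prompt wording is hypothetical."""
    prompt = "Describe the findings in this chest radiograph."
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    labels = processor.tokenizer(report, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```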

Results:

Experimental results show that ClinicalBLIP achieves METEOR scores of 0.570/0.365 and ROUGE scores of 0.534/0.313 on the IU X-RAY/MIMIC-CXR test sets, respectively, notably surpassing existing state-of-the-art methods. Further evaluations confirm that both the multistage fine-tuning and the integration of prior information contribute substantial improvements.
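
As a point of reference for the metrics quoted above, the snippet below shows one common way to compute METEOR and ROUGE for generated reports with the Hugging Face evaluate library. It is an illustrative sketch, not the paper's evaluation pipeline, and the choice of ROUGE-L as the reported variant is an assumption.

```python
# Illustrative METEOR/ROUGE scoring of generated reports with the Hugging
# Face `evaluate` library. Not the paper's evaluation code; the example
# prediction/reference pair is made up, and ROUGE-L is an assumed variant.
import evaluate

meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

predictions = ["no acute cardiopulmonary abnormality ."]   # model outputs
references = ["the heart and lungs are normal in appearance ."]  # ground truth

print(meteor.compute(predictions=predictions, references=references)["meteor"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
```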

Conclusions:

The proposed ClinicalBLIP demonstrated robustness and effectiveness in enhancing clinical radiology report generation, suggesting significant promise for real-world clinical applications.


Citation

Please cite as:

Ji J, Chen X, Hou Y, Pan Y

Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study

JMIR Form Res 2024;8:e32690

DOI: 10.2196/32690

PMID: 38329788

PMCID: 10884898


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.