Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Mar 27, 2025
Date Accepted: Aug 6, 2025

The final, peer-reviewed published version of this preprint can be found here:

Jin Z, Shuai Y, Li Y, Chen M, Liu Y, Lei W, Fan X

A Vision-Language–Guided Multimodal Fusion Network for Glottic Carcinoma Early Diagnosis: Model Development and Validation Study

JMIR Med Inform 2025;13:e74902

DOI: 10.2196/74902

PMID: 41061147

PMCID: 12507326

A Vision-Language–Guided Multimodal Fusion Network for Glottic Carcinoma Early Diagnosis: Model Development and Validation Study

  • Zhaohui Jin; 
  • Yi Shuai; 
  • Yun Li; 
  • Mianmian Chen; 
  • Yumeng Liu; 
  • Wenbin Lei; 
  • Xiaomao Fan

ABSTRACT

Background:

Early diagnosis and intervention in glottic carcinoma can significantly reduce local disease incidence and improve long-term prognosis. However, accurate diagnosis of early glottic carcinoma is challenging because of its morphological similarity to vocal cord dysplasia, a difficulty that is further exacerbated in medically underserved areas.

Objective:

This study aims to address the limitations of existing technologies by designing a vision-language multimodal model that provides a more efficient and accurate method for the early diagnosis of glottic carcinoma.

Methods:

The data used in this study were sourced from the information system of the First Affiliated Hospital of Sun Yat-sen University, comprising electronic medical records and 5,796 laryngoscopic images from 404 patients with glottic lesions. We propose a Vision-Language–Guided Multimodal Fusion Network (VLMF-Net), built on a large vision-language model, for the early automated diagnosis of glottic carcinoma. The text processing module uses the pre-trained Large Language Model Meta AI (LLaMA) to generate text vector representations, while the image processing module employs a pre-trained Vision Transformer (ViT) to extract features from laryngoscopic images, with cross-modal alignment achieved through a Q-Former module. A self-developed feature fusion module then deeply integrates the text and image features, ultimately enabling diagnostic classification. To validate the model's performance, we selected CLIP, BLIP-2, ALIGN, and ViLT as baseline methods for experimental evaluation on the same dataset, assessing performance comprehensively with accuracy, recall, precision, and F1-score.
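To make the described pipeline concrete, the following is a minimal PyTorch sketch of how a pooled LLaMA text vector, ViT image features, a Q-Former-style alignment block, and a fusion head could fit together. All module names, dimensions, pooling choices, and the concatenation-based fusion here are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a VLMF-Net-style pipeline; all details are assumptions.
import torch
import torch.nn as nn

class QFormerBlock(nn.Module):
    """Simplified Q-Former-style block: learnable queries attend to ViT patch features."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats):  # image_feats: (B, n_patches, dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.norm(out)  # (B, num_queries, dim)

class VLMFNetSketch(nn.Module):
    """Hypothetical fusion head over a pooled LLaMA text vector and ViT image features."""
    def __init__(self, text_dim=4096, dim=768, num_classes=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)  # map LLaMA hidden size into the shared space
        self.qformer = QFormerBlock(dim=dim)
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_vec, image_feats):
        # text_vec: (B, text_dim) pooled LLaMA embedding of the medical record text
        # image_feats: (B, n_patches, dim) pre-extracted ViT features of a laryngoscopic image
        t = self.text_proj(text_vec)                    # (B, dim)
        v = self.qformer(image_feats).mean(dim=1)       # (B, dim), mean-pooled query outputs
        fused = self.fusion(torch.cat([t, v], dim=-1))  # simple concatenate-and-project fusion
        return self.classifier(fused)                   # logits over lesion classes

# Smoke test with random tensors standing in for the frozen encoders' outputs.
model = VLMFNetSketch()
logits = model(torch.randn(2, 4096), torch.randn(2, 197, 768))
print(logits.shape)  # torch.Size([2, 2])
```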

Results:

We found that on the internal test set, the VLMF-Net model significantly outperformed existing methods with an accuracy of 77.6% (CLIP: 70.5%, BLIP-2: 71.5%, ALIGN: 67.3%, ViLT: 64.3%), a 6.1 percentage point improvement over the best baseline model (BLIP-2). On the external test set, our method also performed robustly, achieving an accuracy of 73.9%, 4.6 percentage points higher than the second-best model (BLIP-2: 69.3%). These results indicate that our model surpasses these baselines in the early diagnosis of glottic carcinoma and exhibits strong generalization ability and robustness.
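For reference, the four reported metrics can be computed as in the following scikit-learn snippet; the label arrays are invented placeholders for illustration, not the study's data.

```python
# Illustrative computation of accuracy, precision, recall, and F1-score.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth lesion labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print(f"accuracy:  {accuracy_score(y_true, y_pred):.3f}")   # 0.750
print(f"precision: {precision_score(y_true, y_pred):.3f}")  # 0.750
print(f"recall:    {recall_score(y_true, y_pred):.3f}")     # 0.750
print(f"f1:        {f1_score(y_true, y_pred):.3f}")         # 0.750
```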

Conclusions:

The proposed VLMF-Net model can be effectively used for the early diagnosis of glottic carcinoma, helping to address the challenges in its early detection.


Citation

Please cite as:

Jin Z, Shuai Y, Li Y, Chen M, Liu Y, Lei W, Fan X

A Vision-Language–Guided Multimodal Fusion Network for Glottic Carcinoma Early Diagnosis: Model Development and Validation Study

JMIR Med Inform 2025;13:e74902

DOI: 10.2196/74902

PMID: 41061147

PMCID: 12507326


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.