Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 27, 2025
Date Accepted: Aug 6, 2025
A Vision-Language Guided Multimodal Fusion Network for Glottic Carcinoma Early Diagnosis: Model Development and Validation study
ABSTRACT
Background:
Early diagnosis and intervention in glottic carcinoma can significantly reduce the rate of local disease and improve long-term prognosis. However, accurate diagnosis of early glottic carcinoma is challenging because of its morphological similarity to vocal cord dysplasia, a difficulty that is further exacerbated in medically underserved areas.
Objective:
This study aims to address the limitations of existing technologies by designing a vision-language multimodal model, providing a more efficient and accurate early diagnostic method for glottic carcinoma.
Methods:
The data used in this study were sourced from the information system of the First Affiliated Hospital of Sun Yat-sen University, comprising electronic medical records and 5,796 laryngoscopic images from 404 patients with glottic lesions. We propose a Vision-Language guided Multimodal Fusion Network (VLMF-Net), built on a large vision-language model, for the early automated diagnosis of glottic carcinoma. The text processing module uses the pre-trained Large Language Model Meta AI (LLaMA) to generate text vector representations, while the image processing module employs a pre-trained Vision Transformer (ViT) to extract features from laryngoscopic images, with cross-modal alignment achieved through a Q-Former module. A purpose-built feature fusion module then deeply integrates the text and image features, enabling the final classification diagnosis. To validate the model's performance, we selected CLIP, BLIP-2, ALIGN, and ViLT as baseline methods for experimental evaluation on the same dataset, with comprehensive performance assessment conducted using accuracy, recall, precision, and F1 score.
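The pipeline described above (modality-specific encoders, projection into a shared space, fusion, then classification) can be sketched in miniature. This is a minimal illustrative stand-in, not the paper's implementation: the feature dimensions, the gated-sum fusion rule, and all weight matrices below are assumptions, since the abstract does not specify the fusion module's internals, and random vectors stand in for the LLaMA and ViT encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the abstract):
# LLaMA-style text features, ViT-style image features, a shared
# fused space, and a binary carcinoma-vs-dysplasia decision.
TEXT_DIM, IMG_DIM, FUSED_DIM, NUM_CLASSES = 4096, 768, 512, 2

def fuse_and_classify(text_feat, img_feat, W_t, W_i, W_cls):
    """Project each modality into a shared space, fuse them with an
    element-wise gate, and classify. A simplified stand-in for the
    paper's feature fusion module."""
    t = np.tanh(text_feat @ W_t)            # text projection
    v = np.tanh(img_feat @ W_i)             # image projection
    gate = 1.0 / (1.0 + np.exp(-(t + v)))   # sigmoid gate per dimension
    fused = gate * t + (1.0 - gate) * v     # gated convex combination
    logits = fused @ W_cls
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()                      # class probabilities

# Random stand-ins for encoder outputs and (untrained) weights.
text_feat = rng.standard_normal(TEXT_DIM)
img_feat = rng.standard_normal(IMG_DIM)
W_t = rng.standard_normal((TEXT_DIM, FUSED_DIM)) * 0.01
W_i = rng.standard_normal((IMG_DIM, FUSED_DIM)) * 0.01
W_cls = rng.standard_normal((FUSED_DIM, NUM_CLASSES)) * 0.01

probs = fuse_and_classify(text_feat, img_feat, W_t, W_i, W_cls)
```

In a trained system the projections and gate would be learned end-to-end; the sketch only shows the data flow from two modality-specific feature vectors to a single class distribution.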
Results:
We found that on the internal test set, the VLMF-Net model significantly outperformed existing methods with an accuracy of 77.6% (CLIP: 70.5%, BLIP-2: 71.5%, ALIGN: 67.3%, ViLT: 64.3%), a 6.1 percentage point improvement over the best baseline model (BLIP-2). On the external test set, our method also demonstrated robust performance, achieving an accuracy of 73.9%, which is 4.6 percentage points higher than the second-best model (BLIP-2: 69.3%). These results indicate that our model surpasses these methods in the early diagnosis of glottic carcinoma and exhibits strong generalization ability and robustness.
Conclusions:
The proposed VLMF-Net model can be effectively used for the early diagnosis of glottic carcinoma, helping to address the challenges in its early detection.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.