Currently submitted to: JMIR Medical Informatics
Date Submitted: Jan 12, 2026
Open Peer Review Period: Jan 21, 2026 - Mar 18, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Multimodal Radiology Knowledge Graph Generation Using Vision Language Models
ABSTRACT
Background:
Knowledge graphs are increasingly important in radiology for representing factual clinical information and supporting downstream applications such as decision support, information retrieval, and structured reporting. However, generating radiology-specific knowledge graphs remains challenging due to the specialized vocabulary used in radiology reports, the scarcity of domain-annotated datasets, and the predominance of unimodal approaches that rely solely on text.
Objective:
To develop and evaluate a multimodal vision-language model (VLM) framework capable of generating radiology knowledge graphs from both radiographic images and their corresponding reports.
Methods:
We designed a VLM-based knowledge graph generation framework that integrates radiology images and free-text reports through instruction tuning and visual instruction tuning. The model is optimized for long-context radiology reports and structured triplet extraction, and its performance was compared with existing unimodal baselines on benchmark datasets.
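For illustration only, the extraction step can be sketched with an off-the-shelf open VLM. The abstract does not name the authors' backbone model, prompt template, or triplet schema, so all of those are assumptions in the minimal sketch below:

```python
# Illustrative sketch: prompting an off-the-shelf VLM to emit
# (head | relation | tail) triplets from a radiograph and its report.
# The checkpoint, prompt format, and triplet schema are assumptions;
# the abstract does not specify the authors' actual backbone or
# instruction-tuning setup.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # hypothetical stand-in backbone

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("chest_xray.png").convert("RGB")   # paired radiograph
report = "Mild cardiomegaly. No pleural effusion."    # paired free-text report

# Instruction asking for structured triplets grounded in both modalities.
prompt = (
    "USER: <image>\n"
    f"Radiology report: {report}\n"
    "Extract knowledge-graph triplets as (head | relation | tail), "
    "one per line.\nASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
# Illustrative output style:
#   (cardiomegaly | severity | mild)
#   (pleural effusion | presence | absent)
```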
Results:
Our multimodal VLM-KG (MIMIC) achieved the strongest overall performance across standard natural language generation (NLG) metrics, with the highest BLEU scores (BLEU-1: 54.98, BLEU-2: 49.65, BLEU-3: 46.12, BLEU-4: 43.29), substantially outperforming all unimodal baselines, including the BERT-based DyGIE++ model. This improvement highlights the effectiveness of multimodal learning, in which integrating visual and linguistic information enhances contextual understanding during text generation. Although DyGIE++ achieved a comparable ROUGE-L score (56.49), VLM-KG (MIMIC) delivered markedly higher BLEU scores, indicating stronger n-gram overlap and more accurate triplet generation. VLM-KG (MIMIC) also achieved a competitive ROUGE-L score of 54.69, slightly below that of LLM-KG (MIMIC) (56.53), suggesting that while multimodal features improve precision, they may introduce minor variability in the generated outputs. Additionally, LLM-KG (MIMIC) consistently outperformed LLM-KG (IU) across all metrics (e.g., BLEU-3: 35.96 vs. 18.02), underscoring the advantage of training on a large-scale, domain-specific dataset.
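The reported BLEU-1 through BLEU-4 and ROUGE-L figures are standard n-gram overlap metrics. A minimal sketch of how they might be computed over linearized triplet strings follows; nltk and rouge-score are assumed stand-ins, as the abstract does not state the authors' actual scoring tooling:

```python
# Illustrative sketch of the reported evaluation: BLEU-1..4 and ROUGE-L
# over linearized triplet strings. The libraries and the triplet
# linearization format are assumptions for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "(cardiomegaly | severity | mild) (pleural effusion | presence | absent)"
generated = "(cardiomegaly | severity | mild) (effusion | presence | absent)"

ref_tokens, gen_tokens = reference.split(), generated.split()
smooth = SmoothingFunction().method1  # avoids zero scores on short strings

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = sentence_bleu([ref_tokens], gen_tokens, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure
print(f"ROUGE-L: {100 * rouge_l:.2f}")
```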
Conclusions:
This study presents the first multimodal VLM-driven approach for radiology knowledge graph generation. By leveraging both images and reports, the framework overcomes the limitations of previous text-only systems and provides a more comprehensive foundation for medical knowledge representation and downstream radiology informatics applications.
Keywords: Vision Language Models; Large Language Models; Knowledge Graph; Radiology; Multimodal AI; Medical NLP
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.