Currently submitted to: JMIR Medical Informatics
Date Submitted: Jan 12, 2026
Open Peer Review Period: Jan 21, 2026 - Mar 18, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Multimodal Radiology Knowledge Graph Generation Using Vision Language Models
ABSTRACT
Background:
Knowledge graphs are increasingly important in radiology for representing factual clinical information and supporting downstream applications such as decision support, information retrieval, and structured reporting. However, generating radiology-specific knowledge graphs remains challenging due to the specialized vocabulary used in radiology reports, the scarcity of domain-annotated datasets, and the predominance of unimodal approaches that rely solely on text.
Objective:
To develop and evaluate a multimodal vision-language model (VLM) framework capable of generating radiology knowledge graphs from both radiographic images and their corresponding reports.
Methods:
We designed a VLM-based knowledge graph generation framework that integrates radiology images and free-text reports through instruction tuning and visual instruction tuning. The model is optimized for long-context radiology reports and structured triplet extraction, and its performance was compared with existing unimodal baselines on benchmark datasets.
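For illustration only, the extraction step can be sketched with an off-the-shelf open VLM. The abstract does not name the authors' backbone model, prompt template, or triplet schema, so all of those are assumptions in the minimal sketch below:

```python
# Illustrative sketch: prompting an off-the-shelf VLM to emit
# (head | relation | tail) triplets from a radiograph and its report.
# The checkpoint, prompt format, and triplet schema are assumptions;
# the abstract does not specify the authors' actual backbone or
# instruction-tuning setup.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # hypothetical stand-in backbone

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("chest_xray.png").convert("RGB")   # paired radiograph
report = "Mild cardiomegaly. No pleural effusion."    # paired free-text report

# Instruction asking for structured triplets grounded in both modalities.
prompt = (
    "USER: <image>\n"
    f"Radiology report: {report}\n"
    "Extract knowledge-graph triplets as (head | relation | tail), "
    "one per line.\nASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
# Illustrative output style:
#   (cardiomegaly | severity | mild)
#   (pleural effusion | presence | absent)
```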
Results:
Our multimodal VLM-KG (MIMIC) achieved the strongest overall performance across standard natural language generation (NLG) metrics, with the highest BLEU scores (BLEU-1: 54.98, BLEU-2: 49.65, BLEU-3: 46.12, BLEU-4: 43.29), substantially outperforming all unimodal baselines, including the BERT-based DyGIE++ model. This improvement highlights the effectiveness of multimodal learning, in which integrating visual and linguistic information enhances contextual understanding during text generation. Although DyGIE++ achieved a comparable ROUGE-L score (56.49), VLM-KG (MIMIC) delivered markedly higher BLEU scores, indicating stronger n-gram overlap and more accurate triplet generation. VLM-KG (MIMIC) also achieved a competitive ROUGE-L score of 54.69, slightly below that of LLM-KG (MIMIC) (56.53), suggesting that while multimodal features improve precision, they may introduce minor variability in the generated outputs. Additionally, LLM-KG (MIMIC) consistently outperformed LLM-KG (IU) across all metrics (e.g., BLEU-3: 35.96 vs. 18.02), underscoring the advantage of training on a large-scale, domain-specific dataset.
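The reported BLEU-1 through BLEU-4 and ROUGE-L figures are standard n-gram overlap metrics. A minimal sketch of how they might be computed over linearized triplet strings follows; nltk and rouge-score are assumed stand-ins, as the abstract does not state the authors' actual scoring tooling:

```python
# Illustrative sketch of the reported evaluation: BLEU-1..4 and ROUGE-L
# over linearized triplet strings. The libraries and the triplet
# linearization format are assumptions for illustration only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "(cardiomegaly | severity | mild) (pleural effusion | presence | absent)"
generated = "(cardiomegaly | severity | mild) (effusion | presence | absent)"

ref_tokens, gen_tokens = reference.split(), generated.split()
smooth = SmoothingFunction().method1  # avoids zero scores on short strings

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = sentence_bleu([ref_tokens], gen_tokens, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure
print(f"ROUGE-L: {100 * rouge_l:.2f}")
```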
Conclusions:
This study presents the first multimodal VLM-driven approach for radiology knowledge graph generation. By leveraging both images and reports, the framework overcomes the limitations of previous text-only systems and provides a more comprehensive foundation for medical knowledge representation and downstream radiology informatics applications.
Keywords: Vision Language Models; Large Language Models; Knowledge Graph; Radiology; Multimodal AI; Medical NLP
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.