Currently submitted to: JMIR Medical Informatics
Date Submitted: Jan 12, 2026
Open Peer Review Period: Jan 21, 2026 - Mar 18, 2026
NOTE: This is an unreviewed Preprint
Multimodal Radiology Knowledge Graph Generation Using Vision Language Models
ABSTRACT
Background:
Knowledge graphs are increasingly important in radiology for representing factual clinical information and supporting downstream applications such as decision support, information retrieval, and structured reporting. However, generating radiology-specific knowledge graphs remains challenging due to the specialized vocabulary used in radiology reports, the scarcity of domain-annotated datasets, and the predominance of unimodal approaches that rely solely on text.
Objective:
To develop and evaluate a multimodal vision-language model (VLM) framework capable of generating radiology knowledge graphs from both radiographic images and the corresponding reports.
Methods:
We designed a VLM-based knowledge graph generation framework that integrates radiology images and free-text reports through instruction tuning and visual instruction tuning. The model is optimized for long-context radiology reports and structured triplet extraction. Its performance was compared with existing unimodal baselines on benchmark datasets.
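The structured triplet extraction step can be illustrated with a minimal parsing sketch. The `(head | relation | tail)` serialization format and the example report sentences below are illustrative assumptions, not the paper's actual output schema:

```python
import re

def parse_triplets(generated: str):
    """Parse a model's generated text into (head, relation, tail) triplets.

    Assumes the VLM serializes each triplet as "(head | relation | tail)",
    one per line -- a hypothetical format chosen for illustration only.
    """
    triplets = []
    for match in re.finditer(r"\(([^|()]+)\|([^|()]+)\|([^|()]+)\)", generated):
        head, rel, tail = (part.strip() for part in match.groups())
        triplets.append((head, rel, tail))
    return triplets

# Hypothetical model output for a chest radiograph report
output = "(left lung | shows | consolidation)\n(cardiac silhouette | is | enlarged)"
print(parse_triplets(output))
```

Downstream applications such as retrieval or decision support would then load these triplets into a graph store keyed on the head and tail entities.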
Results:
Our multimodal VLM-KG (MIMIC) demonstrated the strongest overall performance across standard NLG metrics, achieving the highest BLEU scores (BLEU-1: 54.98, BLEU-2: 49.65, BLEU-3: 46.12, BLEU-4: 43.29), substantially outperforming all unimodal baselines, including the BERT-based DyGIE++ model. This improvement highlights the effectiveness of multimodal learning, where the integration of visual and linguistic information enhances contextual understanding in text generation. Although DyGIE++ achieved a comparable ROUGE-L score (56.49), VLM-KG (MIMIC) provided markedly higher BLEU scores, indicating stronger n-gram overlap and more accurate triplet generation. VLM-KG (MIMIC) also achieved a competitive ROUGE-L score of 54.69, slightly lower than LLM-KG (MIMIC) (56.53), suggesting that while multimodal features improve precision, they may introduce minor variability in generated outputs. Additionally, LLM-KG (MIMIC) consistently outperformed LLM-KG (IU) across all metrics (e.g., BLEU-3: 35.96 vs. 18.02), underscoring the advantages of training on a large-scale, domain-specific dataset.
Conclusions:
This study presents the first multimodal VLM-driven approach for radiology knowledge graph generation. By leveraging both images and reports, the framework overcomes limitations of previous text-only systems and provides a more comprehensive foundation for medical knowledge representation and downstream radiology informatics applications.
Keywords: Vision Language Models; Large Language Models; Knowledge Graph; Radiology; Multimodal AI; Medical NLP
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.