JMIR Preprints #102330: Trustworthiness Evaluation of Medical Vision-Language Models: A Scoping Review of Robustness, Grounding, Hallucination, and Uncertainty


Trustworthiness Evaluation of Medical Vision-Language Models: A Scoping Review of Robustness, Grounding, Hallucination, and Uncertainty

Binesh Sadanandan; Bibek Upadhayay; Vahid Behzadan

ABSTRACT

Background:

Medical vision-language models (VLMs) are increasingly used for clinical imaging tasks, but evaluation often emphasizes task accuracy rather than whether models are robust, calibrated, visually grounded, fair, and safe. Existing reviews summarize architectures or broad medical generative AI applications but do not systematically map trustworthiness evaluation practices for medical VLMs.

Objective:

This scoping review aimed to map which trustworthiness dimensions are evaluated for medical VLMs and multimodal large language models applied to clinical imaging, catalog the datasets, metrics, perturbation protocols, and image-reliance controls used, summarize reported safeguards, and identify reporting gaps to inform a minimum evaluation checklist.

Methods:

We conducted a scoping review following PRISMA-ScR. PubMed/MEDLINE, Embase, Scopus, Web of Science Core Collection, and IEEE Xplore were searched for peer-reviewed, English-language studies published from January 1, 2022, to March 25, 2026. Eligible studies evaluated at least one trustworthiness dimension of a medical VLM in a clinical imaging context. Eight dimensions were charted: robustness, hallucination, visual grounding, calibration and uncertainty, fairness, interpretability, distribution-shift generalization, and safety. Two reviewers independently screened records and charted study characteristics, evaluation methods, grounding controls, safeguards, and reproducibility indicators. The protocol was not prospectively registered.

Results:

We screened 516 records, comprising 506 database records and 10 records from citation chasing and targeted update searches. After title and abstract screening, 80 reports advanced to full-text review; 72 unique reports remained after merging 8 duplicates. Thirty-four peer-reviewed studies met inclusion criteria, and 29 preprints were tracked separately. Robustness was the most commonly evaluated dimension (14/34, 41.2%), followed by hallucination (9/34, 26.5%), distribution shift (6/34, 17.6%), and visual grounding (5/34, 14.7%). Interpretability and fairness were each evaluated in 4 studies (11.8%); calibration and safety were each addressed in 2 studies (5.9%). Only 5 studies (14.7%) used any image-reliance control, and 4 (11.8%) reported subgroup fairness analysis.
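
The arithmetic behind these figures can be rechecked with a short script. The following is a minimal sketch, not part of the authors' analysis: the counts are copied from the paragraph above, while the variable names and the `percent` helper are illustrative assumptions.

```python
# Counts copied from the Results paragraph; names here are illustrative only.
INCLUDED = 34  # peer-reviewed studies meeting inclusion criteria

dimension_counts = {
    "robustness": 14,
    "hallucination": 9,
    "distribution shift": 6,
    "visual grounding": 5,
    "interpretability": 4,
    "fairness": 4,
    "calibration": 2,
    "safety": 2,
}


def percent(count, total=INCLUDED):
    """Share of included studies, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


percentages = {name: percent(n) for name, n in dimension_counts.items()}

# Record flow as reported: database hits plus citation chasing and updates.
database_records, other_records = 506, 10
screened = database_records + other_records

# Full-text reports remaining after merging duplicates.
full_text, duplicates = 80, 8
unique_reports = full_text - duplicates

if __name__ == "__main__":
    for name, value in percentages.items():
        print(f"{name}: {dimension_counts[name]}/{INCLUDED} = {value}%")
    print(f"screened: {screened}, unique full-text reports: {unique_reports}")
```

The per-dimension counts sum to more than 34, which would be consistent with individual studies evaluating more than one dimension, although the abstract does not state this explicitly.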

Conclusions:

Trustworthiness evaluation for medical VLMs remains uneven and incomplete. The largest gaps are grounding controls, calibration reporting, fairness analysis, and deployment safeguards. We propose the Minimum Trustworthiness Evaluation Checklist (MiTEC), an eight-item framework to help authors, reviewers, and regulators assess whether medical VLMs are evaluated beyond task accuracy.

Citation

Please cite as:

Sadanandan B, Upadhayay B, Behzadan V

Trustworthiness Evaluation of Medical Vision-Language Models: A Scoping Review of Robustness, Grounding, Hallucination, and Uncertainty

JMIR Preprints. 25/05/2026:102330

DOI: 10.2196/preprints.102330

URL: https://preprints.jmir.org/preprint/102330


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently submitted to: JMIR AI

Date Submitted: May 25, 2026

Open Peer Review Period: May 29, 2026 - Jul 24, 2026


NOTE: This is an unreviewed Preprint
