Currently submitted to: JMIR AI

Date Submitted: May 25, 2026
Open Peer Review Period: May 29, 2026 - Jul 24, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Trustworthiness Evaluation of Medical Vision-Language Models: A Scoping Review of Robustness, Grounding, Hallucination, and Uncertainty

  • Binesh Sadanandan
  • Bibek Upadhayay
  • Vahid Behzadan

ABSTRACT

Background:

Medical vision-language models (VLMs) are increasingly used for clinical imaging tasks, but evaluation often emphasizes task accuracy rather than whether models are robust, calibrated, visually grounded, fair, and safe. Existing reviews summarize architectures or broad medical generative AI applications but do not systematically map trustworthiness evaluation practices for medical VLMs.

Objective:

This scoping review aimed to map which trustworthiness dimensions are evaluated for medical VLMs and multimodal large language models applied to clinical imaging, catalog the datasets, metrics, perturbation protocols, and image-reliance controls used, summarize reported safeguards, and identify reporting gaps to inform a minimum evaluation checklist.

Methods:

We conducted a scoping review following PRISMA-ScR. PubMed/MEDLINE, Embase, Scopus, Web of Science Core Collection, and IEEE Xplore were searched for peer-reviewed, English-language studies published from January 1, 2022, to March 25, 2026. Eligible studies evaluated at least one trustworthiness dimension of a medical VLM in a clinical imaging context. Eight dimensions were charted: robustness, hallucination, visual grounding, calibration and uncertainty, fairness, interpretability, distribution-shift generalization, and safety. Two reviewers independently screened records and charted study characteristics, evaluation methods, grounding controls, safeguards, and reproducibility indicators. The protocol was not prospectively registered.

Results:

We screened 516 records, including 506 database records and 10 records from citation chasing and targeted update searches. After title and abstract screening, 80 reports advanced to full-text review; 72 unique reports remained after merging 8 duplicates. Thirty-four peer-reviewed studies met inclusion criteria, and 29 preprints were tracked separately. Robustness was the most commonly evaluated dimension (14/34, 41.2%), followed by hallucination (9/34, 26.5%), distribution shift (6/34, 17.6%), and visual grounding (5/34, 14.7%). Interpretability and fairness were each evaluated in 4 studies (11.8%); calibration and safety were each addressed in 2 studies (5.9%). Only 5 studies (14.7%) used any image-reliance control, and 4 (11.8%) reported subgroup fairness analysis.
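The counts and percentages above can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not part of the authors' analysis: the dictionary keys and the helper name `pct` are illustrative assumptions, and the only inputs are the figures stated in the Results paragraph.

```python
# Figures copied from the Results paragraph; variable and key names are
# illustrative assumptions, not taken from the authors' materials.
N_INCLUDED = 34  # peer-reviewed studies meeting inclusion criteria

# dimension -> (number of studies, percentage stated in the abstract)
reported = {
    "robustness": (14, 41.2),
    "hallucination": (9, 26.5),
    "distribution shift": (6, 17.6),
    "visual grounding": (5, 14.7),
    "interpretability": (4, 11.8),
    "fairness": (4, 11.8),
    "calibration": (2, 5.9),
    "safety": (2, 5.9),
}


def pct(count, total=N_INCLUDED):
    """Share of included studies, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


recomputed = {name: pct(count) for name, (count, _) in reported.items()}

# Screening flow: database records plus supplementary records, and
# full-text reports minus merged duplicates.
total_screened = 506 + 10
unique_reports = 80 - 8

for name, (count, stated) in reported.items():
    print(f"{name:<20} {count:>2}/{N_INCLUDED}  stated {stated}%  recomputed {recomputed[name]}%")
print("records screened:", total_screened)
print("unique full-text reports:", unique_reports)
```

Because a study can evaluate more than one dimension, the per-dimension counts are not expected to sum to 34.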

Conclusions:

Trustworthiness evaluation for medical VLMs remains uneven and incomplete. The largest gaps are grounding controls, calibration reporting, fairness analysis, and deployment safeguards. We propose the Minimum Trustworthiness Evaluation Checklist (MiTEC), an eight-item framework to help authors, reviewers, and regulators assess whether medical VLMs are evaluated beyond task accuracy.


 Citation

Please cite as:

Sadanandan B, Upadhayay B, Behzadan V

Trustworthiness Evaluation of Medical Vision-Language Models: A Scoping Review of Robustness, Grounding, Hallucination, and Uncertainty

JMIR Preprints. 25/05/2026:102330

DOI: 10.2196/preprints.102330

URL: https://preprints.jmir.org/preprint/102330


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.