Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 25, 2025
Date Accepted: May 18, 2026

The final, peer-reviewed published version of this preprint can be found here:

Confidence Measurement Metrics in Multimodal Large Language Models for Ultrasound-Based Radiology Cases: Comparative Evaluation Study of Self-Reported, Consistency-Based, and Hybrid Methods

Han T, Shin J, Lee JH, Gu K

J Med Internet Res 2026;28:e86498

DOI: 10.2196/86498

PMID: 42228942

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Consistency-Based Confidence in Multimodal Large Language Models on Radiology Cases: Comparison with Self-Report

  • Taewon Han
  • Jaeseung Shin
  • Jeong Hyun Lee
  • Kyowon Gu

ABSTRACT

Background:

Safe deployment of large language models (LLMs) in healthcare systems requires methods for quantifying model confidence; however, established methods for confidence assessment are lacking.

Objective:

To evaluate output consistency as a confidence metric for multimodal LLMs interpreting radiology cases and to compare it with self-reported confidence.

Methods:

From 311 quizzes on the Korean Society of Ultrasound in Medicine digital platform, we selected 75 multiple-choice cases. Six multimodal LLMs were evaluated: three reasoning-focused models (o1, Claude-3.7-Sonnet, Gemini-2.5-Pro) and three general models (GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro). Temperature was fixed at 1.0. Two types of confidence metrics were assessed: (i) self-reported confidence, elicited by prompts asking the LLM to state a confidence percentage along with its answer, and (ii) consistency-based metrics derived from 20 repeated interpretations per case, namely relative entropy (R_H), calculated as 1 - H/log₂k (H = Shannon entropy, k = number of repetitions), and majority vote proportion. Receiver operating characteristic (ROC) analysis was conducted to assess discrimination, and Spearman correlation (r) was calculated between accuracy and each confidence metric. Additionally, model calibration was assessed using expected calibration error (ECE).
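The two consistency-based metrics described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `consistency_metrics` and the example answer lists are hypothetical, and k defaults to the number of repetitions because that is how the abstract defines it.

```python
import math
from collections import Counter


def consistency_metrics(answers, k=None):
    """Return (majority_vote_proportion, relative_entropy) for one case.

    answers: the option chosen in each repeated interpretation.
    k: normaliser inside log2(k); assumed here to default to the number
       of repetitions, following the definition given in the abstract.
    """
    n = len(answers)
    if n == 0:
        raise ValueError("answers must be non-empty")
    if k is None:
        k = n
    counts = Counter(answers)
    # Share of repetitions that agree with the most frequent answer.
    majority = max(counts.values()) / n
    # Shannon entropy (in bits) of the empirical answer distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # R_H = 1 - H / log2(k); defined as 1 when k == 1 to avoid log2(1) = 0.
    relative_entropy = 1.0 - entropy / math.log2(k) if k > 1 else 1.0
    return majority, relative_entropy


# Hypothetical cases with 20 repetitions each.
unanimous = ["A"] * 20
split = ["A"] * 5 + ["B"] * 5 + ["C"] * 5 + ["D"] * 5
print(consistency_metrics(unanimous))
print(consistency_metrics(split))
```

In this sketch a unanimous set of answers gives a majority vote proportion of 1 and R_H of 1, while an even split across four options lowers both values.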

Results:

Consistency-based metrics demonstrated significant correlation with diagnostic accuracy for Claude-3.7-Sonnet (percentage, r=0.314; R_H, r=0.310), Gemini-2.5-Pro (percentage, r=0.354; R_H, r=0.347), and GPT-4o (percentage, r=0.321; R_H, r=0.318). ROC analysis revealed that consistency-based metrics outperformed self-reported confidence in discriminative ability, with area under the curve values of 0.663 (95% CI: 0.545–0.768) for Claude-3.7-Sonnet, 0.694 (95% CI: 0.577–0.795) for Gemini-2.5-Pro, and 0.671 (95% CI: 0.533–0.775) for GPT-4o. For consistency-based metrics, regular ECE (10-bin) ranged from 0.313 to 0.485, while optimal ECE ranged from 0.276 to 0.478 with varying bin configurations.
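For readers unfamiliar with ECE, a 10-bin version can be sketched as below. This is an illustration under stated assumptions, not the study's implementation: equal-width bins over [0, 1] are assumed, the function name `expected_calibration_error` is hypothetical, and the sample inputs are made up rather than taken from the study.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between accuracy and mean confidence per bin.

    confidences: confidence scores in [0, 1], one per case.
    correct: booleans, True where the model's answer was right.
    n_bins: number of equal-width bins (an assumption of this sketch).
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the cases it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up example: four cases at 0.9 confidence, three answered correctly.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Lower values indicate that stated confidence tracks observed accuracy more closely; changing `n_bins` changes the estimate, which is presumably why the abstract reports both a regular 10-bin value and an optimal value.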

Conclusions:

In multimodal LLMs applied to radiology cases, consistency-based metrics provide a more dependable indicator of diagnostic confidence than self-reported confidence.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.