JMIR Preprints #86498: Confidence Measurement Metrics in Multimodal Large Language Models on Ultrasound-Based Radiology Cases: Comparison between self-reported, consistency-based, and hybrid methods.

Current Preprint Settings

(as selected by the authors)

1. When the manuscript is submitted, allow peer review from:

(a) Anybody (open community peer review)
(b) Editor-selected reviewers (closed peer review)

2. When the manuscript is submitted, display the preprint PDF to:

(a) Anybody, anytime
(b) Logged-in users only
(c) Anybody, anytime (title and abstract only)
(d) No one

3. When the manuscript is accepted, display the accepted manuscript PDF to:

(a) Anybody, anytime
(b) Logged-in users only
(c) Anybody, anytime (title and abstract only)
(d) No one

Confidence Measurement Metrics in Multimodal Large Language Models on Ultrasound-Based Radiology Cases: Comparison between self-reported, consistency-based, and hybrid methods.

Taewon Han;
Jaeseung Shin;
Jeong Hyun Lee;
Kyowon Gu

ABSTRACT

Background:

Large language models (LLMs) require specialized methodologies to quantify model confidence for safe deployment in healthcare systems; however, there is a lack of established methods for confidence assessment.

Objective:

To evaluate confidence metrics for multimodal-LLMs interpreting ultrasound-based radiology cases and compare between self-reported, consistency based, and hybrid methods.

Methods:

From a total of 330 quizzes in the Korean Society of Ultrasound in Medicine digital platform, we selected 94 multiple-choice cases. Four multimodal-LLMs were evaluated, three reasoning models (GPT-5, Claude-4.5-Sonnet, Gemini-3-Pro) and one general model (GPT-4o). Temperature was fixed at 1.0. Multiple confidence metrics were assessed: (i) self-reported by LLMs using prompts that elicited direct confidence percentages with answer, including first self-reported confidence and mean self-reported confidence, (ii) consistency-based metrics derived from 20 repeated outputs per case including relative entropy (R_H) calculated as 1 - H/log₂k (H = Shannon-entropy, k = number of answer choices) and majority-vote percentage, and (iii) a Top Weighted Score combining response frequency with self-reported confidence. Receiver Operating Characteristic (ROC) analysis for discrimination and Spearman correlation between accuracy and each confidence metric was conducted. Additionally, model calibration was assessed using Expected Calibration Error (ECE) and Brier Score. Processing time and token consumption (input, output, total) were recorded for each API call to evaluate resource utilization across models.

Results:

Diagnostic accuracy varied across models, with Gemini-3-Pro achieving the highest accuracy (74.47%), surpassing the median human accuracy (59.0%). Top Weighted Score, a hybrid metric combining response frequency and self-reported confidence, was the only metric achieving statistically significant correlations across all four models: Gemini-3-Pro (ρ=0.52), GPT-5 (ρ=0.43), Claude-4.5-Sonnet (ρ=0.30), and GPT-4o (ρ=0.22). ROC analysis revealed that Top Weighted Score demonstrated the highest discriminative ability, with area under the curve values of 0.826 (95% CI: 0.731–0.920) for Gemini-3-Pro, and 0.767 (95% CI: 0.668–0.866) for GPT-5. Top Weighted Score was the only metric achieving statistical significance in GPT-4o. Calibration analysis showed Top Weighted Score achieved the lowest ECE in GPT-5 (0.098), and Claude-4.5-Sonnet (0.192), while Gemini-3-Pro showed comparable calibration between R_H (0.119) and Top Weighted Score (0.122). Resource utilization analysis demonstrated that reasoning models required substantially longer processing times and higher token consumption compared to general models.

Conclusions:

In multimodal LLMs applied to ultrasound-based radiology cases, hybrid methods (Top Weighted Score) demonstrated consistently significant associations across all evaluated models and appear to serve as a more reliable indicator of diagnostic confidence compared to self-reported or consistency-based metrics alone. These findings support integrative confidence estimation approaches that incorporate response consistency, while highlighting the need for resource-efficient sampling strategies to enable practical clinical deployment.

Citation

Please cite as:

Han T, Shin J, Lee JH, Gu K

Confidence Measurement Metrics in Multimodal Large Language Models for Ultrasound-Based Radiology Cases: Comparative Evaluation Study of Self-Reported, Consistency-Based, and Hybrid Methods

J Med Internet Res 2026;28:e86498

DOI: 10.2196/86498

PMID: 42228942

Download PDF

Request queued. Please wait while the file is being generated. It may take some time.

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on it's website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressively prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 25, 2025

Date Accepted: May 18, 2026

Confidence Measurement Metrics in Multimodal Large Language Models on Ultrasound-Based Radiology Cases: Comparison between self-reported, consistency-based, and hybrid methods.

ABSTRACT

Citation

Copyright