Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jul 15, 2024
Date Accepted: Jul 1, 2025
Date Submitted to PubMed: Jul 17, 2025
Token Probabilities to Mitigate Large Language Models' Overconfidence in Answering Medical Questions
ABSTRACT
Background:
Chatbots have demonstrated promising capabilities in medicine, achieving passing scores on board examinations across various specialties. However, their tendency to express high confidence in their responses, even when incorrect, limits their utility in clinical settings.
Objective:
To examine whether token probabilities outperform chatbots' Expressed Confidence levels in predicting the accuracy of their responses to medical questions.
Methods:
Seven large language models (LLMs), comprising both commercial (GPT-3.5, GPT-4, and GPT-4o) and open-source (Llama 3-8B, Llama 3-70B, Phi-3-Mini, and Phi-3-Medium) models, were prompted to respond to a set of 2,522 questions from the US Medical Licensing Examination (MedQA database). Additionally, the models rated their confidence from 0 to 100, and the token probability of each response was extracted. The models' success rates were measured, and the performances of both Expressed Confidence and Response Token Probability in predicting response accuracy were evaluated using the Area Under the Receiver Operating Characteristic Curve (AUROC), Adapted Calibration Error (ACE), and Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA, n=2,797), Chinese (MedQA Mainland China, n=3,413 and Taiwan, n=2,808), and French (FrMedMCQA, n=1,079).
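The evaluation procedure described above can be sketched in a few lines of code. The snippet below is an illustrative reconstruction, not the study's actual pipeline: it converts a token log-probability (as exposed by LLM APIs that return logprobs) into a probability, then scores a set of confidence values against response correctness using a rank-based AUROC and the Brier score. The sample inputs are invented for demonstration.

```python
import math

def token_probability(logprob: float) -> float:
    """Convert a token log-probability (as returned by LLM APIs that
    expose logprobs) into a probability in [0, 1]."""
    return math.exp(logprob)

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U statistic): the probability that
    a correct response receives a higher score than an incorrect one,
    counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(probs, labels):
    """Mean squared difference between the predicted probability of a
    correct answer and the observed outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Illustrative data: answer-token logprobs and correctness labels.
logprobs = [-0.05, -0.2, -1.2, -1.6]
labels = [1, 1, 0, 0]
probs = [token_probability(lp) for lp in logprobs]
print(auroc(probs, labels))  # perfect separation on this toy data: 1.0
print(brier(probs, labels))
```

A well-calibrated model would also show probabilities close to the empirical accuracy within each confidence bin, which is what calibration metrics such as ACE quantify.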
Results:
Overall, mean accuracy ranged from 52.7% [50.8-54.7] for Phi-3-Mini to 87.6% [86.2-88.9] for GPT-4o. Across the US Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90 [90-90] for Llama 3-70B to 100 [100-100] for GPT-3.5). However, Expressed Confidence failed to predict response accuracy (AUROC ranging from 0.52 [0.50-0.53] for Phi-3-Mini to 0.68 [0.65-0.71] for GPT-4o). In contrast, the Response Token Probability consistently outperformed Expressed Confidence in predicting response accuracy (AUROC ranging from 0.67 [0.65-0.69] for Phi-3-Mini to 0.83 [0.81-0.85] for Llama 3-70B, all p-values <0.001). Furthermore, all models demonstrated imperfect calibration, with a general trend towards overconfidence. These findings were consistent in sensitivity analyses.
Conclusions:
Given the limited capacity of chatbots to accurately evaluate their confidence when responding to medical queries, clinicians and patients should refrain from relying on their self-rated certainty. Instead, token probabilities emerge as a promising and easily accessible alternative for gauging the inner doubts of these models.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.