Currently submitted to: JMIR Formative Research

Date Submitted: Apr 19, 2026
Open Peer Review Period: Apr 19, 2026 - Jun 14, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

LLM-based conditional perplexity scoring distinguishes physician-verified diagnoses from incorrect differential diagnoses in case reports: a preliminary evaluation

  • Takanobu Hirosawa; 
  • Yukinori Harada; 
  • Ren Kawamura; 
  • Taro Shimizu

ABSTRACT

Background:

Large language models (LLMs) are increasingly used to generate differential diagnoses from clinical narratives. However, LLM-based diagnostic clinical decision support systems still lack a quantitative measure of how strongly a diagnosis is supported by the available case description. In natural language processing, the conditional perplexity score quantifies how predictable a target text is given a preceding context, with lower scores indicating greater predictability. We hypothesized that this concept could be adapted to diagnostic reasoning by treating the pre-diagnostic case description as the context and a diagnosis as the target text.
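In this framing, the conditional perplexity of a candidate diagnosis y given a case description x follows the standard definition below (the abstract does not state the authors' exact formulation, so this is the conventional form):

```latex
\mathrm{PPL}(y \mid x) = \exp\!\left( -\frac{1}{|y|} \sum_{t=1}^{|y|} \log p\left(y_t \mid x, y_{<t}\right) \right)
```

Here y_1, …, y_|y| are the tokens of the diagnosis, and p(y_t | x, y_<t) is the probability the language model assigns to each diagnosis token given the case description and the preceding diagnosis tokens; a diagnosis that the model finds predictable from the case description receives a lower score.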

Objective:

To evaluate whether an LLM-based conditional perplexity score can quantify the clinical compatibility between a case description and differential diagnoses. Specifically, we hypothesized that correct LLM-generated diagnoses verified by physicians would have lower conditional perplexity scores than incorrect LLM-generated differential diagnoses. A secondary objective was to compare this scoring behavior across differential diagnosis lists generated by different LLMs.

Methods:

We performed a preliminary computational analysis of 392 peer-reviewed diagnostic case reports published in The American Journal of Case Reports in 2022. For each case, the pre-diagnostic clinical description was used as the conditioning context, and the case report-defined final diagnoses were treated as the gold standard. Conditional perplexity scores for differential diagnosis lists previously generated by LLaMA2, Bard, and GPT-4 were computed using an independent longer-context LLM, Qwen2.5-1.5B. We compared case report-defined final diagnoses, correct LLM-generated diagnoses verified by physicians, and incorrect generated diagnoses using nonparametric comparisons and receiver operating characteristic analyses.
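The scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-token log-probabilities of the diagnosis tokens have already been obtained from a causal LM (in the study, Qwen2.5-1.5B conditioned on the pre-diagnostic case description), and shows only how those log-probabilities are turned into a conditional perplexity score.

```python
import math

def conditional_perplexity(target_logprobs):
    """Conditional perplexity of a target string (e.g., a diagnosis) given
    a context (e.g., a pre-diagnostic case description).

    `target_logprobs` holds the natural-log probabilities
    log p(y_t | context, y_<t) of the TARGET tokens only, as produced by a
    causal LM scoring the concatenation context + target. Lower values mean
    the target is more predictable from the context.
    """
    avg_nll = -sum(target_logprobs) / len(target_logprobs)
    return math.exp(avg_nll)

# Toy token log-probabilities for two hypothetical candidate diagnoses:
# a well-supported diagnosis receives higher token probabilities from the
# scoring model than a poorly supported one.
supported = [math.log(0.4), math.log(0.3), math.log(0.5)]
unsupported = [math.log(0.05), math.log(0.02), math.log(0.1)]

print(conditional_perplexity(supported))    # lower score
print(conditional_perplexity(unsupported))  # higher score
```

In practice the log-probabilities come from a forward pass of the scoring model over the concatenated context and candidate, keeping only the logits at the candidate-token positions; candidates from each LLM-generated differential list can then be ranked or thresholded by this score.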

Results:

All 392 cases had complete case descriptions and case report-defined final diagnoses. Across the top-10 differential diagnosis lists generated by LLaMA2, Bard, and GPT-4, 823 correct LLM-generated diagnoses verified by physicians and 10,875 incorrect LLM-generated diagnoses were analyzed. Case report-defined final diagnoses had lower conditional perplexity scores than incorrect LLM-generated diagnoses (median 39.9 [IQR 17.7-119.9] vs 133.3 [37.5-672.1]; P<.001). Correct LLM-generated diagnoses also had lower conditional perplexity scores than incorrect LLM-generated diagnoses (43.3 [16.6-147.5] vs 133.3 [37.6-672.1]; P<.001). Candidate-level discrimination was moderate overall (AUC 0.666, 95% CI 0.647-0.685) and was highest for GPT-4-generated differential diagnosis lists (AUC 0.678, 95% CI 0.647-0.707), followed by LLaMA2 (AUC 0.662, 95% CI 0.626-0.696) and Bard (AUC 0.648, 95% CI 0.616-0.682). In paired within-case analyses, the correct LLM-generated diagnosis had a lower conditional perplexity score than the average incorrect diagnosis in 91.1% of evaluable LLaMA2 cases, 88.9% of GPT-4 cases, and 88.1% of Bard cases (all P<.001).

Conclusions:

A conditional perplexity score derived from an independent LLM provided a quantitative signal that distinguished case report-defined final diagnoses and correct LLM-generated diagnoses verified by physicians from incorrect LLM-generated diagnoses. These findings support the conditional perplexity score as a promising adjunct for studying LLM-generated differential diagnoses. Clinical Trial: Not applicable.


 Citation

Please cite as:

Hirosawa T, Harada Y, Kawamura R, Shimizu T

LLM-based conditional perplexity scoring distinguishes physician-verified diagnoses from incorrect differential diagnoses in case reports: a preliminary evaluation

JMIR Preprints. 19/04/2026:98819

DOI: 10.2196/preprints.98819

URL: https://preprints.jmir.org/preprint/98819


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.