Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 12, 2025
Date Accepted: Sep 22, 2025

The final, peer-reviewed published version of this preprint can be found here:

Performance of Large Language Models in Diagnosing Rare Hematologic Diseases and the Impact of Their Diagnostic Outputs on Physicians: Combined Retrospective and Prospective Study

Yu H, Chen T, Zhang X, Yang Y, Liu Q, Yang C, Shen K, Li H, Tang W, Zhong X, Shuai X, Yu X, Liao Y, Wang C, Zhu H, Wu Y

J Med Internet Res 2025;27:e77334

DOI: 10.2196/77334

PMID: 41070713

PMCID: 12511990

Large language models for rare hematologic disease diagnosis: retrospective performance and prospective impact on physicians

  • Hongbin Yu
  • Tian Chen
  • Xin Zhang
  • Yunfan Yang
  • Qinyu Liu
  • Chenlu Yang
  • Kai Shen
  • He Li
  • Wenjiao Tang
  • Xushu Zhong
  • Xiao Shuai
  • Xinmei Yu
  • Yi Liao
  • Chiyi Wang
  • Huanling Zhu
  • Yu Wu

ABSTRACT

Background:

Rare hematologic diseases are frequently underdiagnosed or misdiagnosed due to their clinical complexity. Whether new-generation large language models (LLMs), particularly those employing chain-of-thought (CoT) reasoning, can improve diagnostic accuracy remains unclear.

Objective:

To evaluate the diagnostic performance of new-generation commercial LLMs in rare hematologic diseases and to determine whether LLM output enhances physicians’ diagnostic accuracy.

Methods:

We conducted a two-phase study. In the retrospective phase, we evaluated seven mainstream LLMs on 158 non-public real-world admission records covering nine rare hematologic diseases. Diagnostic performance was assessed using Top-10 accuracy and mean reciprocal rank (MRR), and ranking stability was assessed using Jaccard similarity and entropy. Spearman’s rank correlation was used to examine the association between physicians’ diagnoses and LLM-generated outputs. In the prospective phase, 28 physicians with varying levels of experience each diagnosed five cases and were given access to LLM-generated diagnoses over three sequential steps, allowing us to assess whether LLM output improves diagnostic accuracy.
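The abstract names these metrics without giving formulas. The sketch below uses their standard textbook definitions, which may differ in detail from the study's implementation. All function names and the toy cases are hypothetical, and the use of entropy over the top-ranked diagnosis across repeated runs is an assumption, not something the abstract states.

```python
import math
from collections import Counter

def top_k_accuracy(cases, k=10):
    """Fraction of cases whose true diagnosis appears in the first k ranked guesses."""
    hits = sum(1 for ranked, truth in cases if truth in ranked[:k])
    return hits / len(cases)

def mean_reciprocal_rank(cases, k=10):
    """Mean of 1/rank of the true diagnosis within the first k guesses (0 if absent)."""
    total = 0.0
    for ranked, truth in cases:
        for position, diagnosis in enumerate(ranked[:k], start=1):
            if diagnosis == truth:
                total += 1.0 / position
                break
    return total / len(cases)

def jaccard(list_a, list_b):
    """Set overlap between two differential lists: |A & B| / |A | B|."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)

def shannon_entropy(labels):
    """Entropy in bits of a list of labels, e.g. the top diagnosis over repeated runs."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# Hypothetical toy data: (ranked differential from a model, true diagnosis).
toy_cases = [
    (["AL amyloidosis", "multiple myeloma", "POEMS syndrome"], "POEMS syndrome"),
    (["Castleman disease", "lymphoma"], "Castleman disease"),
    (["lymphoma", "sarcoidosis"], "Erdheim-Chester disease"),
    (["multiple myeloma", "AL amyloidosis"], "AL amyloidosis"),
]

accuracy = top_k_accuracy(toy_cases, k=10)
mrr = mean_reciprocal_rank(toy_cases, k=10)
```

Under these definitions, Top-10 accuracy only asks whether the correct diagnosis appears anywhere in the list, while MRR also rewards placing it near the top. Jaccard similarity and entropy describe how much the output varies between runs, independent of correctness.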

Results:

In the retrospective phase, ChatGPT-o1-preview achieved the highest Top-10 accuracy (70.3%) and MRR (0.577), a performance comparable to that of human physicians. DeepSeek-R1 ranked second. Diagnostic performance was low for AL amyloidosis, Castleman disease, Erdheim-Chester disease, and POEMS syndrome. Interestingly, for most LLMs, higher accuracy often correlated with lower ranking stability. Physician performance showed a strong correlation with both Top-10 accuracy (ρ = 0.565) and MRR (ρ = 0.650). In the prospective phase, LLMs significantly improved the diagnostic accuracy of less-experienced physicians, raising their performance to specialist levels; no significant benefit was observed for specialists. However, when LLMs generated biased responses, physician performance often failed to improve or even declined.

Conclusions:

Without fine-tuning, new-generation commercial LLMs can identify correct diagnoses for rare hematologic diseases with accuracy comparable to that of physicians and can elevate the diagnostic performance of less-experienced physicians to specialist levels. Nevertheless, biased LLM outputs may mislead clinicians, highlighting the need for critical appraisal and cautious clinical integration.

Clinical Trial:

Chinese Clinical Trial Registry Identifier: ChiCTR2400089959.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.