Accepted for/Published in: JMIRx Med

Date Submitted: Mar 26, 2025
Open Peer Review Period: Mar 24, 2025 - May 19, 2025
Date Accepted: Aug 19, 2025
The final, peer-reviewed published version of this preprint can be found here:

Assessing the Limitations of Large Language Models in Clinical Practice Guideline–Concordant Treatment Decision-Making on Real-World Data: Retrospective Study

Roeschl T, Hoffmann M, Hashemi D, Rarreck F, Hinrichs N, Trippel TD, Gröschel MI, Unbehaun A, Klein C, Kempfert J, Dreger H, O’Brien B, Hindricks G, Balzer F, Falk V, Meyer A

JMIRx Med 2025;6:e74899

DOI: 10.2196/74899

PMID: 41190890

PMCID: 12587749

Assessing the Limitations of Large Language Models in Clinical Practice Guideline-concordant Treatment Decision-making on Real-world Data

  • Tobias Roeschl
  • Marie Hoffmann
  • Djawid Hashemi
  • Felix Rarreck
  • Nils Hinrichs
  • Tobias Daniel Trippel
  • Matthias I. Gröschel
  • Axel Unbehaun
  • Christoph Klein
  • Jörg Kempfert
  • Henryk Dreger
  • Benjamin O’Brien
  • Gerhard Hindricks
  • Felix Balzer
  • Volkmar Falk
  • Alexander Meyer

ABSTRACT

Background:

Previous studies have reported that large language models (LLMs) show promise in therapeutic decision-making, with performance comparable to that of medical experts, but these studies used highly curated patient data.

Objective:

The aim of this study was to determine whether LLMs can make guideline-concordant treatment decisions based on patient data as it is typically presented in clinical practice.

Methods:

We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical aortic valve replacement (SAVR; n=24) or transcatheter aortic valve replacement (TAVR; n=56) by our institutional heart team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, Llama-2, Mistral, PaLM 2, and DeepSeek-R1) were queried using either anonymized original medical reports or manually generated case summaries to determine the most guideline-concordant treatment. Agreement with the heart team was measured using Cohen's kappa coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using frequency bias indices (FBIs), with FBIs >1 indicating bias towards TAVR.
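
The agreement and fairness metrics named in the Methods can be illustrated with a small sketch. The Python below is a minimal illustration, not the authors' analysis code. It computes Cohen's kappa from observed and chance-expected agreement, and a frequency bias index as the ratio of the model's TAVR recommendations to the heart team's TAVR decisions. That ratio is the usual forecast-verification definition and is assumed here, because the abstract does not spell out the formula. The function names and the ten toy labels are hypothetical and are not study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        return float("nan")  # undefined when both raters use one identical label
    return (observed - expected) / (1.0 - expected)


def frequency_bias_index(reference, predicted, positive="TAVR"):
    """Predicted positives divided by reference positives (assumed definition)."""
    n_reference = sum(label == positive for label in reference)
    n_predicted = sum(label == positive for label in predicted)
    if n_reference == 0:
        raise ValueError("reference contains no positive cases")
    return n_predicted / n_reference


# Hypothetical toy labels, NOT taken from the study.
heart_team = ["TAVR"] * 6 + ["SAVR"] * 4
llm = ["TAVR"] * 5 + ["SAVR"] + ["TAVR"] * 3 + ["SAVR"]

kappa = cohens_kappa(heart_team, llm)
fbi = frequency_bias_index(heart_team, llm)
print(f"kappa = {kappa:.2f}, FBI = {fbi:.2f}")
```

In this toy data the model recommends TAVR 8 times against 6 TAVR decisions by the heart team, so the index exceeds 1, the direction the abstract describes as bias towards TAVR.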

Results:

When presented with original medical reports, LLMs showed poor performance (kappa: −0.47 to 0.22; ICC: 0.0 to 1.0; FBI: 0.95 to 1.51). The LLMs’ performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (kappa: −0.02 to 0.63; ICC: 0.01 to 1.0; FBI: 0.46 to 1.23). Qualitative analysis revealed instances of hallucinations in all LLMs tested.

Conclusions:

Even advanced LLMs require extensively curated input for informed treatment decisions. Unreliable responses, bias, and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.