
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Mar 14, 2025
Open Peer Review Period: Mar 14, 2025 - May 9, 2025
Date Accepted: Jun 10, 2025

The final, peer-reviewed published version of this preprint can be found here:

Huang Y, Yang G, Shen Y, Chen H, Wu W, Wu Y, Zhang K, Xu J, Li X, Zhang J

Application of Large Language Models in Complex Clinical Cases: Cross-Sectional Evaluation Study

JMIR Med Inform 2025;13:e73941

DOI: 10.2196/73941

PMID: 41055081

PMCID: 12501899

Application of Large Language Models in Complex Clinical Cases: Cross-Sectional Evaluation Study

  • Yuanheng Huang
  • Guozhen Yang
  • Yahui Shen
  • Huiguo Chen
  • Weibin Wu
  • Yonghui Wu
  • Kai Zhang
  • Jiannan Xu
  • Xiaojun Li
  • Jian Zhang

ABSTRACT

Background:

Large language models (LLMs) have made significant advancements in natural language processing (NLP) and are increasingly showing potential for application in the medical field. However, LLMs still face substantial challenges in medical applications.

Objective:

This study aims to evaluate the efficiency, accuracy, and cost of LLMs in handling complex medical cases and to analyze their prospects and potential value as clinical decision support tools.

Methods:

We selected cases from the database of the Department of Cardiothoracic Surgery, the Third Affiliated Hospital of Sun Yat-sen University (2021-2024), and conducted a multidimensional preliminary evaluation of the latest LLMs in clinical decision-making for complex cases. The evaluation measured the time each LLM took to generate decision recommendations, Likert-scale ratings of the recommendations, and decision costs, in order to assess the models' execution efficiency, accuracy, and cost-effectiveness.
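
The per-model aggregation described above can be sketched as a small data-summarization step. Everything below is an illustrative assumption, not the study's actual data or code: the record fields (`minutes`, `likert`, `cost`), the toy values, and the model labels are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from collections import defaultdict

@dataclass
class CaseEvaluation:
    model: str      # decision source, e.g. "GPT-o1" or "Expert MDT" (labels assumed)
    minutes: float  # time to produce a decision recommendation
    likert: int     # Likert rating of the recommendation (scale assumed)
    cost: float     # direct decision cost (currency/units assumed)

def summarize(records):
    """Group records by model and compute mean time, Likert score, and cost."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r.model].append(r)
    return {
        model: {
            "mean_minutes": mean(r.minutes for r in rs),
            "mean_likert": mean(r.likert for r in rs),
            "mean_cost": mean(r.cost for r in rs),
        }
        for model, rs in grouped.items()
    }

# Toy records with made-up values, for illustration only.
records = [
    CaseEvaluation("GPT-o1", 0.70, 5, 0.02),
    CaseEvaluation("GPT-o1", 0.72, 4, 0.02),
    CaseEvaluation("Expert MDT", 33.0, 5, 120.0),
    CaseEvaluation("Expert MDT", 34.2, 5, 115.0),
]
summary = summarize(records)
print(summary["GPT-o1"]["mean_minutes"])  # mean decision time on the toy records
```

In the study itself, per-model means like these would then be compared with significance tests across the 80 cases; this sketch covers only the aggregation step.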

Results:

A total of 80 complex cases were included in this study, and the performance of multiple LLMs in clinical decision-making was evaluated. Experts required 33.60 minutes on average, far longer than any LLM. GPT-o1 (0.71 minutes), GPT-4o (0.88 minutes), and DeepSeek-R1 (0.94 minutes) all finished in under a minute, with no statistically significant differences among them. Although Kimi, Gemini, LLaMa3-8B, and LLaMa3-70B took 1.02-3.20 minutes, they were still faster than the experts. In terms of decision accuracy, DeepSeek-R1 had the highest accuracy, with no significant difference from GPT-o1 (p=0.699), and both performed significantly better than GPT-4o, Kimi, Gemini, LLaMa3-70B, and LLaMa3-8B (p<0.001). Regarding decision costs, all LLMs cost significantly less than the multidisciplinary team (MDT), with open-source models such as DeepSeek-R1 offering a zero-direct-cost advantage.

Conclusions:

GPT-o1 and DeepSeek-R1 show strong clinical potential, boosting efficiency, maintaining accuracy, and reducing costs. GPT-4o and Kimi performed moderately, indicating suitability for broader clinical tasks. Further research is needed to validate the LLaMa3 series and Gemini in clinical decision-making.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.