Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 14, 2025
Open Peer Review Period: Mar 14, 2025 - May 9, 2025
Date Accepted: Jun 10, 2025
Application of Large Language Models in Complex Clinical Cases: Cross-Sectional Evaluation Study
ABSTRACT
Background:
Large language models (LLMs) have made significant advancements in natural language processing (NLP) and are gradually showing potential for application in the medical field. However, LLMs still face challenges in medicine.
Objective:
This study aims to evaluate the efficiency, accuracy, and cost of LLMs in handling complex medical cases and to analyze their prospects and potential value as clinical decision support tools.
Methods:
We selected cases from the database of the Department of Cardiothoracic Surgery, the Third Affiliated Hospital of Sun Yat-sen University (2021-2024), and conducted a multidimensional preliminary evaluation of the latest LLMs in clinical decision-making for complex cases. The evaluation measured the time each LLM took to generate decision recommendations, Likert-scale accuracy ratings, and decision costs, in order to assess the models' execution efficiency, accuracy, and cost-effectiveness.
Results:
A total of 80 complex cases were included in this study, and the performance of multiple LLMs in clinical decision-making was evaluated. Experts required 33.60 minutes on average, far longer than any LLM. GPT-o1 (0.71 minutes), GPT-4o (0.88 minutes), and DeepSeek-R1 (0.94 minutes) all finished in under a minute, with no statistically significant differences among them. Although Kimi, Gemini, LLaMa3-8B, and LLaMa3-70B took 1.02-3.20 minutes, they were still faster than experts. In terms of decision accuracy, DeepSeek-R1 scored highest, with no significant difference compared to GPT-o1 (p = 0.699), and both performed significantly better than GPT-4o, Kimi, Gemini, LLaMa3-70B, and LLaMa3-8B (p < 0.001). Regarding decision costs, all LLMs were significantly cheaper than the multidisciplinary team (MDT), with open-source models such as DeepSeek-R1 offering a zero-direct-cost advantage.
Conclusions:
GPT-o1 and DeepSeek-R1 show strong clinical potential, boosting efficiency, maintaining accuracy, and reducing costs. GPT-4o and Kimi performed moderately, indicating suitability for broader clinical tasks. Further research is needed to validate the LLaMa3 series and Gemini in clinical decision-making.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.