JMIR Preprints #80167: Digital-Assisted Clinical Decision Making in Traditional Chinese Medicine: Benchmark Testing of Five Large Language Models and Evaluation of Human-AI Collaborative Clinical Decision-Making

Digital-Assisted Clinical Decision Making in Traditional Chinese Medicine: Benchmark Testing of Five Large Language Models and Evaluation of Human-AI Collaborative Clinical Decision-Making

Weiwei Liu;
Shuchang Miao;
Qun Ma;
Yuxin Li;
Yinxiang Deng;
Xiaoqiu Wang;
Xinwei Zhang;
Nuosha Ma;
Hanchi Miao;
Yang Si;
Qingxia Shi;
Lin Zhu;
Hongtao Shang;
Yue Wang

ABSTRACT

Background:

Traditional Chinese Medicine (TCM) clinical decision-making involves complex integration of syndrome differentiation, constitutional assessment, and individualized treatment selection, creating challenges for standardization and quality assurance. While large language models demonstrate remarkable capabilities in medical knowledge integration and clinical reasoning, their application to TCM remains largely unexplored, particularly regarding syndrome differentiation principles and prescription formulation logic.

Objective:

This study aimed to evaluate five contemporary large language models (LLMs) in TCM clinical decision-making and to assess the effectiveness of human-AI collaboration compared with independent decision-making. Specific objectives were to benchmark LLM performance in TCM knowledge assessment, evaluate clinical case analysis capabilities, identify the best-performing model, and assess the quality, efficiency, and acceptability of human-AI collaborative decision-making.

Methods:

Five mainstream large language models were evaluated: Claude 3.7 Sonnet-Extended, ChatGPT 4.5, Grok3-DeepSearch, Gemini 2.0 Flash Thinking Experimental, and DeepSeek-R1. The evaluation employed a four-phase methodology: (1) TCM knowledge assessment using 160 standardized examination questions, (2) clinical case analysis of 30 cases representing different disease systems and complexity levels, (3) optimal model selection using weighted scoring (40% knowledge, 60% clinical analysis), and (4) clinical application assessment involving 10 TCM practitioners and 2 experts comparing physician-only, AI-only, and human-AI collaboration approaches across 5 clinical cases. Statistical analysis included descriptive statistics, reliability analysis, comparative testing, and regression analysis.
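The weighted scoring in phase 3 can be illustrated with a minimal sketch. The 40%/60% weights come from the Methods; the rescaling of both components to a common 0-100 scale, the function name `composite_score`, and the resulting composite value are assumptions made for illustration and are not reported by the authors. The input figures are the DeepSeek-R1 results reported in the abstract.

```python
# Illustrative sketch only: the abstract gives the weights (40% knowledge,
# 60% clinical analysis) but not how the two components were normalized.
# Here both are ASSUMED to be rescaled to 0-100 before weighting.

KNOWLEDGE_WEIGHT = 0.40   # from the Methods
CLINICAL_WEIGHT = 0.60    # from the Methods
CLINICAL_MAX_SCORE = 20   # clinical case analysis is reported out of 20


def composite_score(knowledge_accuracy_pct: float, clinical_mean_score: float) -> float:
    """Combine the two phase scores into one weighted score (hypothetical helper)."""
    clinical_pct = clinical_mean_score / CLINICAL_MAX_SCORE * 100.0
    return KNOWLEDGE_WEIGHT * knowledge_accuracy_pct + CLINICAL_WEIGHT * clinical_pct


# DeepSeek-R1 figures as reported in the abstract.
deepseek_r1 = composite_score(knowledge_accuracy_pct=96.7, clinical_mean_score=17.31)
print(f"Composite score under the assumed normalization: {deepseek_r1:.2f}")
```

Because clinical analysis carries the larger weight, a model that is strong on examination questions but weaker on case analysis would rank lower under this scheme than its knowledge accuracy alone suggests.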

Results:

DeepSeek-R1 performed best in both evaluation domains, achieving 96.7% accuracy in the knowledge assessment and a mean score of 17.31/20 in clinical case analysis, significantly outperforming the other models (P<.001). Compared with physician-only decision-making, human-AI collaboration improved decision quality by 16.1% (mean scores: 33.62 vs 28.97; P<.001) and reduced decision time by 66.1% (162.6 s vs 479.2 s; P<.001). System usability was rated favorably (System Usability Scale [SUS] score: 76.8; P=.002), and physicians largely accepted the AI suggestions (74.25% adopted, 24.0% modified, 1.75% rejected). AI assistance provided the greatest benefit in the prescription formulation and medication selection domains (P<.001).
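The reported percentages follow from the raw means as relative changes against the physician-only baseline. The short check below reproduces that arithmetic; the helper name `relative_change_pct` is hypothetical, and all numeric inputs are taken from the Results above.

```python
# Sanity check of the percentages quoted in the Results.
# All inputs are the means reported in the abstract; nothing else is assumed.

def relative_change_pct(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline` (hypothetical helper)."""
    return (new - baseline) / baseline * 100.0


# Quality: human-AI collaboration (33.62) vs physician-only (28.97).
quality_change = relative_change_pct(baseline=28.97, new=33.62)

# Time: human-AI collaboration (162.6 s) vs physician-only (479.2 s).
time_change = relative_change_pct(baseline=479.2, new=162.6)

# Acceptance categories should partition all AI suggestions.
acceptance_total = 74.25 + 24.0 + 1.75

print(f"Quality change: {quality_change:+.1f}%")
print(f"Time change: {time_change:+.1f}%")
print(f"Acceptance categories sum to: {acceptance_total:.2f}%")
```

The negative sign on the time change corresponds to the reduction described in the text.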

Conclusions:

Large language models, particularly DeepSeek-R1, demonstrate substantial capabilities in TCM knowledge assessment and clinical case analysis. Human-AI collaboration significantly enhanced clinical decision-making quality and efficiency while maintaining high physician acceptance. These findings provide compelling evidence for the clinical value of AI-assisted decision-making in traditional Chinese medicine, suggesting potential solutions to current challenges in knowledge standardization, clinical training, and healthcare delivery efficiency. Strategic implementation of AI assistance could significantly enhance the quality, efficiency, and accessibility of TCM care while preserving fundamental principles of individualized treatment.

Citation

Please cite as:

Liu W, Miao S, Ma Q, Li Y, Deng Y, Wang X, Zhang X, Ma N, Miao H, Si Y, Shi Q, Zhu L, Shang H, Wang Y

Digitally Assisted Clinical Decision-Making in Traditional Chinese Medicine: Comparative Study of 5 Large Language Models

JMIR Form Res 2026;10:e80167

DOI: 10.2196/80167

PMID: 41814991

PMCID: 12954686

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Formative Research

Date Submitted: Jul 6, 2025

Open Peer Review Period: Jul 8, 2025 - Sep 2, 2025

Date Accepted: Nov 27, 2025
