Accepted for/Published in: JMIR Formative Research

Date Submitted: Jul 6, 2025
Open Peer Review Period: Jul 8, 2025 - Sep 2, 2025
Date Accepted: Nov 27, 2025
The final, peer-reviewed published version of this preprint can be found here:

Liu W, Miao S, Ma Q, Li Y, Deng Y, Wang X, Zhang X, Ma N, Miao H, Si Y, Shi Q, Zhu L, Shang H, Wang Y

Digitally Assisted Clinical Decision-Making in Traditional Chinese Medicine: Comparative Study of 5 Large Language Models

JMIR Form Res 2026;10:e80167

DOI: 10.2196/80167

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Digital-Assisted Clinical Decision Making in Traditional Chinese Medicine: Benchmark Testing of Five Large Language Models and Evaluation of Human-AI Collaborative Clinical Decision-Making

  • Weiwei Liu
  • Shuchang Miao
  • Qun Ma
  • Yuxin Li
  • Yinxiang Deng
  • Xiaoqiu Wang
  • Xinwei Zhang
  • Nuosha Ma
  • Hanchi Miao
  • Yang Si
  • Qingxia Shi
  • Lin Zhu
  • Hongtao Shang
  • Yue Wang

ABSTRACT

Background:

Traditional Chinese Medicine (TCM) clinical decision-making involves complex integration of syndrome differentiation, constitutional assessment, and individualized treatment selection, creating challenges for standardization and quality assurance. While large language models demonstrate remarkable capabilities in medical knowledge integration and clinical reasoning, their application to TCM remains largely unexplored, particularly regarding syndrome differentiation principles and prescription formulation logic.

Objective:

This study aimed to evaluate five contemporary large language models (LLMs) in TCM clinical decision-making and to assess the effectiveness of human-AI collaboration compared with independent decision-making. Specific objectives were to benchmark LLM performance in TCM knowledge assessment, evaluate clinical case analysis capabilities, identify the best-performing model, and assess the quality, efficiency, and acceptability of human-AI collaborative decision-making.

Methods:

Five mainstream large language models were evaluated: Claude 3.7 Sonnet-Extended, ChatGPT 4.5, Grok3-DeepSearch, Gemini 2.0 Flash Thinking Experimental, and DeepSeek-R1. The evaluation employed a four-phase methodology: (1) TCM knowledge assessment using 160 standardized examination questions, (2) clinical case analysis of 30 cases representing different disease systems and complexity levels, (3) optimal model selection using weighted scoring (40% knowledge, 60% clinical analysis), and (4) clinical application assessment involving 10 TCM practitioners and 2 experts comparing physician-only, AI-only, and human-AI collaboration approaches across 5 clinical cases. Statistical analysis included descriptive statistics, reliability analysis, comparative testing, and regression analysis.
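The weighted scoring in phase 3 can be illustrated with a short sketch. The abstract states only the weights (40% knowledge, 60% clinical analysis); how the two components were put on a common scale is not described, so the rescaling of the clinical score to a 0-100 range is an assumption here, and the function name `composite_score` is hypothetical.

```python
def composite_score(knowledge_pct, case_score, case_max=20.0,
                    w_knowledge=0.4, w_clinical=0.6):
    """Combine a knowledge accuracy (0-100) and a clinical case score (0-case_max).

    Assumption: the clinical score is linearly rescaled to 0-100 before
    weighting. The abstract does not say how the authors normalized it.
    """
    clinical_pct = 100.0 * case_score / case_max
    return w_knowledge * knowledge_pct + w_clinical * clinical_pct


# Illustration with the DeepSeek-R1 values reported in the Results
# (96.7% accuracy, 17.31/20 mean case score).
example = composite_score(96.7, 17.31)
print(round(example, 2))
```

Under that normalization assumption, the reported DeepSeek-R1 values combine to roughly 90.6 out of 100. This figure only illustrates the arithmetic; it is not a composite score reported by the study.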

Results:

DeepSeek-R1 performed best in both evaluation domains, achieving 96.7% accuracy in knowledge assessment and a mean score of 17.31/20 in clinical case analysis, significantly outperforming the other models (P<.001). Human-AI collaboration yielded significant improvements over physician-only decision-making, with a 16.1% increase in quality (mean scores: 33.62 vs 28.97, P<.001) and a 66.1% reduction in time (162.6 s vs 479.2 s, P<.001). System usability was rated favorably (System Usability Scale [SUS] score: 76.8, P=.002), with high collaboration acceptance rates (74.25% adoption, 24.0% modification, 1.75% rejection). AI assistance provided the greatest benefits in the prescription formulation and medication selection domains (P<.001).
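The two headline percentages are consistent with a simple relative change against the physician-only baseline. The following sketch, which assumes that is how they were derived (the abstract does not state the formula), reproduces them from the reported means; `relative_change_pct` is a hypothetical helper name.

```python
def relative_change_pct(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Reported means: physician-only vs human-AI collaboration.
quality_gain = relative_change_pct(28.97, 33.62)  # quality score
time_change = relative_change_pct(479.2, 162.6)   # decision time in seconds

# The three acceptance categories should account for all responses.
acceptance_total = 74.25 + 24.0 + 1.75

print(round(quality_gain, 1), round(time_change, 1), acceptance_total)
```

Rounded to one decimal place, the quality change is 16.1% and the time change is -66.1%, matching the reported figures, and the three acceptance rates sum to 100%.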

Conclusions:

Large language models, particularly DeepSeek-R1, demonstrate substantial capabilities in TCM knowledge assessment and clinical case analysis. Human-AI collaboration significantly enhanced clinical decision-making quality and efficiency while maintaining high physician acceptance. These findings provide compelling evidence for the clinical value of AI-assisted decision-making in traditional Chinese medicine, suggesting potential solutions to current challenges in knowledge standardization, clinical training, and healthcare delivery efficiency. Strategic implementation of AI assistance could significantly enhance the quality, efficiency, and accessibility of TCM care while preserving fundamental principles of individualized treatment.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.