Accepted for/Published in: JMIR Formative Research

Date Submitted: Jul 30, 2025
Date Accepted: Nov 21, 2025

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model Evaluation in Traditional Chinese Medicine for Stroke: Quantitative Benchmarking Study

Long H, Deng Y, Guo Y, Shen Z, Zhang Y, Bao J, He Y

JMIR Form Res 2025;9:e81545

DOI: 10.2196/81545

PMID: 41380151

PMCID: 12741655

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Construction and Empirical Study of an Evaluation Dataset for Large Language Models in the Field of TCM Stroke

  • Hulin Long
  • Yang Deng
  • Yaoguang Guo
  • Zifan Shen
  • Yuzhu Zhang
  • Ji Bao
  • Yang He

ABSTRACT

Background:

The application of Large Language Models (LLMs) in the medical field is rapidly advancing. However, effectively and comprehensively evaluating the capabilities of LLMs in specialized domains like Traditional Chinese Medicine (TCM), which possesses a unique theoretical system and cognitive framework, remains a significant challenge.

Objective:

This study aims to construct a specialized evaluation dataset for the field of TCM stroke and conduct an empirical study to reveal the capabilities and limitations of different types of LLMs in this domain.

Methods:

We systematically constructed an evaluation tool named the "Traditional Chinese Medicine - Stroke Evaluation Dataset" (TCM-SED). The dataset comprises 203 questions in three formats: short-answer questions, multiple-choice questions (single and multiple selection), and essay questions. It covers multiple dimensions of TCM stroke knowledge, such as diagnosis, pattern differentiation and treatment, herbal formulas, acupuncture, interpretation of classic texts, and patient communication. The "gold standard answers" for all questions were established through a cross-validation and consensus process involving multiple senior TCM experts. We used TCM-SED to test two representative models: GPT-4o, a leading international general-purpose large model, and DeepSeek-R1, a large model primarily trained on Chinese corpora.
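As an illustration only, the sketch below shows one way accuracy on the multiple-choice section could be computed against gold standard answers. The abstract does not describe the scoring rule, so the exact-match rule for multiple-selection items, the `ChoiceQuestion` record, and the sample items are all assumptions made for this example, not the authors' method or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChoiceQuestion:
    """Hypothetical record for one multiple-choice item."""
    qid: str
    gold: frozenset  # option labels in the expert gold standard answer


def is_correct(question, predicted):
    # Assumed rule: a response counts only if the selected options match
    # the gold set exactly (no partial credit for multiple selection).
    return frozenset(predicted) == question.gold


def accuracy(questions, predictions):
    """Return accuracy in percent; unanswered items count as wrong."""
    if not questions:
        return 0.0
    correct = sum(is_correct(q, predictions.get(q.qid, ())) for q in questions)
    return 100.0 * correct / len(questions)


# Made-up items and responses, for illustration only.
questions = [
    ChoiceQuestion("q1", frozenset({"A"})),
    ChoiceQuestion("q2", frozenset({"B", "D"})),
    ChoiceQuestion("q3", frozenset({"C"})),
    ChoiceQuestion("q4", frozenset({"A", "C"})),
]
predictions = {"q1": ["A"], "q2": ["D", "B"], "q3": ["C"], "q4": ["A"]}

result = accuracy(questions, predictions)
print(result)
```

Under this assumed rule, the partially correct response to `q4` scores zero; a partial-credit scheme would give a different accuracy for the same responses.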

Results:

The test results revealed a differentiation in the capabilities of the two models across different cognitive-level tasks. In the objective question sections, which emphasize precise knowledge recall and discrimination, DeepSeek-R1 outperformed GPT-4o comprehensively, with an accuracy lead of over 17 percentage points in the multiple-choice section (70.07% vs. 52.55%). Conversely, in the essay question section, which requires knowledge integration, complex reasoning, and long-text generation, GPT-4o's performance significantly surpassed that of DeepSeek-R1. For instance, in the "Interpretation of Classic Texts" category, GPT-4o achieved a scoring rate of 90.5%, far exceeding DeepSeek-R1's 73.5%.
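The gaps quoted above are differences in percentage points, not relative changes. A minimal sketch of that arithmetic, using only the figures reported in this abstract (the dictionary layout and the `point_gap` helper are illustrative names, not part of the study):

```python
# Figures as reported in the abstract, in percent.
reported = {
    "multiple_choice_accuracy": {"DeepSeek-R1": 70.07, "GPT-4o": 52.55},
    "classic_texts_scoring_rate": {"DeepSeek-R1": 73.5, "GPT-4o": 90.5},
}


def point_gap(scores, leader, other):
    """Difference between two percentages, in percentage points."""
    return round(scores[leader] - scores[other], 2)


mc_gap = point_gap(reported["multiple_choice_accuracy"], "DeepSeek-R1", "GPT-4o")
essay_gap = point_gap(reported["classic_texts_scoring_rate"], "GPT-4o", "DeepSeek-R1")

print(mc_gap)     # 17.52
print(essay_gap)  # 17.0
```

The two gaps are of similar size but point in opposite directions, which is the contrast the results describe.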

Conclusions:

This study demonstrates that large models trained with a Chinese-centric corpus have a significant advantage in handling "static knowledge" tasks within the TCM domain, whereas leading general-purpose models exhibit stronger capabilities in complex tasks requiring "dynamic reasoning" and content generation. The successfully constructed TCM-SED not only provides an effective quantitative tool for evaluating and selecting appropriate LLMs for various TCM scenarios but also offers a valuable data foundation and a new research direction for their future optimization and alignment.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.