Accepted for/Published in: JMIR Formative Research
Date Submitted: Jul 30, 2025
Date Accepted: Nov 21, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Construction and Empirical Study of an Evaluation Dataset for Large Language Models in the Field of TCM Stroke
ABSTRACT
Background:
The application of Large Language Models (LLMs) in the medical field is rapidly advancing. However, effectively and comprehensively evaluating the capabilities of LLMs in specialized domains such as Traditional Chinese Medicine (TCM), which has a unique theoretical system and cognitive framework, remains a significant challenge.
Objective:
This study aims to construct a specialized evaluation dataset for the field of TCM stroke and conduct an empirical study to reveal the capabilities and limitations of different types of LLMs in this domain.
Methods:
We systematically constructed an evaluation tool named the "Traditional Chinese Medicine - Stroke Evaluation Dataset" (TCM-SED). The dataset comprises 203 questions spanning three paradigms: short-answer questions, multiple-choice questions (single- and multiple-selection), and essay questions. It covers multiple dimensions of TCM stroke knowledge, including diagnosis, pattern differentiation and treatment, herbal formulas, acupuncture, interpretation of classic texts, and patient communication. The gold-standard answers for all questions were established through cross-validation and a consensus process involving multiple senior TCM experts. We used TCM-SED to comprehensively test two representative models: GPT-4o, a leading international general-purpose large model, and DeepSeek-R1, a large model trained primarily on Chinese corpora.
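For illustration, the sketch below shows one possible way a TCM-SED item could be represented in code. The field names, class name, and example content are assumptions for exposition; the paper does not specify a serialization format or schema.

```python
# Hypothetical representation of a single TCM-SED item.
# Field names and values are illustrative assumptions, not the authors' schema.
from dataclasses import dataclass, field

PARADIGMS = {"short_answer", "single_choice", "multiple_choice", "essay"}
DIMENSIONS = {
    "diagnosis", "pattern_differentiation_and_treatment",
    "herbal_formulas", "acupuncture",
    "classic_text_interpretation", "patient_communication",
}

@dataclass
class TCMSEDItem:
    item_id: str
    paradigm: str          # one of PARADIGMS
    dimension: str         # one of DIMENSIONS
    question: str
    options: list[str] = field(default_factory=list)  # empty for open-ended items
    gold_answer: str = ""  # expert-consensus reference answer

    def __post_init__(self) -> None:
        # Guard against items outside the dataset's stated taxonomy.
        assert self.paradigm in PARADIGMS, f"unknown paradigm: {self.paradigm}"
        assert self.dimension in DIMENSIONS, f"unknown dimension: {self.dimension}"

# Example single-choice item (content invented for illustration only):
item = TCMSEDItem(
    item_id="MC-001",
    paradigm="single_choice",
    dimension="herbal_formulas",
    question="Which formula is classically indicated for post-stroke "
             "qi deficiency with blood stasis?",
    options=["A. Buyang Huanwu Tang", "B. Liuwei Dihuang Wan",
             "C. Xiao Chaihu Tang", "D. Guizhi Tang"],
    gold_answer="A",
)
```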
Results:
The test results revealed a clear divergence between the two models across tasks at different cognitive levels. On the objective question sections, which emphasize precise knowledge recall and discrimination, DeepSeek-R1 consistently outperformed GPT-4o, with an accuracy lead of more than 17 percentage points on the multiple-choice section (70.07% vs. 52.55%). Conversely, on the essay question section, which requires knowledge integration, complex reasoning, and long-text generation, GPT-4o significantly surpassed DeepSeek-R1. In the "Interpretation of Classic Texts" category, for instance, GPT-4o achieved a scoring rate of 90.5%, far exceeding DeepSeek-R1's 73.5%.
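As a worked check of the reported figures, the two headline metrics can be computed along the lines below. This is a sketch under the assumption that accuracy is correct answers over total items and that the essay "scoring rate" is expert-awarded points over points available; the paper's exact rubric is not reproduced here.

```python
# Sketch of the two headline metrics; the exact scoring rubric is an assumption.
def accuracy(correct: int, total: int) -> float:
    """Fraction of objective items answered correctly, as a percentage."""
    return 100.0 * correct / total

def scoring_rate(points_awarded: float, points_available: float) -> float:
    """Expert-assigned points as a share of the maximum, as a percentage."""
    return 100.0 * points_awarded / points_available

# Reproducing the reported multiple-choice gap:
# 70.07% - 52.55% = 17.52 percentage points in DeepSeek-R1's favor.
gap = 70.07 - 52.55
print(f"Multiple-choice accuracy gap: {gap:.2f} percentage points")
```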
Conclusions:
This study demonstrates that large models trained primarily on Chinese corpora hold a significant advantage on "static knowledge" tasks within the TCM domain, whereas leading general-purpose models exhibit stronger capabilities on complex tasks requiring "dynamic reasoning" and content generation. The TCM-SED not only provides an effective quantitative tool for evaluating and selecting appropriate LLMs for various TCM scenarios but also offers a valuable data foundation and a new research direction for their future optimization and alignment.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.