Accepted for/Published in: JMIR Formative Research
Date Submitted: Jul 30, 2025
Date Accepted: Nov 21, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Construction and Empirical Study of an Evaluation Dataset for Large Language Models in the Field of TCM Stroke
ABSTRACT
Background:
The application of Large Language Models (LLMs) in the medical field is rapidly advancing. However, effectively and comprehensively evaluating the capabilities of LLMs in specialized domains such as Traditional Chinese Medicine (TCM), which has a unique theoretical system and cognitive framework, remains a significant challenge.
Objective:
This study aims to construct a specialized evaluation dataset for the field of TCM stroke and conduct an empirical study to reveal the capabilities and limitations of different types of LLMs in this domain.
Methods:
We systematically constructed an evaluation tool named the "Traditional Chinese Medicine - Stroke Evaluation Dataset" (TCM-SED). The dataset comprises 203 questions spanning three paradigms: short-answer questions, multiple-choice questions (single- and multiple-selection), and essay questions. It covers multiple dimensions of TCM stroke knowledge, including diagnosis, pattern differentiation and treatment, herbal formulas, acupuncture, interpretation of classic texts, and patient communication. The gold-standard answers for all questions were established through cross-validation and a consensus process involving multiple senior TCM experts. We used TCM-SED to comprehensively test two representative models: GPT-4o, a leading international general-purpose large model, and DeepSeek-R1, a large model trained primarily on Chinese corpora.
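For illustration, the sketch below shows one possible way a TCM-SED item could be represented in code. The field names, class name, and example content are assumptions for exposition; the paper does not specify a serialization format or schema.

```python
# Hypothetical representation of a single TCM-SED item.
# Field names and values are illustrative assumptions, not the authors' schema.
from dataclasses import dataclass, field

PARADIGMS = {"short_answer", "single_choice", "multiple_choice", "essay"}
DIMENSIONS = {
    "diagnosis", "pattern_differentiation_and_treatment",
    "herbal_formulas", "acupuncture",
    "classic_text_interpretation", "patient_communication",
}

@dataclass
class TCMSEDItem:
    item_id: str
    paradigm: str          # one of PARADIGMS
    dimension: str         # one of DIMENSIONS
    question: str
    options: list[str] = field(default_factory=list)  # empty for open-ended items
    gold_answer: str = ""  # expert-consensus reference answer

    def __post_init__(self) -> None:
        # Guard against items outside the dataset's stated taxonomy.
        assert self.paradigm in PARADIGMS, f"unknown paradigm: {self.paradigm}"
        assert self.dimension in DIMENSIONS, f"unknown dimension: {self.dimension}"

# Example single-choice item (content invented for illustration only):
item = TCMSEDItem(
    item_id="MC-001",
    paradigm="single_choice",
    dimension="herbal_formulas",
    question="Which formula is classically indicated for post-stroke "
             "qi deficiency with blood stasis?",
    options=["A. Buyang Huanwu Tang", "B. Liuwei Dihuang Wan",
             "C. Xiao Chaihu Tang", "D. Guizhi Tang"],
    gold_answer="A",
)
```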
Results:
The test results revealed a clear divergence between the two models across tasks at different cognitive levels. On the objective question sections, which emphasize precise knowledge recall and discrimination, DeepSeek-R1 consistently outperformed GPT-4o, with an accuracy lead of more than 17 percentage points on the multiple-choice section (70.07% vs. 52.55%). Conversely, on the essay question section, which requires knowledge integration, complex reasoning, and long-text generation, GPT-4o significantly surpassed DeepSeek-R1. In the "Interpretation of Classic Texts" category, for instance, GPT-4o achieved a scoring rate of 90.5%, far exceeding DeepSeek-R1's 73.5%.
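As a worked check of the reported figures, the two headline metrics can be computed along the lines below. This is a sketch under the assumption that accuracy is correct answers over total items and that the essay "scoring rate" is expert-awarded points over points available; the paper's exact rubric is not reproduced here.

```python
# Sketch of the two headline metrics; the exact scoring rubric is an assumption.
def accuracy(correct: int, total: int) -> float:
    """Fraction of objective items answered correctly, as a percentage."""
    return 100.0 * correct / total

def scoring_rate(points_awarded: float, points_available: float) -> float:
    """Expert-assigned points as a share of the maximum, as a percentage."""
    return 100.0 * points_awarded / points_available

# Reproducing the reported multiple-choice gap:
# 70.07% - 52.55% = 17.52 percentage points in DeepSeek-R1's favor.
gap = 70.07 - 52.55
print(f"Multiple-choice accuracy gap: {gap:.2f} percentage points")
```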
Conclusions:
This study demonstrates that large models trained primarily on Chinese corpora hold a significant advantage on "static knowledge" tasks within the TCM domain, whereas leading general-purpose models exhibit stronger capabilities on complex tasks requiring "dynamic reasoning" and content generation. The TCM-SED not only provides an effective quantitative tool for evaluating and selecting appropriate LLMs for various TCM scenarios but also offers a valuable data foundation and a new research direction for their future optimization and alignment.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.