Accepted for/Published in: JMIR Formative Research
Date Submitted: Sep 15, 2024
Date Accepted: Jun 4, 2025
Performance Assessment of ChatGPT-4.0 and ChatGLM Series in Traditional Chinese Medicine for Metabolic Associated Fatty Liver Disease: A Comparative Study
ABSTRACT
Background:
ChatGPT-4.0 (OpenAI) and the ChatGLM series (Tsinghua/Zhipu AI) are novel conversational large language models (LLMs). The ChatGLM series includes three versions: ChatGLM4 (internet connectivity but no knowledge base pre-training), ChatGLM4+Knowledge Base (internet search capabilities combined with knowledge base pre-training), and ChatGLM3-6B (offline knowledge base pre-training but no internet connectivity). The ability of ChatGPT-4.0 and ChatGLM to apply medical knowledge in the Chinese context has been preliminarily verified, but their potential for clinical assistance in traditional Chinese medicine (TCM) remains unknown.
Objective:
This study evaluated four LLMs by providing them with the medical records of 87 cases of metabolic dysfunction-associated fatty liver disease (MAFLD) treated with TCM and querying them about TCM treatment plans. The responses from the four LLMs were scored against predefined criteria along three dimensions: ability in syndrome differentiation and treatment principles, confusion of concepts between TCM and Western medicine, and comprehensive evaluation of question-answering texts (comprising six components: ability to integrate Chinese and Western medicine; ability to formulate treatment plans; health management capacity; disease monitoring ability; self-positioning awareness; and medication safety).
Methods:
Four large language models were tested on 87 cases of metabolic dysfunction-associated fatty liver disease (MAFLD) that had been successfully treated with TCM, and the question-answering texts produced by each model were comprehensively evaluated.
Results:
In the "Ability in syndrome differentiation and treatment principles" module, the performance ranking of the four models was: ChatGLM4+Knowledge Base > ChatGLM4 > ChatGLM3-6B > ChatGPT-4.0. In the assessment of confusion between TCM and Western medicine concepts, ChatGPT-4.0 exhibited conceptual confusion in 32 of 87 cases, whereas the ChatGLM series showed almost none: ChatGLM3-6B had one instance, and ChatGLM4 and ChatGLM4+Knowledge Base had none. In the "Comprehensive evaluation of question-answering texts" module (comprising six components: ability to integrate Chinese and Western medicine; ability to formulate treatment plans; health management capacity; disease monitoring ability; self-positioning awareness; and medication safety), the ranking was: ChatGLM4+Knowledge Base > ChatGPT-4.0 > ChatGLM4 > ChatGLM3-6B.
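To put the conceptual-confusion counts on a common scale, the reported figures can be converted to per-model rates. The following is only an illustrative sketch, not the authors' analysis code: the counts and the total of 87 cases come from the results above, while the names `TOTAL_CASES`, `confusion_counts`, and `confusion_rate` are hypothetical.

```python
# Illustrative sketch only: converts the reported confusion counts into rates.
# Counts are taken from the Results text; all identifiers are hypothetical.
TOTAL_CASES = 87

confusion_counts = {
    "ChatGPT-4.0": 32,
    "ChatGLM3-6B": 1,
    "ChatGLM4": 0,
    "ChatGLM4+Knowledge Base": 0,
}


def confusion_rate(count, total=TOTAL_CASES):
    """Return the fraction of cases in which conceptual confusion occurred."""
    return count / total


rates = {model: confusion_rate(n) for model, n in confusion_counts.items()}

# Print models from most to least frequent confusion.
for model, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {confusion_counts[model]}/{TOTAL_CASES} = {rate:.1%}")
```

On these counts, ChatGPT-4.0 showed confusion in roughly 37% of cases and ChatGLM3-6B in roughly 1%.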
Conclusions:
Our results showed that real-time internet connectivity played a critical role in LLM-assisted TCM diagnosis and treatment, whereas the offline model performed markedly worse in clinical decision support. Moreover, pre-training LLMs with TCM-specific knowledge bases while maintaining internet search capabilities substantially enhanced their diagnostic and therapeutic performance in TCM applications. Importantly, general-purpose LLMs required both domain-specific medical fine-tuning and culturally sensitive adaptation to meet the rigorous standards of TCM clinical practice.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.