Accepted for/Published in: JMIR Formative Research
Date Submitted: Dec 26, 2024
Date Accepted: Aug 12, 2025
Identification of Syndrome Types in Pancreatic Cancer Patients from Electronic Medical Record Free Text: Model Development and Validation
ABSTRACT
Background:
Syndrome differentiation plays a crucial role in traditional Chinese medicine (TCM) diagnosis and treatment planning. However, this process is highly dependent on expert experience, which limits systematic standardization.
Objective:
This study aimed to establish a Bidirectional Encoder Representations from Transformers (BERT)-based TCM syndrome differentiation model (PCSD-BERT) and to validate it using in-house pancreatic cancer medical records. The model is intended to digitize expert knowledge, enabling its storage and reuse to support standardized syndrome differentiation in TCM clinical practice.
Methods:
This study retrospectively collected pancreatic cancer case records from the Department of Integrative Oncology at Fudan University Shanghai Cancer Center between 2011 and 2024. Feature engineering was conducted based on relevant guidelines and expert knowledge, and syndromes with at least 500 case records were included for training. PCSD-BERT was trained on masked language modeling (MLM) and multi-class classification tasks, with 10-fold cross-validation to enhance generalizability. PCSD-BERT was compared with language models commonly embedded in existing TCM diagnostic tools (LSTM and Text-CNN), a BERT model without fine-tuning, and several large language models (LLMs) used with prompt engineering, including ChatGPT 4, ChatGPT 4o, ChatGPT o1-Pro, Kimi, Ernie Bot 4.0 Turbo, HuaTuoGPT II, and Zhipu Qingyan. After training, PCSD-BERT's syndrome differentiation performance was evaluated in practical applications using in-house data, and attention mechanism visualizations were used to observe word association patterns in syndrome differentiation tasks. Additionally, integrated gradients were employed to assess the model's capability to associate terms with syndrome labels.
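The label filtering and 10-fold cross-validation steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `(text, label)` record layout, the fixed random seed, and the use of stratified fold assignment are all assumptions made for illustration; only the 500-record threshold and the number of folds come from the text.

```python
import random
from collections import Counter, defaultdict

MIN_RECORDS = 500  # inclusion threshold stated in the Methods
N_FOLDS = 10       # ten-fold cross-validation, as stated in the Methods


def filter_frequent_labels(records, min_records=MIN_RECORDS):
    """Keep (text, label) records whose label has at least min_records cases."""
    counts = Counter(label for _, label in records)
    return [(text, label) for text, label in records if counts[label] >= min_records]


def stratified_folds(records, n_folds=N_FOLDS, seed=0):
    """Assign record indices to folds so each fold has similar label proportions.

    Stratification is an assumption; the source only says "ten-fold".
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(records):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each label round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for held_out, test in enumerate(folds):
        train = [idx for k, fold in enumerate(folds) if k != held_out for idx in fold]
        yield train, test
```

In such a setup, a model would be trained once per split and its scores collected per held-out fold, which is what makes a mean and standard deviation across folds reportable.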
Results:
A total of 6,830 case records were included, covering four syndrome labels. In the test dataset, PCSD-BERT outperformed all baseline models and all LLMs used with prompt engineering, with a Precision of 0.955±0.020, Recall of 0.935±0.039, F1-score of 0.951±0.23, and Accuracy of 0.919±0.025. PCSD-BERT yielded syndrome differentiation results consistent with expert diagnoses across all syndrome categories. Visualization of the attention mechanism indicated that the model effectively identified relationships among TCM terms, constructing accurate inter-word associations. Integrated gradient analysis further revealed a high degree of concordance between the model's predictions and clinical criteria, supporting alignment with TCM diagnostic principles.
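Metrics reported as mean±deviation are typically computed per cross-validation fold and then summarized. The sketch below shows one way this could be done for a multi-class task. Macro-averaging over labels and the use of the sample standard deviation are assumptions, since the abstract does not state how the scores were averaged; the function names are hypothetical.

```python
from statistics import mean, stdev


def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, F1, plus accuracy, for one fold."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "precision": mean(precisions),
        "recall": mean(recalls),
        "f1": mean(f1s),
        "accuracy": accuracy,
    }


def summarize_folds(fold_scores):
    """Return {metric: (mean, sample standard deviation)} across folds."""
    return {
        metric: (
            mean([scores[metric] for scores in fold_scores]),
            stdev([scores[metric] for scores in fold_scores]),
        )
        for metric in fold_scores[0]
    }
```

Calling `macro_scores` on each held-out fold and passing the resulting list to `summarize_folds` gives one (mean, deviation) pair per metric.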
Conclusions:
The PCSD-BERT model accurately identified TCM symptoms and syndrome patterns in medical case records and outperformed both LLMs and the models embedded in existing TCM diagnostic tools in syndrome differentiation. The model represents a preliminary step toward the digital storage and standardized application of expert knowledge, laying a foundation for multimodal integration tasks related to syndrome differentiation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.