Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 4, 2025
Open Peer Review Period: May 4, 2025 - Jun 29, 2025
Date Accepted: Oct 13, 2025
Named Entity Recognition in Chinese Cancer Electronic Medical Records: Development of a Hybrid Neural Network Using a Domain-Specific Bidirectional Encoder Representations from Transformers Model
ABSTRACT
Background:
The unstructured text of Chinese cancer electronic medical records (EMRs) contains valuable medical expertise. Accurate medical entity recognition is crucial for building clinical decision support systems. Named entity recognition (NER) in cancer EMRs typically relies on general models designed for English medical records, with little specialized handling of cancer-specific records and limited application to Chinese medical records.
Objective:
This study proposes a domain-specific NER model to improve the recognition of medical entities in Chinese cancer electronic medical records.
Methods:
De-identified inpatient electronic medical records related to breast cancer were collected from a leading hospital in Beijing. Starting from MC-BERT, the study continued pretraining on a Chinese cancer corpus to build the ChCancerBERT pretrained model. ChCancerBERT is combined with a dilated gated convolutional neural network, bidirectional long short-term memory, a multi-head attention mechanism, and a conditional random field to form a multi-model, multi-level integrated named entity recognition approach.
Results:
This approach effectively extracts medical entity features related to symptoms, signs, tests, treatments, and time in Chinese breast cancer electronic medical records. The proposed model outperforms the baseline model and the other models compared in the experiment, reaching an F1 score of 86.93%, a precision of 87.24%, and a recall of 86.61%. On the CCKS2019 dataset, the model attains a precision of 87.26%, a recall of 87.27%, and an F1 score of 87.26%, surpassing existing models.
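For readers unfamiliar with how precision, recall, and F1 are usually computed for NER, the following is a minimal sketch, assuming BIO-style tags and exact-match entity-level scoring. The abstract does not state the tagging scheme or matching criterion used in the study, so both are assumptions here, and the tag names (`SYM`, `TEST`, `TIME`) and function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence.

    Assumption: tags look like "B-SYM", "I-SYM", or "O"; `end` is exclusive.
    """
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues_span = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues_span:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores: a prediction counts only if span and type match exactly."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with hypothetical labels: the TEST entity is predicted too short,
# so it counts as both a missed gold entity and a spurious prediction.
gold = ["B-SYM", "I-SYM", "O", "B-TEST", "I-TEST", "O", "B-TIME"]
pred = ["B-SYM", "I-SYM", "O", "B-TEST", "O", "O", "B-TIME"]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

Under this strict criterion, a partially correct boundary earns no credit, which is one reason entity-level F1 is typically lower than token-level accuracy.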
Conclusions:
The experiments demonstrate that the approach proposed in this study exhibits excellent performance in named entity recognition within breast cancer electronic medical records. This advancement will further contribute to clinical decision support for cancer treatment and research. Additionally, the study reveals that incorporating domain-specific corpora in clinical named entity recognition tasks can further enhance the performance of BERT models in specialized domains.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.