Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 4, 2025
Open Peer Review Period: May 4, 2025 - Jun 29, 2025
Date Accepted: Oct 13, 2025

The final, peer-reviewed published version of this preprint can be found here:

Named Entity Recognition for Chinese Cancer Electronic Health Records—Development and Evaluation of a Domain-Specific BERT Model: Quantitative Study

Junbai C, Zhao B, Tian X, Zou Z, Wang R, Wu J, Du S, Guo F

JMIR Med Inform 2025;13:e76912

DOI: 10.2196/76912

PMID: 41237336

PMCID: 12620309

Named Entity Recognition in Chinese Cancer Electronic Medical Records: Development of a Hybrid Neural Network Using a Domain-Specific Bidirectional Encoder Representations from Transformers Model

  • Chen Junbai
  • Butian Zhao
  • Xiaohan Tian
  • Zhengkai Zou
  • Ruojia Wang
  • Jiarui Wu
  • Songxing Du
  • Fengying Guo

ABSTRACT

Background:

Unstructured data in Chinese cancer electronic medical records (EMRs) contain valuable medical expertise, and accurate medical entity recognition is crucial for building medical decision support systems. However, named entity recognition (NER) for cancer EMRs typically relies on general-purpose models designed for English medical records. Models tailored to cancer-specific records are lacking, and application to Chinese medical records remains limited.

Objective:

This study proposes a domain-specific NER model to improve the recognition of medical entities in Chinese cancer EMRs.

Methods:

De-identified inpatient EMRs related to breast cancer were collected from a leading hospital in Beijing. Starting from MC-BERT, the study continued pretraining on a Chinese cancer corpus to construct the ChCancerBERT pretrained model. ChCancerBERT was then combined with dilated gated convolutional neural networks, bidirectional long short-term memory, a multi-head attention mechanism, and a conditional random field to form a multi-model, multi-level integrated NER approach.
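
The last stage named above, the conditional random field, is commonly decoded with the Viterbi algorithm, which selects the highest-scoring tag sequence given per-token scores and tag-to-tag transition scores. The following is a minimal pure-Python sketch of that decoding step only, not the authors' implementation. The tag names, the score values, and the function name `viterbi_decode` are illustrative assumptions; in the described model the per-token scores would come from the upstream ChCancerBERT, convolutional, BiLSTM, and attention layers.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag] is the score of `tag` at token t (missing = -inf);
    transitions[(prev, cur)] is the score of moving prev -> cur (missing = 0).
    """
    if not emissions:
        return []
    # best[tag] = score of the best path that ends in `tag` at the current token
    best = {tag: emissions[0].get(tag, NEG_INF) for tag in tags}
    backpointers: List[Dict[str, str]] = []
    for scores_t in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Best predecessor tag for `cur`, including the transition score
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = (
                best[prev]
                + transitions.get((prev, cur), 0.0)
                + scores_t.get(cur, NEG_INF)
            )
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path
    path = [max(tags, key=lambda tag: best[tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tags and scores for a two-token symptom mention (assumed values).
TAGS = ["O", "B-SYMPTOM", "I-SYMPTOM"]
# An entity cannot continue (I-) directly after O, so that move is heavily penalized.
TRANSITIONS = {("O", "I-SYMPTOM"): -1e4}
EMISSIONS = [
    {"O": 1.0, "B-SYMPTOM": 0.8, "I-SYMPTOM": 0.0},
    {"O": 0.0, "B-SYMPTOM": 0.1, "I-SYMPTOM": 1.0},
]
decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(decoded)
```

In this toy input, picking the top-scoring tag for each token independently would give the invalid sequence O, I-SYMPTOM; the transition penalty steers decoding to B-SYMPTOM, I-SYMPTOM instead, which is the kind of label-consistency constraint a CRF layer contributes.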

Results:

This approach effectively extracts medical entity features related to symptoms, signs, tests, treatments, and time in Chinese breast cancer EMRs. The proposed model outperformed the baseline model and the other models compared in the experiment, reaching a precision of 87.24%, a recall of 86.61%, and an F1 score of 86.93%. On the CCKS2019 dataset, the model attained a precision of 87.26%, a recall of 87.27%, and an F1 score of 87.26%, surpassing existing models.
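
For reference, the F1 score is the harmonic mean of precision and recall. The abstract does not state the matching criterion, so the sketch below assumes the common entity-level, exact-match convention over BIO tags. The tag names and the helper functions (`extract_spans`, `precision_recall_f1`, `f1_from`) are hypothetical. The last line also recomputes F1 from the precision and recall figures reported above.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, exclusive end index)


def extract_spans(tags: Iterable[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    An I- tag that does not continue a span of the same type is ignored.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last span
        if tag.startswith("I-") and tag[2:] == etype:
            continue  # still inside the current span
        if etype is not None:
            spans.add((etype, start, i))
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level scores: a predicted span counts only on exact type and boundary match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical gold and predicted tags for one seven-token sentence.
gold_tags = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-TEST", "O", "B-TIME", "I-TIME"]
pred_tags = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-TEST", "I-TEST", "B-TIME", "I-TIME"]
p, r, f1 = precision_recall_f1(extract_spans(gold_tags), extract_spans(pred_tags))
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")

# Recompute F1 from the precision/recall percentages reported in the abstract.
print(f1_from(87.24, 86.61), f1_from(87.26, 87.27))
```

In the toy sentence the predicted TEST span is one token too long, so only two of three spans match and precision, recall, and F1 all equal 2/3. The harmonic mean of 87.24 and 86.61 is about 86.92, which agrees with the reported 86.93% to within rounding, since the published precision and recall are themselves rounded.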

Conclusions:

The experiments demonstrate that the proposed approach performs well in NER on breast cancer EMRs, which can further support clinical decision-making for cancer treatment and research. The study also indicates that incorporating domain-specific corpora can further improve the performance of BERT models on clinical NER tasks in specialized domains.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.