Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 1, 2022
Date Accepted: Mar 31, 2023

The final, peer-reviewed published version of this preprint can be found here:

Chinese Clinical Named Entity Recognition From Electronic Medical Records Based on Multisemantic Features by Using Robustly Optimized Bidirectional Encoder Representation From Transformers Pretraining Approach Whole Word Masking and Convolutional Neural Networks: Model Development and Validation

Wang W, Li X, Ren H, Gao D, Fang A

JMIR Med Inform 2023;11:e44597

DOI: 10.2196/44597

PMID: 37163343

PMCID: 10209791

Chinese Clinical Named Entity Recognition from Electronic Medical Records based on Multi-semantic Features by using RoBERTa-wwm and CNN: Model Development and Validation

  • Weijie Wang
  • Xiaoying Li
  • Huiling Ren
  • Dongping Gao
  • An Fang

ABSTRACT

Background:

Clinical electronic medical records (EMRs) contain important medical information about patients' anatomy, symptoms, examinations, diagnoses, and medications. Large-scale mining of this information from massive amounts of EMR data has significant reference value for medical research. Because Chinese grammar is complex and Chinese word boundaries are ambiguous, Chinese clinical named entity recognition (CNER) remains a significant challenge. Downstream tasks, such as medical entity structuring, medical entity standardization, medical entity relationship extraction, and medical knowledge graph construction, depend largely on the quality of medical named entity recognition. Accurate CNER results would provide reliable support for building domain knowledge graphs, knowledge bases, and knowledge retrieval systems. They would also provide research ideas for scientists and decision-making references for doctors, and could even guide patients in disease and health management. Accurate CNER is therefore essential.

Objective:

This study aims to propose a Chinese CNER method that learns semantically enriched representations from multimodal features, improving machine understanding of the deep semantic information in EMRs and making medical information more readable and understandable.

Methods:

First, we used RoBERTa-wwm with dynamic fusion, together with Chinese character features (Five-stroke code, Zheng code, phonological code, and stroke code) extracted by one-dimensional convolutional neural networks (CNNs), to obtain Chinese character semantic features. Next, we converted Chinese characters into square images and used a two-dimensional CNN to obtain Chinese character image features. Finally, we fed the multimodal features into a bidirectional long short-term memory network with conditional random fields (BiLSTM-CRF) to perform Chinese CNER. We compared the model with baseline models and models from existing research, and we ran ablation experiments on the features used in the model to verify the contribution of each.
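As a rough illustration of the one-dimensional CNN step, the sketch below embeds the symbols of a single character's input-method code, slides a few filters over them, and max-pools over positions to get a fixed-length feature vector. It is not the authors' implementation: the function names, the dimensions, the randomly initialized weights, the placeholder code string "ukqb", and the simple concatenation with a stand-in contextual vector are all assumptions made for illustration. The abstract does not specify how the features are fused or what hyperparameters are used.

```python
import random

def embed_code(code, table, dim, rng):
    """Look up (or lazily create) a small vector for each symbol of a code string."""
    vectors = []
    for symbol in code:
        if symbol not in table:
            table[symbol] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        vectors.append(table[symbol])
    return vectors

def conv1d_max_pool(vectors, filters, width):
    """Apply each filter to every window of `width` symbols, then ReLU and max-pool."""
    dim = len(vectors[0])
    # Pad short codes with zero vectors so that at least one window exists.
    padded = vectors + [[0.0] * dim] * max(0, width - len(vectors))
    pooled = []
    for weights in filters:  # each filter is a flat list of width * dim weights
        best = 0.0  # starting at 0 makes the running max equal to max-pooled ReLU
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(weights, window))
            best = max(best, activation)
        pooled.append(best)
    return pooled

# Hypothetical setup: untrained random weights and a placeholder code string.
rng = random.Random(0)
dim, width, n_filters = 4, 2, 3
table = {}
filters = [[rng.uniform(-1.0, 1.0) for _ in range(width * dim)]
           for _ in range(n_filters)]
code_feature = conv1d_max_pool(embed_code("ukqb", table, dim, rng), filters, width)

# Stand-in for a contextual character vector from a pretrained encoder (assumption).
contextual = [0.1, -0.2, 0.3]
fused = contextual + code_feature  # plain concatenation, one possible fusion choice
```

In a trained model the embeddings and filter weights would be learned jointly with the tagger, and one such feature vector would be produced per code type for every character in the sentence.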

Results:

We collected 1,397 EMRs from the CCKS-2019 (Yidu-S4K) dataset, containing 23,655 entities in six categories, and 2,007 self-annotated EMRs containing 118,643 entities in seven categories. The experiments showed that our model outperformed the baseline and comparison models, with F1 values of 89.28% and 84.61% on the Yidu-S4K and self-annotated datasets, respectively. The ablation analysis showed that each feature and method we used improved entity recognition performance.
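For readers unfamiliar with how an F1 value for entity recognition is typically obtained, the sketch below computes entity-level precision, recall, and F1 from BIO tag sequences. It assumes strict matching, where a predicted entity counts as correct only if its boundaries and type both match the gold annotation. The abstract does not state the exact matching criterion or tagging scheme used in the study, and the tag names here are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((start, i, label))  # end index is exclusive
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None  # "O" or a stray "I-" tag opens nothing
    return spans

def precision_recall_f1(gold_tags, pred_tags):
    """Strict entity-level scores for one tag sequence (or a concatenated corpus)."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical example: the second entity is predicted with a wrong right boundary.
gold = ["B-DISEASE", "I-DISEASE", "O", "B-DRUG", "I-DRUG"]
pred = ["B-DISEASE", "I-DISEASE", "O", "B-DRUG", "O"]
scores = precision_recall_f1(gold, pred)  # (0.5, 0.5, 0.5)
```

Under strict matching a boundary error counts as both a false positive and a false negative, which is why the one mistake above halves both precision and recall.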

Conclusions:

By using multimodal embeddings built with RoBERTa-wwm and CNNs, our proposed CNER method mines richer deep semantic information from EMRs. It enhances the semantic recognition of characters at different levels of granularity and improves generalization by making the information in different modalities complementary. As a result, machines can better understand EMRs semantically, and CNER accuracy improves.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.