JMIR Preprints #44597: Chinese Clinical Named Entity Recognition from Electronic Medical Records based on Multi-semantic Features by using RoBERTa-wwm and CNN: Model Development and Validation

Chinese Clinical Named Entity Recognition from Electronic Medical Records based on Multi-semantic Features by using RoBERTa-wwm and CNN: Model Development and Validation

Weijie Wang;
Xiaoying Li;
Huiling Ren;
Dongping Gao;
An Fang

ABSTRACT

Background:

Clinical electronic medical records (EMRs) contain important medical information about patients' anatomy, symptoms, examinations, diagnoses, and medications. Mining this information at scale from large volumes of EMR data has significant reference value for medical research. Because Chinese grammar is complex and Chinese word boundaries are ambiguous, Chinese clinical named entity recognition (CNER) remains a significant challenge. Downstream tasks such as medical entity structuring, medical entity standardization, medical entity relation extraction, and medical knowledge graph construction depend largely on the quality of medical named entity recognition. Accurate CNER results would provide reliable support for building domain knowledge graphs, knowledge bases, and knowledge retrieval systems. They would also offer research ideas for scientists, decision-making references for doctors, and guidance for patients on disease and health management. Accurate CNER is therefore essential.

Objective:

This paper aims to propose a Chinese CNER method that learns semantically enriched representations from multimodal features, so that machines can better understand the deep semantic information in EMRs and the medical information they contain becomes more readable and understandable.

Methods:

First, we used RoBERTa-wwm with dynamic fusion, together with Chinese character features (Five-stroke code, Zheng code, phonological code, and stroke code) extracted by one-dimensional convolutional neural networks (CNNs), to obtain Chinese character semantic features. Next, we converted Chinese characters into square images and used a two-dimensional CNN to obtain Chinese character image features. Finally, we fed the multimodal features into a bidirectional long short-term memory network with conditional random fields (BiLSTM-CRF) to perform Chinese CNER. We compared the model with baseline and existing research models, and we ablated and analyzed the features involved to verify the model's effectiveness.
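
The feature pipeline described in the Methods paragraph can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give kernel sizes, embedding dimensions, or the fusion operator, so this sketch assumes toy symbol embeddings, randomly initialized filters in place of learned ones, and simple concatenation. The function names, the code strings, and the contextual and glyph vectors are hypothetical placeholders standing in for RoBERTa-wwm output and two-dimensional CNN output.

```python
import random
from typing import List

Vector = List[float]


def embed_code(code: str, dim: int = 4) -> List[Vector]:
    """Map each symbol of a character code to a toy, deterministic vector."""
    return [[((ord(ch) * (j + 1)) % 7) / 7.0 for j in range(dim)] for ch in code]


def make_kernels(n_kernels: int, width: int, dim: int, seed: int) -> List[List[Vector]]:
    """Random stand-ins for convolution filters that would normally be learned."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(width)]
            for _ in range(n_kernels)]


def conv1d_max_pool(seq: List[Vector], kernels: List[List[Vector]]) -> Vector:
    """One-dimensional convolution, ReLU, then max pooling over positions."""
    width, dim = len(kernels[0]), len(kernels[0][0])
    # Pad short codes with zero vectors so that at least one window exists.
    padded = seq + [[0.0] * dim for _ in range(max(0, width - len(seq)))]
    pooled = []
    for kernel in kernels:
        best = 0.0  # starting at zero has the same effect as ReLU before the max
        for start in range(len(padded) - width + 1):
            window = padded[start:start + width]
            activation = sum(w * x
                             for k_row, x_row in zip(kernel, window)
                             for w, x in zip(k_row, x_row))
            best = max(best, activation)
        pooled.append(best)
    return pooled


def fuse_character_features(contextual: Vector, codes: List[str], glyph: Vector) -> Vector:
    """Concatenate contextual, code-based, and glyph features (assumed fusion)."""
    fused = list(contextual)
    for index, code in enumerate(codes):
        # One filter bank per code type (an assumption of this sketch).
        kernels = make_kernels(n_kernels=3, width=2, dim=4, seed=index)
        fused.extend(conv1d_max_pool(embed_code(code), kernels))
    fused.extend(glyph)
    return fused


# Hypothetical inputs for a single character.
contextual = [0.1] * 8                   # placeholder for a RoBERTa-wwm vector
codes = ["abcd", "efg", "hao3", "1234"]  # placeholder code strings, not real codes
glyph = [0.2] * 5                        # placeholder for 2D CNN image features
features = fuse_character_features(contextual, codes, glyph)
print(len(features))  # 8 + 4 * 3 + 5 = 25
```

In a full model, one such fused vector per character would form the input sequence to the BiLSTM-CRF tagger.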

Results:

We collected 1,397 CCKS-2019 EMRs containing 23,655 entities in six categories and 2,007 self-annotated EMRs containing 118,643 entities in seven categories. In the experiments, our model outperformed the comparison models, with F1 values of 89.28% and 84.61% on the Yidu-S4K dataset and the self-annotated dataset, respectively. The ablation analysis demonstrated that each feature and method we used could improve entity recognition.
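
For readers unfamiliar with how an F1 value is obtained for named entity recognition, the following sketch shows one common way to compute it. The abstract does not state the matching criterion or tagging scheme, so this sketch assumes BIO tags, strict matching (an entity counts as correct only if both its boundaries and its type match), and micro-averaging. The label names `DIS` and `DRUG` and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a final span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((start, i, label))
            start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            # An I- tag with no matching open span is ignored in this sketch.
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Strict-match F1 over the entities in one tag sequence."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-DIS", "I-DIS", "O", "B-DRUG", "O"]
pred = ["B-DIS", "I-DIS", "O", "O", "B-DRUG"]
print(entity_f1(gold, pred))  # 0.5: one of two entities matches exactly
```

Here the predicted drug entity has the right type but the wrong position, so under strict matching it counts as both a false positive and a false negative.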

Conclusions:

Our proposed CNER method mines richer deep semantic information from EMRs through multimodal embedding with RoBERTa-wwm and CNNs. It enhances the semantic recognition of characters at different levels of granularity and improves generalization by exploiting complementary information across modalities. As a result, machines can better understand EMRs semantically, and the accuracy of the CNER task improves.

Citation

Please cite as:

Wang W, Li X, Ren H, Gao D, Fang A

Chinese Clinical Named Entity Recognition From Electronic Medical Records Based on Multisemantic Features by Using Robustly Optimized Bidirectional Encoder Representation From Transformers Pretraining Approach Whole Word Masking and Convolutional Neural Networks: Model Development and Validation

JMIR Med Inform 2023;11:e44597

DOI: 10.2196/44597

PMID: 37163343

PMCID: 10209791

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 1, 2022

Date Accepted: Mar 31, 2023
