JMIR Preprints #60334: Chinese Clinical Named Entity Recognition with Segmentation Synonym Sentence Synthesis Mechanism: Algorithm Development and Validation

Chinese Clinical Named Entity Recognition with Segmentation Synonym Sentence Synthesis Mechanism: Algorithm Development and Validation

Jian Tang;
Zikun Huang;
Hongzhen Xu;
Hao Zhang;
Hailing Huang;
Minqiong Tang;
Pengsheng Luo;
Dong Qin

ABSTRACT

Background:

Clinical named entity recognition (CNER) is a fundamental natural language processing task that extracts named entities from electronic medical record texts. In recent years, with the continuous development of machine learning, deep learning models have replaced traditional machine learning and template-based methods and are now widely applied in CNER. However, because clinical texts are complex, named entity types are numerous and diverse, and the boundaries between entities are often unclear, existing advanced methods still depend to some extent on annotated databases and on the scale of embedded dictionaries.

Objective:

This study aims to address data scarcity and labeling difficulty in CNER tasks by proposing a dataset augmentation algorithm based on proximity word calculation.

Methods:

We propose a Segmentation Synonym Sentence Synthesis (SSSS) algorithm based on neighboring vocabulary, which leverages existing public knowledge and requires no manual expansion of specialized domain dictionaries. After lexical segmentation, the algorithm substitutes synonymous words drawn from large natural language corpora and recombines them into new sentences, expanding the dataset with expressions close to the originals. We applied the SSSS algorithm to the RoBERTa+CRF and RoBERTa+BiLSTM+CRF models and evaluated the resulting models (SSSS+RoBERTa+CRF and SSSS+RoBERTa+BiLSTM+CRF) on the CCKS-2017 and CCKS-2019 datasets.
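
The segment, substitute, and recombine idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the neighbor table NEIGHBORS, the helper names, the SYMPTOM label, the character-level BIO labeling, and the rule that entity words are never replaced are all assumptions made for illustration. In practice the neighbors would presumably come from a nearest-neighbor lookup over publicly available word embeddings rather than a hand-written table.

```python
from itertools import product

# Hypothetical toy neighbor table (an assumption for this sketch).
NEIGHBORS = {
    "出现": ["发生"],
    "明显": ["显著"],
}


def char_labels(word, tag):
    """Expand one segmented word into character-level BIO labels."""
    if tag == "O":
        return [(ch, "O") for ch in word]
    return [(ch, ("B-" if i == 0 else "I-") + tag) for i, ch in enumerate(word)]


def synthesize(segments, neighbors, limit=20):
    """Build augmented sentences from a segmented, labeled sentence.

    segments: list of (word, tag) pairs, where tag is "O" or an entity type.
    Returns a list of sentences, each a list of (character, BIO label) pairs.
    The first result is always the original sentence.
    """
    options = []
    for word, tag in segments:
        if tag == "O":
            # Non-entity words may be swapped for any listed neighbor.
            options.append([word] + neighbors.get(word, []))
        else:
            # Entity words are kept so their annotations stay valid.
            options.append([word])

    results = []
    for combo in product(*options):
        if len(results) >= limit:
            break
        labeled = []
        for word, (_, tag) in zip(combo, segments):
            labeled.extend(char_labels(word, tag))
        results.append(labeled)
    return results


# Hypothetical example sentence, already segmented and labeled.
segments = [("患者", "O"), ("出现", "O"), ("明显", "O"), ("胸痛", "SYMPTOM")]
augmented = synthesize(segments, NEIGHBORS)
texts = ["".join(ch for ch, _ in sent) for sent in augmented]
```

Because only non-entity words are substituted, every synthesized sentence inherits the entity annotations of its source sentence, so no additional manual labeling is needed in this sketch. The limit parameter caps the number of combinations per sentence, since the count grows multiplicatively with the number of replaceable words.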

Results:

The SSSS algorithm expanded the documents of CCKS-2017 and CCKS-2019 by approximately 17 and 20 times, respectively. The SSSS+RoBERTa+CRF and SSSS+RoBERTa+BiLSTM+CRF models achieved F1 scores of 91.30% and 91.35%, respectively, on the CCKS-2017 dataset and 83.21% and 83.01%, respectively, on the CCKS-2019 dataset.
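
For readers unfamiliar with how an F1 score is typically computed for named entity recognition, the sketch below shows one common convention: exact-match, entity-level scoring over BIO label sequences. The abstract does not state which matching criterion the authors used, so the strict matching rule, the function names, and the example labels are assumptions.

```python
def extract_entities(labels):
    """Collect (start, end, type) spans from a BIO label sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes an entity that ends the sequence.
    for i, label in enumerate(labels + ["O"]):
        inside = label.startswith("I-") and etype == label[2:]
        if start is not None and not inside:
            entities.add((start, i, etype))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return entities


def entity_f1(gold_labels, pred_labels):
    """Exact-match precision, recall, and F1 over entity spans."""
    gold = extract_entities(gold_labels)
    pred = extract_entities(pred_labels)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels: the second predicted entity has a wrong boundary.
gold = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-BODY", "I-BODY"]
pred = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-BODY", "O"]
precision, recall, f1 = entity_f1(gold, pred)
```

Under this strict convention, a predicted entity counts as correct only when both its boundaries and its type match the gold annotation, which is why unclear entity boundaries in clinical text directly lower the score.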

Conclusions:

The experimental results indicate that the proposed method successfully expanded the dataset and significantly improved model performance, effectively addressing the challenges of data acquisition, annotation difficulty, and insufficient model generalization.

Citation

Please cite as:

Tang J, Huang Z, Xu H, Zhang H, Huang H, Tang M, Luo P, Qin D

Chinese Clinical Named Entity Recognition With Segmentation Synonym Sentence Synthesis Mechanism: Algorithm Development and Validation

JMIR Med Inform 2024;12:e60334

DOI: 10.2196/60334

PMID: 39622697

PMCID: 11612518

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 8, 2024

Open Peer Review Period: May 7, 2024 - Jul 2, 2024

Date Accepted: Oct 13, 2024
