Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 8, 2024
Open Peer Review Period: May 7, 2024 - Jul 2, 2024
Date Accepted: Oct 13, 2024
Chinese Clinical Named Entity Recognition with Segmentation Synonym Sentence Synthesis Mechanism: Algorithm Development and Validation
ABSTRACT
Background:
Clinical named entity recognition (CNER) is a fundamental task in natural language processing used to extract named entities from electronic medical record texts. In recent years, with the continuous development of machine learning, deep learning models have replaced traditional machine learning and template-based methods and are now widely applied in the clinical NER field. However, because clinical texts are complex, named entity types are numerous and diverse, and the boundaries between entities are often unclear, existing advanced methods still depend to some extent on annotated databases and on the scale of embedded dictionaries.
Objective:
This study aims to address the issues of data scarcity and labeling difficulties in clinical named entity recognition tasks by proposing a dataset augmentation algorithm based on proximity word calculation.
Methods:
We propose a Segmentation Synonym Sentence Synthesis (SSSS) algorithm based on neighboring vocabulary, which leverages existing public knowledge and does not require manual expansion of specialized domain dictionaries. After lexical segmentation, the algorithm substitutes words with synonymous neighbors drawn from large-scale natural language data and recombines them into new sentences, expanding the dataset with closely related expressions. We applied the SSSS algorithm to the RoBERTa+CRF and RoBERTa+BiLSTM+CRF models and evaluated the resulting models (SSSS+RoBERTa+CRF and SSSS+RoBERTa+BiLSTM+CRF) on the CCKS-2017 and CCKS-2019 datasets.
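The general idea of segment-then-substitute augmentation can be illustrated with a minimal sketch. This is not the authors' implementation: the neighbor table `NEIGHBORS`, the function `augment`, the `SYMPTOM` tag, and the choice to replace only words outside entity spans (so that the character-level BIO labels can be regenerated safely) are all assumptions made for illustration. The input is assumed to be already segmented into words, each carrying either "O" or an entity type.

```python
from itertools import product

# Hypothetical neighbor table. In practice such neighbors might be derived
# from word embeddings trained on general-domain text; these entries are
# illustrative only and are not taken from the paper.
NEIGHBORS = {
    "出现": ["发生"],
    "明显": ["显著"],
}

def augment(words, tags, neighbors, limit=10):
    """Synthesize sentence variants by swapping non-entity words for neighbors.

    words: segmented words of one sentence.
    tags:  one tag per word, "O" or an entity type such as "SYMPTOM".
           Assumption: each entity is exactly one segmented word.
    Returns a list of (sentence, character-level BIO labels) pairs.
    The first pair is always the original sentence.
    """
    # For every word, list the original plus any candidate replacements.
    options = []
    for word, tag in zip(words, tags):
        if tag == "O":
            options.append([word] + neighbors.get(word, []))
        else:
            options.append([word])  # entity words are left untouched

    variants = []
    for combo in product(*options):
        chars, labels = [], []
        for word, tag in zip(combo, tags):
            for i, ch in enumerate(word):
                chars.append(ch)
                if tag == "O":
                    labels.append("O")
                else:
                    labels.append(("B-" if i == 0 else "I-") + tag)
        variants.append(("".join(chars), labels))
        if len(variants) >= limit:
            break
    return variants

words = ["患者", "出现", "明显", "腹痛"]
tags = ["O", "O", "O", "SYMPTOM"]
variants = augment(words, tags, NEIGHBORS)
for sentence, labels in variants:
    print(sentence, labels)
```

With two replaceable words that each have one neighbor, this toy sentence yields four variants (the original plus three synthesized sentences), which shows how the number of training sentences can grow multiplicatively with the number of substitutable words.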
Results:
The SSSS algorithm expanded the documents of CCKS-2017 and CCKS-2019 by approximately 17 and 20 times, respectively. In our experiments, the SSSS+RoBERTa+CRF and SSSS+RoBERTa+BiLSTM+CRF models achieved F1 scores of 91.30% and 91.35%, respectively, on the CCKS-2017 dataset and 83.21% and 83.01%, respectively, on the CCKS-2019 dataset.
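For readers unfamiliar with how an F1 score is obtained for NER output, the following is a minimal sketch of entity-level F1 under strict span matching. The abstract does not state the exact evaluation protocol, so strict matching, the helper names `spans` and `f1`, and the `SYMPTOM`/`DRUG` tags are assumptions for illustration only.

```python
def spans(labels):
    """Extract (start, end, type) entity spans from BIO labels; end is exclusive."""
    found, start, kind = set(), None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, lab in enumerate(list(labels) + ["O"]):
        continues = lab.startswith("I-") and kind == lab[2:]
        if start is not None and not continues:
            found.add((start, i, kind))
            start, kind = None, None
        if lab.startswith("B-"):
            start, kind = i, lab[2:]
    return found

def f1(gold, pred):
    """Strict-match F1: a predicted span counts only if both its boundaries
    and its type equal those of a gold span."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG", "I-DRUG"]
pred = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG", "O"]
score = f1(gold, pred)
print(score)
```

In this example one of two predicted entities matches exactly (the predicted drug span is too short), so precision and recall are both 0.5. A corpus-level score would pool the span counts over all sentences before computing precision and recall.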
Conclusions:
The experimental results indicate that our proposed method successfully expanded the dataset and significantly improved the performance of the model, effectively addressing the challenges of data acquisition, annotation difficulties, and insufficient model generalization performance.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.