Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 31, 2020
Date Accepted: Oct 13, 2020
Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training using Multi-Task Learning
ABSTRACT
Background:
Although electronic health records (EHRs) have been widely adopted in healthcare, effective use of EHR data is often limited by redundant information in clinical notes, introduced by the use of templates and copy-paste during note generation. Thus, it is imperative to develop solutions that can condense information while retaining its value. A step in this direction is measuring the semantic similarity between clinical text snippets. To address this problem, we participated in the 2019 National NLP Clinical Challenges (n2c2) / Open Health Natural Language Processing Consortium (OHNLP) Clinical Semantic Textual Similarity (ClinicalSTS) shared task.
Objective:
This study aims to improve the performance and robustness of semantic textual similarity in the clinical domain by leveraging manually labeled data from related tasks and contextualized embeddings from pretrained transformer-based language models.
Methods:
The ClinicalSTS dataset consists of 1,642 pairs of de-identified clinical text snippets annotated on a continuous scale of 0-5 indicating the degree of semantic similarity. We developed an Iterative Intermediate Training approach using Multi-Task Learning (IIT-MTL), a multi-task training approach that employs iterative dataset selection. We applied this process to ClinicalBERT, a pretrained domain-specific transformer-based language model, and fine-tuned the resulting model on the target ClinicalSTS task. We incrementally ensembled the output from applying IIT-MTL on ClinicalBERT with (1) the output of other language models (BioBERT, MT-DNN, RoBERTa) and (2) hand-crafted features, using regression-based learning algorithms. Based on these experiments, we adopted the top-performing configurations as our official submissions.
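The abstract does not spell out how the iterative dataset selection proceeds. As a rough illustration only, the sketch below assumes a greedy forward selection: in each round, every remaining related dataset is tried as an addition to the multi-task training set, and the one that most improves a validation score is kept. The function name, the `evaluate` callback (standing in for "train with these datasets, then score on held-out ClinicalSTS data"), the dataset names, and the toy scores are all hypothetical and not taken from the paper.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def iterative_dataset_selection(
    candidates: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection of auxiliary datasets (assumed procedure).

    `evaluate` is a hypothetical callback that returns a validation score
    for a model trained with the given set of auxiliary datasets.
    """
    remaining = list(candidates)
    selected: List[str] = []
    best_score = evaluate(frozenset())  # baseline: no auxiliary datasets

    while remaining:
        # Score each remaining dataset when added to the current selection.
        scored = [
            (evaluate(frozenset(selected + [name])), name) for name in remaining
        ]
        round_score, round_best = max(scored)
        if round_score <= best_score:
            break  # no candidate improves the score, so stop iterating
        selected.append(round_best)
        remaining.remove(round_best)
        best_score = round_score

    return selected, best_score


# Toy stand-in for real training: each dataset shifts the score by a fixed amount.
TOY_GAINS = {"related_task_a": 0.03, "related_task_b": 0.01, "unrelated_task": -0.02}


def toy_evaluate(datasets: FrozenSet[str]) -> float:
    return 0.80 + sum(TOY_GAINS[name] for name in datasets)


chosen, score = iterative_dataset_selection(TOY_GAINS, toy_evaluate)
print(chosen, round(score, 2))
```

In this toy run the two helpful datasets are kept and the harmful one is rejected. In a real setting each call to `evaluate` would involve a full multi-task training run, so the number of candidate datasets drives the cost.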
Results:
Our system placed first out of 87 submitted systems in the 2019 n2c2/OHNLP ClinicalSTS challenge, achieving state-of-the-art results with a Pearson correlation coefficient of 0.9010. This winning system was an ensembled model leveraging the output of IIT-MTL on ClinicalBERT with BioBERT, MT-DNN and hand-crafted medication features.
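For reference, the Pearson correlation coefficient used as the evaluation metric can be computed as in the minimal sketch below. This is a plain-Python illustration with a hypothetical function name and made-up scores, not the challenge's official scoring script.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between gold and predicted similarity scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered products and squares; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up gold annotations (0-5 scale) and model predictions for illustration.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
pred = [0.4, 1.2, 3.3, 4.1, 4.9]
print(pearson_r(gold, pred))
```

Because the metric is invariant to shifting and positive rescaling of the predictions, it rewards getting the relative ordering and spacing of similarity scores right rather than matching their absolute values.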
Conclusions:
This study demonstrates that IIT-MTL is an effective way to leverage annotated data from related tasks to improve performance on a target task with a limited dataset. This contribution opens new avenues of exploration for the community in optimized dataset selection, with the goal of generating more robust and universal contextual representations of text in the clinical domain.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.