JMIR Preprints #22508: Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training using Multi-Task Learning

Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training using Multi-Task Learning

Diwakar Mahajan;
Ananya Poddar;
Jennifer J Liang;
Yen-Ting Lin;
John M Prager;
Parthasarathy Suryanarayanan;
Preethi Raghavan;
Ching-Huei Tsou

ABSTRACT

Background:

Although electronic health records (EHRs) have been widely adopted in healthcare, effective use of EHR data is often limited because of redundant information in clinical notes introduced by the use of templates and copy-paste during note generation. Thus, it is imperative to develop solutions that can condense information while retaining its value. A step in this direction is measuring the semantic similarity between clinical text snippets. To address this problem, we participated in the 2019 National NLP Clinical Challenges (n2c2) / Open Health Natural Language Processing Consortium (OHNLP) Clinical Semantic Textual Similarity (ClinicalSTS) shared task.

Objective:

This study aims to improve the performance and robustness of semantic textual similarity in the clinical domain by leveraging manually labeled data from related tasks and contextualized embeddings from pretrained transformer-based language models.

Methods:

The ClinicalSTS dataset consists of 1,642 pairs of de-identified clinical text snippets annotated on a continuous scale of 0 to 5, indicating degrees of semantic similarity. We developed an iterative intermediate training approach using multi-task learning (IIT-MTL), a multi-task training approach that employs iterative dataset selection. We applied this process to ClinicalBERT, a pretrained domain-specific transformer-based language model, and fine-tuned the resulting model on the target ClinicalSTS task. We incrementally ensembled the output from applying IIT-MTL to ClinicalBERT with (1) the output of other language models (BioBERT, MT-DNN, RoBERTa) and (2) hand-crafted features, using regression-based learning algorithms. Based on these experiments, we adopted the top-performing configurations as our official submissions.
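
As a rough illustration of what iterative dataset selection could look like, the sketch below greedily adds one candidate auxiliary dataset at a time and keeps it only if a validation score improves. This is an assumption made for illustration: the abstract does not specify the selection rule, and the dataset names (`aux_a`, `aux_b`, `aux_c`), the gains in `TOY_GAIN`, and the `toy_evaluate` function are hypothetical stand-ins for "multi-task train, fine-tune on the target task, then score on held-out data".

```python
from typing import Callable, Iterable, List, Tuple


def iterative_dataset_selection(
    candidates: Iterable[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Greedily keep a candidate dataset only if it improves the score."""
    selected: List[str] = []
    best = evaluate(selected)  # baseline: no auxiliary datasets
    for name in candidates:
        trial = selected + [name]
        score = evaluate(trial)
        if score > best:  # keep the dataset only when it helps
            selected, best = trial, score
    return selected, best


# Hypothetical per-dataset effects on the validation score (made-up numbers).
TOY_GAIN = {"aux_a": 0.02, "aux_b": -0.01, "aux_c": 0.01}


def toy_evaluate(datasets: List[str]) -> float:
    """Toy stand-in for training and validating a model on these datasets."""
    return 0.80 + sum(TOY_GAIN[d] for d in datasets)


chosen, score = iterative_dataset_selection(
    ["aux_a", "aux_b", "aux_c"], toy_evaluate
)
print(chosen, round(score, 2))
```

In a real setting, `evaluate` would be far more expensive than this toy function, so the order in which candidates are tried and the number of candidates would matter in practice.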

Results:

Our system placed first out of 87 submitted systems in the 2019 n2c2/OHNLP ClinicalSTS challenge, achieving state-of-the-art results with a Pearson correlation coefficient of 0.9010. This winning system was an ensembled model leveraging the output of IIT-MTL on ClinicalBERT with BioBERT, MT-DNN and hand-crafted medication features.
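
For readers unfamiliar with the metric, the Pearson correlation coefficient between predicted and gold similarity scores can be computed as in the minimal sketch below. The function name `pearson` and the example scores are illustrative assumptions, not the challenge's official scoring script or data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up gold and predicted similarity scores on the 0 to 5 scale.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
predicted = [0.4, 1.2, 3.3, 4.1, 4.9]
print(round(pearson(gold, predicted), 4))
```

Because the coefficient is invariant to linear rescaling of either sequence, it rewards predictions that rank and space sentence pairs consistently with the annotations rather than predictions that match the absolute scores.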

Conclusions:

This study demonstrates that IIT-MTL is an effective way to leverage annotated data from related tasks to improve performance on a target task with a limited dataset. This contribution opens new avenues of exploration for the community in optimized dataset selection, with the goal of generating more robust and universal contextual representations of text in the clinical domain.

Citation

Please cite as:

Mahajan D, Poddar A, Liang JJ, Lin YT, Prager JM, Suryanarayanan P, Raghavan P, Tsou CH

Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning

JMIR Med Inform 2020;8(11):e22508

DOI: 10.2196/22508

PMID: 33245284

PMCID: 7732709

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 31, 2020

Date Accepted: Oct 13, 2020
