Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 31, 2020
Date Accepted: Oct 13, 2020

The final, peer-reviewed published version of this preprint can be found here:

Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning

Mahajan D, Poddar A, Liang JJ, Lin YT, Prager JM, Suryanarayanan P, Raghavan P, Tsou CH

JMIR Med Inform 2020;8(11):e22508

DOI: 10.2196/22508

PMID: 33245284

PMCID: 7732709

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training using Multi-Task Learning

  • Diwakar Mahajan
  • Ananya Poddar
  • Jennifer J Liang
  • Yen-Ting Lin
  • John M Prager
  • Parthasarathy Suryanarayanan
  • Preethi Raghavan
  • Ching-Huei Tsou

ABSTRACT

Background:

Although electronic health records (EHRs) have been widely adopted in health care, effective use of EHR data is often limited by redundant information in clinical notes, which is introduced by the use of templates and copy-paste during note generation. Thus, it is imperative to develop solutions that can condense information while retaining its value. A step in this direction is measuring the semantic similarity between clinical text snippets. To address this problem, we participated in the 2019 National NLP Clinical Challenges (n2c2) / Open Health Natural Language Processing Consortium (OHNLP) Clinical Semantic Textual Similarity (ClinicalSTS) shared task.

Objective:

This study aims to improve the performance and robustness of semantic textual similarity in the clinical domain by leveraging manually labeled data from related tasks and contextualized embeddings from pretrained transformer-based language models.

Methods:

The ClinicalSTS dataset consists of 1,642 pairs of de-identified clinical text snippets annotated on a continuous scale from 0 to 5, indicating degrees of semantic similarity. We developed an Iterative Intermediate Training approach using Multi-Task Learning (IIT-MTL), a multi-task training approach that employs iterative dataset selection. We applied this process to ClinicalBERT, a pretrained domain-specific transformer-based language model, and fine-tuned the resulting model on the target ClinicalSTS task. We incrementally ensembled the output from applying IIT-MTL on ClinicalBERT with (1) the output of other language models (BioBERT, MT-DNN, RoBERTa) and (2) hand-crafted features, using regression-based learning algorithms. Based on these experiments, we adopted the top-performing configurations as our official submissions.
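As an illustration of the ensembling step only, the following is a minimal sketch of combining per-model similarity scores with a regression model. The abstract says only that "regression-based learning algorithms" were used; the plain least-squares fit, the small `ridge` stabilizer, the clipping to the 0-5 range, the function names (`solve`, `fit_ensemble`, `predict`), and all numbers below are assumptions made for this example, not the authors' implementation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ensemble(features, targets, ridge=1e-6):
    """Fit intercept + one weight per input score via the normal equations.

    `ridge` is an assumed, tiny stabilizer for near-collinear model outputs.
    """
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    for i in range(1, k):  # do not penalize the intercept
        xtx[i][i] += ridge
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, feature_row):
    """Weighted combination of scores, clipped to the 0-5 annotation scale."""
    score = weights[0] + sum(w * f for w, f in zip(weights[1:], feature_row))
    return min(5.0, max(0.0, score))


# Hypothetical scores from two models for five sentence pairs, with gold labels.
model_scores = [[1.0, 2.0], [3.0, 1.0], [4.0, 5.0], [2.0, 2.0], [0.0, 3.0]]
gold = [1.5, 2.0, 4.5, 2.0, 1.5]
weights = fit_ensemble(model_scores, gold)
print(predict(weights, [2.0, 4.0]))
```

In this sketch each row of `model_scores` holds one score per base model for a sentence pair; hand-crafted features could be appended as extra columns in the same way.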

Results:

Our system placed first out of 87 submitted systems in the 2019 n2c2/OHNLP ClinicalSTS challenge, achieving state-of-the-art results with a Pearson correlation coefficient of 0.9010. This winning system was an ensembled model leveraging the output of IIT-MTL on ClinicalBERT with BioBERT, MT-DNN and hand-crafted medication features.
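For readers unfamiliar with the evaluation metric, here is a minimal sketch of how a Pearson correlation coefficient between gold similarity scores and system predictions can be computed. The function name `pearson` and the sample scores are hypothetical and chosen only for illustration; they are not data from the challenge.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold annotations and system predictions on the 0-5 scale.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
pred = [0.4, 1.2, 2.9, 4.1, 4.8]
print(round(pearson(gold, pred), 4))
```

A value of 1 indicates a perfect positive linear relationship between predictions and annotations, so a coefficient near 0.90 means the system's scores track the human similarity ratings closely.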

Conclusions:

This study demonstrates that IIT-MTL is an effective way to leverage annotated data from related tasks to improve performance on a target task with a limited dataset. This contribution opens new avenues of exploration for the community in optimized dataset selection, with the goal of generating more robust and universal contextual representations of text in the clinical domain.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.