Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 10, 2020
Open Peer Review Period: Aug 17, 2020 - Oct 17, 2020
Date Accepted: Nov 3, 2020
The final, peer-reviewed published version of this preprint can be found here:

The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview

Wang Y, Fu S, Shen F, Henry S, Uzuner O, Liu H

The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview

JMIR Med Inform 2020;8(11):e23375

DOI: 10.2196/23375

PMID: 33245291

PMCID: 7732706

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Overview of the 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity

  • Yanshan Wang
  • Sunyang Fu
  • Feichen Shen
  • Sam Henry
  • Ozlem Uzuner
  • Hongfang Liu

ABSTRACT

Background:

Semantic textual similarity (STS) is a common task in the general English domain that assesses the degree to which the underlying semantics of two segments of text are equivalent. Clinical Semantic Textual Similarity (ClinicalSTS) is the STS task in the clinical domain; it attempts to measure the degree of semantic equivalence between two snippets of clinical text. Because templates are used frequently, clinical notes contain a large amount of redundant text, which makes ClinicalSTS crucial for the secondary use of clinical text in downstream clinical natural language processing (NLP) applications such as clinical text summarization, clinical semantics extraction, and clinical information retrieval.

Objective:

To release ClinicalSTS datasets and to motivate the NLP and biomedical informatics communities to tackle STS tasks in the clinical domain.

Methods:

We organized the first BioCreative/OHNLP ClinicalSTS shared task in 2018 by making available a real-world clinical note dataset. We continued the shared task in 2019 in collaboration with n2c2 and the OHNLP consortium, and organized the 2019 n2c2/OHNLP ClinicalSTS track. We released a larger ClinicalSTS dataset comprising a total of 1,642 clinical sentence pairs, including 1,068 pairs from the 2018 shared task as well as 1,006 new pairs from two electronic health record (EHR) systems, GE and Epic. Eighty percent of the data were released to participating teams to develop and tune STS systems, whereas the remaining 20% were held out as a blind test set to evaluate their systems.
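As an illustration of the 80%/20% partition described above, the following is a minimal sketch of a seeded random split. The abstract does not state how the organizers actually partitioned the data (for example, whether the split was random or stratified by EHR system), so the function name `split_pairs`, the random shuffle, and the seed are all assumptions made for illustration only.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Split sentence pairs into a training part and a held-out test part.

    Hypothetical helper: a plain seeded shuffle followed by a cut.
    This is NOT the organizers' documented procedure.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(pairs)                     # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # deterministic for a fixed seed
    cut = int(len(shuffled) * train_fraction)  # number of training pairs
    return shuffled[:cut], shuffled[cut:]


# Toy data standing in for (sentence_a, sentence_b) pairs.
toy_pairs = [(f"sentence a{i}", f"sentence b{i}") for i in range(10)]
train, test = split_pairs(toy_pairs)
```

With the toy list of 10 pairs, the training part holds 8 pairs and the test part holds 2, and no pair appears in both.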

Results:

A total of 78 international teams signed up for the n2c2/OHNLP ClinicalSTS shared task, of which 33 participating teams produced a total of 87 valid system submissions. The top three systems were submitted by IBM Research, the National Center for Biotechnology Information, and the University of Florida, with Pearson correlation scores of 0.901, 0.8967, and 0.8864, respectively. The workshop was held in conjunction with the AMIA 2019 Symposium.
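The systems above are ranked by Pearson correlation between system-predicted and gold-standard similarity scores. The sketch below shows how that coefficient can be computed from two score lists using only the standard library. The function name `pearson` and the example scores are hypothetical and are not taken from the shared task data or its official evaluation script.

```python
import math


def pearson(gold, predicted):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(gold)
    mean_g = sum(gold) / n
    mean_p = sum(predicted) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    # Sums of squared deviations (variances up to the same constant factor).
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_g * var_p)


# Made-up similarity scores for four sentence pairs (illustration only).
gold_scores = [0.0, 1.5, 3.0, 4.5]
system_scores = [0.5, 1.0, 3.5, 4.0]
r = pearson(gold_scores, system_scores)
```

Because the coefficient is invariant to linear rescaling of either list, a system is rewarded for ordering and spacing pairs consistently with the gold standard rather than for matching the absolute score values.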

Conclusions:

The 2019 n2c2/OHNLP ClinicalSTS shared task focused on computing semantic similarity for clinical text sentences generated from real-world clinical notes. It attracted a large number of international teams. Most top-performing systems used state-of-the-art neural language models, such as BERT and XLNet, and state-of-the-art training schemas in deep learning, such as the pre-training and fine-tuning schema and multi-task learning. We also found that, overall, the participating systems performed better on the Epic sentence pairs than on the GE sentence pairs, even though a much larger portion of the training data consisted of GE sentence pairs. The ClinicalSTS shared task could continue to serve as a venue for researchers in the NLP and medical informatics communities to develop and improve STS techniques for clinical text.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.