JMIR Preprints #23375: Overview of the 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity

Overview of the 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity

Yanshan Wang;
Sunyang Fu;
Feichen Shen;
Sam Henry;
Ozlem Uzuner;
Hongfang Liu

ABSTRACT

Background:

Semantic textual similarity (STS) is a common task in the general English domain that assesses the degree to which the underlying semantics of two segments of text are equivalent. Clinical Semantic Textual Similarity (ClinicalSTS) is the STS task in the clinical domain: it attempts to measure the degree of semantic equivalence between two snippets of clinical text. Because templates are used frequently, clinical notes contain a large amount of redundant text, which makes ClinicalSTS crucial for the secondary use of clinical text in downstream clinical natural language processing (NLP) applications such as clinical text summarization, clinical semantics extraction, and clinical information retrieval.

Objective:

To release ClinicalSTS datasets and to motivate the NLP and biomedical informatics communities to tackle STS tasks in the clinical domain.

Methods:

We organized the first BioCreative/OHNLP ClinicalSTS shared task in 2018 by making available a real-world clinical note dataset. We continued the shared task in 2019 in collaboration with n2c2 and the OHNLP consortium, organizing the 2019 n2c2/OHNLP ClinicalSTS track. We released a larger ClinicalSTS dataset comprising a total of 1,642 clinical sentence pairs, including 1,068 pairs from the 2018 shared task as well as 1,006 new pairs from two electronic health record (EHR) systems, GE and Epic. Eighty percent of the data were released to participating teams to develop and tune STS systems, whereas the remaining 20% were held out as a blind test set to evaluate their systems.

Results:

The n2c2/OHNLP ClinicalSTS shared task attracted 78 international teams to sign up, of which 33 participating teams produced a total of 87 valid system submissions. The top three systems were submitted by IBM Research, the National Center for Biotechnology Information, and the University of Florida, with Pearson correlation scores of 0.901, 0.8967, and 0.8864, respectively. The workshop was held in conjunction with the AMIA 2019 Symposium.
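
The Pearson correlation scores reported above measure how closely a system's predicted similarity scores track the gold-standard annotations. As a minimal sketch of that metric (not the organizers' evaluation script), the function below computes it in plain Python; the function name `pearson` and the `gold` and `pred` values are hypothetical and chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up gold annotations and system predictions for five sentence pairs.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
pred = [0.4, 1.0, 3.2, 4.0, 4.9]
score = pearson(gold, pred)
print(f"Pearson correlation: {score:.4f}")
```

A value near 1 means the predictions rise and fall with the gold scores almost linearly. The metric does not penalize a constant offset or rescaling of the predictions, so it rewards getting the relative similarity of pairs right rather than matching the absolute scores.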

Conclusions:

The 2019 n2c2/OHNLP ClinicalSTS shared task focused on computing semantic similarity for sentences drawn from real-world clinical notes. It attracted a large number of international teams. Most top-performing systems used state-of-the-art neural language models, such as BERT and XLNet, and state-of-the-art deep learning training schemas, such as pre-training followed by fine-tuning, and multi-task learning. We also found that, overall, the participating systems performed better on the Epic sentence pairs than on the GE sentence pairs, even though a much larger portion of the training data consisted of GE sentence pairs. The ClinicalSTS shared task could continue to serve as a venue for researchers in the NLP and medical informatics communities to develop and improve STS techniques for clinical text.

Citation

Please cite as:

Wang Y, Fu S, Shen F, Henry S, Uzuner O, Liu H. The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview. JMIR Med Inform 2020;8(11):e23375

DOI: 10.2196/23375

PMID: 33245291

PMCID: 7732706

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 10, 2020

Open Peer Review Period: Aug 17, 2020 - Oct 17, 2020

Date Accepted: Nov 3, 2020
