Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jan 22, 2021
Date Accepted: Aug 6, 2021

The final, peer-reviewed published version of this preprint can be found here:

Chen Q, Rankine A, Peng Y, Lu Z

Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study

JMIR Med Inform 2021;9(12):e27386

DOI: 10.2196/27386

PMID: 34967748

PMCID: 8759018

Benchmarking effectiveness and efficiency of deep-learning models for semantic textual similarity in the clinical domain

Qingyu Chen; Alex Rankine; Yifan Peng; Zhiyong Lu

ABSTRACT

Background:

Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS dataset and called for community submissions as part of the 2019 National NLP Clinical Challenges (n2c2). A total of 87 models from 33 teams were submitted. This work describes our entry, an ensemble built from a range of deep-learning models. Our NLM/NCBI team obtained a Pearson correlation of 0.8967 on the official test set of the 2019 n2c2/OHNLP shared task and ranked second.
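
For orientation, systems in the shared task are ranked by the Pearson correlation between model-predicted similarity scores and the expert annotations. A minimal Python sketch of that metric follows; the scores are invented for illustration, not drawn from the dataset.

    # Pearson correlation between gold and predicted similarity scores,
    # the ranking metric of the shared task. Values are illustrative only.
    from scipy.stats import pearsonr

    gold = [4.5, 0.5, 3.0, 2.0, 5.0]       # hypothetical expert labels (0-5 scale)
    predicted = [4.2, 1.0, 2.8, 2.5, 4.7]  # hypothetical model outputs

    r, _p = pearsonr(gold, predicted)
    print(f"Pearson r = {r:.4f}")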

Objective:

A Pearson correlation of approximately 0.90 indicates that model predictions correlate strongly with manual annotations, which motivates the potential use of deep-learning models in production systems. Annotator-level agreement, however, was only moderate (a weighted Cohen’s kappa of 0.60). We therefore urge caution in interpreting the models’ very high correlations and argue that it is more critical to evaluate the models in greater depth. In this study, which is part of the 2019 n2c2/OHNLP shared task, we benchmark the effectiveness and efficiency of the top-ranked deep-learning models and quantify their robustness and inference times to assess whether they could be used in real-time applications.
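
As a rough illustration of the annotator-level agreement figure, a weighted Cohen’s kappa can be computed with scikit-learn. The labels below are hypothetical, and the linear weighting is an assumption; the abstract does not state which weighting scheme was used.

    # Weighted Cohen's kappa between two annotators' integer similarity labels.
    # Both the labels and the "linear" weighting are illustrative assumptions.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = [4, 1, 3, 2, 5, 0, 3]
    annotator_b = [5, 1, 2, 2, 4, 1, 3]

    kappa = cohen_kappa_score(annotator_a, annotator_b, weights="linear")
    print(f"weighted Cohen's kappa = {kappa:.2f}")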

Methods:

We benchmarked five deep-learning models that rank among the top systems for STS tasks: a convolutional neural network (CNN), BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. For each model, we repeated the experiment 10 times using the official training and test sets. We report the average Pearson correlation and running time with 95% confidence intervals and compare models using the Wilcoxon rank-sum test. We further performed quantitative error analysis at different similarity levels and qualitatively analyzed the erroneous cases.
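
A sketch of this protocol is shown below, assuming a hypothetical train_and_predict wrapper around one of the five models; real code would load the official n2c2 training and test splits.

    # Repeat each model 10 times; collect the Pearson correlation and the
    # wall-clock time per run, summarize with a 95% CI, and compare two
    # models with the Wilcoxon rank-sum test. train_and_predict is a
    # hypothetical stand-in for training and scoring one model.
    import time
    import numpy as np
    from scipy.stats import pearsonr, ranksums, t

    def benchmark(train_and_predict, train_set, test_pairs, gold, runs=10):
        correlations, run_times = [], []
        for seed in range(runs):
            start = time.perf_counter()
            preds = train_and_predict(train_set, test_pairs, seed=seed)
            run_times.append(time.perf_counter() - start)
            correlations.append(pearsonr(gold, preds)[0])
        return np.array(correlations), np.array(run_times)

    def mean_ci95(values):
        # Mean with a 95% confidence interval (t-distribution).
        m = values.mean()
        half = values.std(ddof=1) / np.sqrt(len(values)) * t.ppf(0.975, len(values) - 1)
        return m, (m - half, m + half)

    # Significance of the difference between two models' correlations:
    # stat, p = ranksums(correlations_model_a, correlations_model_b)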

Results:

Using only the official training set, all five models already achieved reasonable effectiveness. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). Their robustness to sentence pairs at different similarity levels, however, varied significantly. In particular, the BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs: they failed to capture such pairs effectively when the pairs differed in negation terms or word order. Time efficiency also diverged sharply from effectiveness: on average, the BERT models were 20 and 50 times slower than the CNN and BioSentVec models, respectively, which poses challenges for real-time applications.
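
The similarity-level error analysis can be sketched as follows; the bin edges here are an assumption, as the paper defines its own similarity levels.

    # Bucket test pairs by gold similarity and report the mean squared error
    # per bucket to see where a model is least robust (for example, on
    # highly similar pairs). Bin edges are illustrative assumptions.
    import numpy as np

    def mse_by_level(gold, preds, edges=(0, 1, 2, 3, 4, 5)):
        gold = np.asarray(gold, dtype=float)
        preds = np.asarray(preds, dtype=float)
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (gold >= lo) & (gold < hi)
            if hi == edges[-1]:          # keep pairs with the maximum score
                mask |= gold == hi
            if mask.any():
                mse = float(np.mean((gold[mask] - preds[mask]) ** 2))
                print(f"similarity level {lo}-{hi}: MSE = {mse:.3f} ({int(mask.sum())} pairs)")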

Conclusions:

Despite the excitement around further improving Pearson correlations on this dataset, our results highlight that evaluating both the effectiveness and the efficiency of STS models is critical. We suggest further evaluation of the models’ generalization capability and user-level testing. We also call for community efforts to create more biomedical and clinical STS datasets from different perspectives, reflecting the multifaceted notion of sentence relatedness.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.