Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jan 22, 2021
Date Accepted: Aug 6, 2021

The final, peer-reviewed published version of this preprint can be found here:

Chen Q, Rankine A, Peng Y, Lu Z

Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study

JMIR Med Inform 2021;9(12):e27386

DOI: 10.2196/27386

PMID: 34967748

PMCID: 8759018

Benchmarking effectiveness and efficiency of deep-learning models for semantic textual similarity in the clinical domain

Qingyu Chen; Alex Rankine; Yifan Peng; Zhiyong Lu

ABSTRACT

Background:

Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS dataset and called for community submissions as part of the 2019 National NLP Clinical Challenges (n2c2). A total of 87 models from 33 teams were submitted. This work describes our entry, an ensemble built from a range of deep-learning models. Our NLM/NCBI team obtained a Pearson correlation of 0.8967 on the official test set of the 2019 n2c2/OHNLP shared task and ranked second.
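
For orientation, systems in the shared task are ranked by the Pearson correlation between model-predicted similarity scores and the expert annotations. A minimal Python sketch of that metric follows; the scores are invented for illustration, not drawn from the dataset.

    # Pearson correlation between gold and predicted similarity scores,
    # the ranking metric of the shared task. Values are illustrative only.
    from scipy.stats import pearsonr

    gold = [4.5, 0.5, 3.0, 2.0, 5.0]       # hypothetical expert labels (0-5 scale)
    predicted = [4.2, 1.0, 2.8, 2.5, 4.7]  # hypothetical model outputs

    r, _p = pearsonr(gold, predicted)
    print(f"Pearson r = {r:.4f}")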

Objective:

A Pearson correlation of approximately 0.90 indicates that model predictions correlate strongly with manual annotations, which motivates the potential use of deep-learning models in production systems. Annotator-level agreement, however, was only moderate (a weighted Cohen’s kappa of 0.60). We therefore urge caution in interpreting the models’ very high correlations and argue that it is more critical to evaluate the models in greater depth. In this study, which is part of the 2019 n2c2/OHNLP shared task, we benchmark the effectiveness and efficiency of the top-ranked deep-learning models and quantify their robustness and inference times to assess whether they could be used in real-time applications.
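
As a rough illustration of the annotator-level agreement figure, a weighted Cohen’s kappa can be computed with scikit-learn. The labels below are hypothetical, and the linear weighting is an assumption; the abstract does not state which weighting scheme was used.

    # Weighted Cohen's kappa between two annotators' integer similarity labels.
    # Both the labels and the "linear" weighting are illustrative assumptions.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = [4, 1, 3, 2, 5, 0, 3]
    annotator_b = [5, 1, 2, 2, 4, 1, 3]

    kappa = cohen_kappa_score(annotator_a, annotator_b, weights="linear")
    print(f"weighted Cohen's kappa = {kappa:.2f}")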

Methods:

We benchmarked five deep-learning models that rank among the top systems for STS tasks: a convolutional neural network (CNN), BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. For each model, we repeated the experiment 10 times using the official training and test sets. We report the average Pearson correlation and running time with 95% confidence intervals and compare models using the Wilcoxon rank-sum test. We further performed quantitative error analysis at different similarity levels and qualitatively analyzed the erroneous cases.
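
A sketch of this protocol is shown below, assuming a hypothetical train_and_predict wrapper around one of the five models; real code would load the official n2c2 training and test splits.

    # Repeat each model 10 times; collect the Pearson correlation and the
    # wall-clock time per run, summarize with a 95% CI, and compare two
    # models with the Wilcoxon rank-sum test. train_and_predict is a
    # hypothetical stand-in for training and scoring one model.
    import time
    import numpy as np
    from scipy.stats import pearsonr, ranksums, t

    def benchmark(train_and_predict, train_set, test_pairs, gold, runs=10):
        correlations, run_times = [], []
        for seed in range(runs):
            start = time.perf_counter()
            preds = train_and_predict(train_set, test_pairs, seed=seed)
            run_times.append(time.perf_counter() - start)
            correlations.append(pearsonr(gold, preds)[0])
        return np.array(correlations), np.array(run_times)

    def mean_ci95(values):
        # Mean with a 95% confidence interval (t-distribution).
        m = values.mean()
        half = values.std(ddof=1) / np.sqrt(len(values)) * t.ppf(0.975, len(values) - 1)
        return m, (m - half, m + half)

    # Significance of the difference between two models' correlations:
    # stat, p = ranksums(correlations_model_a, correlations_model_b)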

Results:

Using only the official training set, all five models already achieved reasonable effectiveness. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). Their robustness to sentence pairs at different similarity levels, however, varied significantly. In particular, the BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs: they failed to capture such pairs effectively when the pairs differed in negation terms or word order. Time efficiency also diverged sharply from effectiveness: on average, the BERT models were 20 and 50 times slower than the CNN and BioSentVec models, respectively, which poses challenges for real-time applications.
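
The similarity-level error analysis can be sketched as follows; the bin edges here are an assumption, as the paper defines its own similarity levels.

    # Bucket test pairs by gold similarity and report the mean squared error
    # per bucket to see where a model is least robust (for example, on
    # highly similar pairs). Bin edges are illustrative assumptions.
    import numpy as np

    def mse_by_level(gold, preds, edges=(0, 1, 2, 3, 4, 5)):
        gold = np.asarray(gold, dtype=float)
        preds = np.asarray(preds, dtype=float)
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (gold >= lo) & (gold < hi)
            if hi == edges[-1]:          # keep pairs with the maximum score
                mask |= gold == hi
            if mask.any():
                mse = float(np.mean((gold[mask] - preds[mask]) ** 2))
                print(f"similarity level {lo}-{hi}: MSE = {mse:.3f} ({int(mask.sum())} pairs)")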

Conclusions:

Despite the excitement around further improving Pearson correlations on this dataset, our results highlight that evaluating both the effectiveness and the efficiency of STS models is critical. We suggest further evaluation of the models’ generalization capability and user-level testing. We also call for community efforts to create more biomedical and clinical STS datasets from different perspectives, reflecting the multifaceted notion of sentence relatedness.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.