Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 27, 2020
Date Accepted: Oct 26, 2020
Measuring Semantic Textual Similarity in Clinical Text: A Study of Transformer-based Models
ABSTRACT
Background:
Semantic textual similarity (STS) is one of the fundamental tasks in natural language processing (NLP). Many shared tasks and corpora for STS have been organized in the general English domain; yet, such resources are limited in the biomedical domain. In 2019, the n2c2 challenge developed a comprehensive clinical STS dataset and called for a community effort to solicit state-of-the-art solutions for clinical STS.
Objective:
Based on our participation in the 2019 n2c2/OHNLP shared task on clinical STS, this study presents the transformer-based clinical STS models we developed during the challenge, as well as new models we explored after the challenge.
Methods:
In this study, we explored three transformer-based models for clinical STS: BERT, XLNet, and RoBERTa. We examined transformer models pretrained on both general English text and clinical text. We also explored using a general English STS dataset as a supplementary corpus in addition to the clinical training set developed in this challenge. Furthermore, we investigated various ensemble methods to combine the different transformer models.
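At a high level, each fine-tuned model scores a sentence pair with a continuous similarity value, an ensemble combines the per-pair scores, and systems are ranked by Pearson correlation against the gold scores. The sketch below illustrates the simplest such ensemble, score averaging, on hypothetical predictions (the score values and 0-5 scale are illustrative assumptions, not data from the challenge):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-pair similarity scores (0-5 scale) from three fine-tuned models
bert_preds    = [4.2, 1.1, 3.5, 0.8]
xlnet_preds   = [4.0, 1.4, 3.2, 1.0]
roberta_preds = [4.5, 1.0, 3.8, 0.6]
gold          = [4.5, 1.0, 3.0, 0.5]  # hypothetical gold-standard annotations

# Simple ensemble: average each pair's predictions across models,
# then evaluate the ensemble against the gold scores
ensemble = [mean(p) for p in zip(bert_preds, xlnet_preds, roberta_preds)]
print(round(pearson(ensemble, gold), 4))
```

More elaborate ensembles (e.g., weighted averaging or a meta-regressor over model outputs) follow the same pattern: they operate on the per-pair scores, and Pearson correlation remains the evaluation metric.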
Results:
Our best submission, based on the XLNet model, achieved the third-best performance (Pearson correlation of 0.8864) in this challenge. After the challenge, we explored other transformer models and improved the performance to 0.9065 using a RoBERTa model, which outperformed the best-performing system developed in this challenge (Pearson correlation of 0.9010).
Conclusions:
This study demonstrates the effectiveness of transformer-based models for measuring semantic similarity in clinical text. Our models can support clinical applications such as clinical text deduplication and summarization.