JMIR Preprints #23101: Incorporating Domain Knowledge Into Language Models Using Graph Convolutional Networks for Clinical Semantic Textual Similarity


Incorporating Domain Knowledge Into Language Models Using Graph Convolutional Networks for Clinical Semantic Textual Similarity

ABSTRACT

Background:

While electronic health record systems have facilitated clinical documentation in healthcare, they have also introduced new challenges, such as the proliferation of redundant information through copy-and-paste commands or templates. One approach to trimming down bloated clinical documentation and improving clinical summarization is to identify highly similar text snippets so that the redundant text can be removed.

Objective:

We develop a natural language processing system for the task of clinical semantic textual similarity that assigns scores to pairs of clinical text snippets based on their clinical semantic similarity.

Methods:

We leverage recent advances in natural language processing and graph representation learning to create a model that combines linguistic and domain knowledge information from the MedSTS dataset to assess clinical semantic textual similarity. We use Bidirectional Encoder Representations from Transformers (BERT)-based models as text encoders for the sentence pairs in the dataset and graph convolutional networks (GCNs) as graph encoders for the corresponding concept graphs constructed from the sentences. We also explore techniques including data augmentation, ensembling, and knowledge distillation to improve performance as measured by Pearson correlation.
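
To make the graph encoder idea concrete, the following is a minimal sketch of a single GCN layer applied to a toy concept graph. It assumes the commonly used propagation rule ReLU(D^{-1/2} (A + I) D^{-1/2} H W); the abstract does not state which GCN variant, pooling step, or feature dimensions were used, so the function names, the mean pooling, and all numbers here are hypothetical and for illustration only.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square, non-negative adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps its own features during propagation.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph convolution: neighborhood averaging, linear map, then ReLU."""
    propagated = matmul(normalize_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


def mean_pool(node_embeddings):
    """Collapse node embeddings into one graph-level vector (assumed pooling choice)."""
    n = len(node_embeddings)
    return [sum(col) / n for col in zip(*node_embeddings)]


# Hypothetical concept graph: 3 concepts with edges 0-1 and 1-2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
# Hypothetical 2-dimensional node features (for example, concept embeddings).
features = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]]
# Hypothetical layer weights; in practice these are learned.
weights = [[1.0, -1.0],
           [0.5, 1.0]]

graph_vector = mean_pool(gcn_layer(adj, features, weights))
print(graph_vector)
```

In a full system, a graph-level vector like `graph_vector` would be combined with the text encoder's representation of the sentence pair before the similarity score is predicted; how that combination was done is not described in the abstract.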

Results:

Fine-tuning BERT-base and ClinicalBERT on the MedSTS dataset provided a strong baseline (0.842 and 0.848 Pearson correlation, respectively) compared to the previous year's submissions. Our data augmentation techniques yielded moderate gains in performance, and adding a GCN-based graph encoder to incorporate the concept graphs also boosted performance, especially when the node features were initialized with pretrained knowledge graph embeddings of the concepts (0.868). As expected, ensembling improved performance, and multi-source ensembling using different language model variants, conducting knowledge distillation on the multi-source ensemble model, and taking a final ensemble of the distilled models further improved the system's performance (0.875, 0.878, and 0.882, respectively).
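
The evaluation metric and the ensembling step can be illustrated with a short sketch. It assumes the ensemble combines models by averaging their predicted scores, which the abstract does not confirm; the scores below are made-up toy values, not results from the paper, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ensemble_mean(predictions):
    """Average per-example scores across models (assumed combination rule)."""
    return [sum(scores) / len(scores) for scores in zip(*predictions)]


# Toy gold similarity scores and two hypothetical models' predictions.
gold = [0.5, 2.0, 3.5, 4.5, 1.0]
model_a = [1.0, 2.5, 3.0, 4.0, 0.5]
model_b = [0.0, 1.5, 4.0, 5.0, 2.0]

ensemble = ensemble_mean([model_a, model_b])
print("model A:", pearson(gold, model_a))
print("model B:", pearson(gold, model_b))
print("ensemble:", pearson(gold, ensemble))
```

In a distillation setup of the kind the abstract mentions, averaged predictions such as `ensemble` could serve as soft regression targets for training a single student model; the exact training objective used by the authors is not given here.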

Conclusions:

We develop a system for the MedSTS clinical semantic textual similarity benchmark task by combining BERT-based text encoders and GCN-based graph encoders in order to incorporate domain knowledge into the natural language processing pipeline. We also experiment with other techniques involving data augmentation, pretrained concept embeddings, ensembling, and knowledge distillation to further increase our performance.

Citation

Please cite as:

Incorporating Domain Knowledge Into Language Models by Using Graph Convolutional Networks for Assessing Semantic Textual Similarity: Model Development and Performance Comparison

JMIR Med Inform 2021;9(11):e23101

DOI: 10.2196/23101

PMID: 34842531

PMCID: 8665398


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 31, 2020

Date Accepted: Jan 14, 2021

Date Submitted to PubMed: Nov 29, 2021
