Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 24, 2025
Open Peer Review Period: Feb 4, 2025 - Apr 1, 2025
Date Accepted: Apr 20, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Detecting Redundant Health Survey Questions Using Language-agnostic BERT Sentence Embedding (LaBSE)
ABSTRACT
Background:
As the importance of patient-generated health data (PGHD) in healthcare and research has increased, efforts have been made to standardize survey-based PGHD to improve its usability and interoperability. Standardization efforts such as the Patient-Reported Outcomes Measurement Information System (PROMIS) and the NIH Common Data Elements (CDE) repository provide effective tools for managing and unifying health survey questions. However, previous methods using ontology-mediated annotation are not only labor-intensive and difficult to scale but also face challenges in identifying semantic redundancies among survey questions, especially across multiple languages.
Objective:
The goal of this work was to compute the semantic similarity among publicly available health survey questions in order to facilitate the standardization of survey-based PGHD.
Methods:
We compiled health survey questions authored in both English and Korean from the NIH CDE Repository, PROMIS, Korean public health agencies, and academic publications, drawing on a range of health lifelog domains. A randomized question-pairing scheme was used to generate a semantic textual similarity (STS) dataset of 1758 question pairs, and two human experts assigned a similarity score to each pair. The tagged dataset was then used to build four classifiers: bag-of-words, SBERT with BERT-based embeddings, SBERT with LaBSE embeddings, and GPT-4o. The algorithms were evaluated using traditional contingency statistics.
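To illustrate the core idea, the following sketch shows how cross-lingual question similarity could be scored with sentence embeddings. The vectors below are stand-ins for illustration only; in practice they would come from a LaBSE encoder (e.g., the `sentence-transformers/LaBSE` model via the `sentence-transformers` library), and the threshold for calling a pair redundant would be tuned on the expert-tagged STS dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, embeddings would come from LaBSE, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/LaBSE")
#   emb = model.encode([question_en, question_ko])
# The vectors below are hypothetical stand-ins.
q1_emb = np.array([0.20, 0.80, 0.10])   # e.g., an English survey question
q2_emb = np.array([0.25, 0.75, 0.12])   # e.g., a Korean paraphrase of the same question
q3_emb = np.array([0.90, 0.05, 0.40])   # e.g., a semantically unrelated question

sim_redundant = cosine_similarity(q1_emb, q2_emb)  # high -> likely redundant pair
sim_distinct = cosine_similarity(q1_emb, q3_emb)   # low -> semantically distinct
print(sim_redundant, sim_distinct)
```

Because LaBSE maps sentences from different languages into a shared embedding space, the same cosine-similarity comparison works for English-Korean pairs without translation.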
Results:
Among the four algorithms, SBERT-LaBSE demonstrated the highest performance in assessing question similarity across both languages, achieving areas under both the receiver operating characteristic (ROC) and precision-recall curves above 0.99. It also proved effective in identifying cross-lingual semantic similarities.
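The evaluation metrics reported above can be computed with scikit-learn. The labels and scores below are hypothetical illustrative values, not the study's data: labels mark whether a pair was judged redundant, and scores are a model's similarity outputs.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical binary redundancy labels for expert-tagged question pairs
# and hypothetical model similarity scores (illustrative values only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.97, 0.95, 0.91, 0.88, 0.90, 0.42, 0.30, 0.15]

auroc = roc_auc_score(labels, scores)          # area under the ROC curve
auprc = average_precision_score(labels, scores)  # area under the precision-recall curve
print(f"AUROC: {auroc:.3f}, AUPRC: {auprc:.3f}")
```

Both metrics are threshold-free, which makes them a natural fit for a similarity score that is later binarized by a tunable cutoff.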
Conclusions:
This study introduces the SBERT-LaBSE algorithm for calculating semantic similarity across two languages and shows that it outperforms the BERT-based models, the GPT-4o model, and the bag-of-words approach, highlighting its potential to improve the semantic interoperability of survey-based PGHD across language barriers.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.