Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jan 24, 2025
Open Peer Review Period: Feb 4, 2025 - Apr 1, 2025
Date Accepted: Apr 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Detecting Redundant Health Survey Questions by Using Language-Agnostic Bidirectional Encoder Representations From Transformers Sentence Embedding: Algorithm Development Study

Sunghoon K, Kim H, Park H, Taira R

JMIR Med Inform 2025;13:e71687

DOI: 10.2196/71687

PMID: 40493668

PMCID: 12173092

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Detecting Redundant Health Survey Questions Using Language-agnostic BERT Sentence Embedding (LaBSE)

  • Kang Sunghoon
  • Hyeoneui Kim
  • Hyewon Park
  • Ricky Taira

ABSTRACT

Background:

As the importance of patient-generated health data (PGHD) in health care and research has increased, efforts have been made to standardize survey-based PGHD to improve its usability and interoperability. Standardization efforts such as the Patient-Reported Outcomes Measurement Information System (PROMIS) and the NIH Common Data Elements (CDE) repository provide effective tools for managing and unifying health survey questions. However, previous methods using ontology-mediated annotation are not only labor-intensive and difficult to scale but also face challenges in identifying semantic redundancies in survey questions, especially across multiple languages.

Objective:

The goal of this work was to compute the semantic similarity among publicly available health survey questions in order to facilitate the standardization of survey-based PGHD.

Methods:

We compiled health survey questions authored in English and Korean from the NIH CDE Repository, PROMIS, Korean public health agencies, and academic publications. Questions were drawn from various health lifelog domains. A randomized question pairing scheme was used to generate a Semantic Textual Similarity (STS) dataset consisting of 1758 question pairs. Two human experts assigned a similarity score to each question pair. The tagged dataset was then used to build four classifiers featuring Bag-of-Words, SBERT with BERT-based embeddings, SBERT with LaBSE embeddings, and GPT-4o. The algorithms were evaluated using traditional contingency statistics.
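To make the pairwise scoring idea concrete, the following is a minimal sketch of a Bag-of-Words cosine similarity between two survey questions, using only the Python standard library. It is not the authors' implementation: the function names, the tokenizer, the 0.5 threshold, and the example questions are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of word characters; \w is Unicode-aware in
    # Python 3, so Korean (Hangul) tokens are kept as well.
    return re.findall(r"\w+", text.lower())


def bow_cosine(q1, q2):
    # Cosine similarity between the term-count vectors of two questions.
    a, b = Counter(tokenize(q1)), Counter(tokenize(q2))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_redundant(q1, q2, threshold=0.5):
    # Hypothetical decision rule: flag a pair as redundant when its
    # similarity reaches the threshold (0.5 is an arbitrary example value).
    return bow_cosine(q1, q2) >= threshold


# Illustrative questions, not taken from the study's dataset.
english = "Do you currently smoke cigarettes?"
paraphrase = "Do you smoke cigarettes now?"
korean = "현재 담배를 피우십니까?"

print(bow_cosine(english, paraphrase))  # shares most tokens, so fairly high
print(bow_cosine(english, korean))      # no shared tokens across languages
```

Because a token-overlap measure shares no vocabulary across English and Korean, the cross-lingual pair scores zero regardless of meaning. A multilingual sentence-embedding model would replace the count vectors with dense vectors while keeping the same cosine comparison.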

Results:

Among the algorithms evaluated, SBERT-LaBSE demonstrated the highest performance in assessing question similarity across both languages, achieving areas under the receiver operating characteristic (ROC) and precision-recall curves of over 0.99. It also proved effective in identifying cross-lingual semantic similarities.
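As a reminder of what the area under the ROC curve measures, the sketch below computes it directly as the probability that a randomly chosen similar pair receives a higher score than a randomly chosen dissimilar pair, with ties counted as half. The function name and the toy labels and scores are assumptions for illustration and are not results from the study.

```python
def roc_auc(labels, scores):
    # labels: 1 for pairs judged similar, 0 for pairs judged dissimilar.
    # scores: the similarity score an algorithm assigned to each pair.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Count positive/negative comparisons won by the positive pair.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: one similar pair is scored below one dissimilar pair.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 3 of 4 comparisons are won: 0.75
```

This pairwise form is quadratic in the number of pairs, which is fine for a dataset of a few thousand question pairs; library implementations typically use a sort-based method instead.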

Conclusions:

This study introduces the SBERT-LaBSE algorithm for calculating semantic similarity across two languages and shows that it outperforms BERT-based models, the GPT-4o model, and the Bag-of-Words approach, highlighting its potential to improve the semantic interoperability of survey-based PGHD across language barriers.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.