Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 24, 2025
Open Peer Review Period: Feb 4, 2025 - Apr 1, 2025
Date Accepted: Apr 20, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Detecting Redundant Health Survey Questions Using Language-agnostic BERT Sentence Embedding (LaBSE)
ABSTRACT
Background:
As the importance of patient-generated health data (PGHD) in healthcare and research has increased, efforts have been made to standardize survey-based PGHD to improve its usability and interoperability. Standardization efforts such as the Patient-Reported Outcomes Measurement Information System (PROMIS) and the NIH Common Data Elements (CDE) repository provide effective tools for managing and unifying health survey questions. However, previous methods using ontology-mediated annotation are not only labor-intensive and difficult to scale but also face challenges in identifying semantic redundancies among survey questions, especially across multiple languages.
Objective:
The goal of this work was to compute the semantic similarity among publicly available health survey questions in order to facilitate the standardization of survey-based PGHD.
Methods:
We compiled health survey questions authored in both English and Korean from the NIH CDE Repository, PROMIS, Korean public health agencies, and academic publications, drawing on a range of health lifelog domains. A randomized question-pairing scheme was used to generate a semantic textual similarity (STS) dataset of 1758 question pairs, and two human experts assigned a similarity score to each pair. The tagged dataset was then used to build four classifiers: bag-of-words, SBERT with BERT-based embeddings, SBERT with LaBSE embeddings, and GPT-4o. The algorithms were evaluated using traditional contingency statistics.
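To illustrate the core idea, the following sketch shows how cross-lingual question similarity could be scored with sentence embeddings. The vectors below are stand-ins for illustration only; in practice they would come from a LaBSE encoder (e.g., the `sentence-transformers/LaBSE` model via the `sentence-transformers` library), and the threshold for calling a pair redundant would be tuned on the expert-tagged STS dataset.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice, embeddings would come from LaBSE, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/LaBSE")
#   emb = model.encode([question_en, question_ko])
# The vectors below are hypothetical stand-ins.
q1_emb = np.array([0.20, 0.80, 0.10])   # e.g., an English survey question
q2_emb = np.array([0.25, 0.75, 0.12])   # e.g., a Korean paraphrase of the same question
q3_emb = np.array([0.90, 0.05, 0.40])   # e.g., a semantically unrelated question

sim_redundant = cosine_similarity(q1_emb, q2_emb)  # high -> likely redundant pair
sim_distinct = cosine_similarity(q1_emb, q3_emb)   # low -> semantically distinct
print(sim_redundant, sim_distinct)
```

Because LaBSE maps sentences from different languages into a shared embedding space, the same cosine-similarity comparison works for English-Korean pairs without translation.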
Results:
Among the four algorithms, SBERT-LaBSE demonstrated the highest performance in assessing question similarity across both languages, achieving areas under both the receiver operating characteristic (ROC) and precision-recall curves above 0.99. It also proved effective in identifying cross-lingual semantic similarities.
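The evaluation metrics reported above can be computed with scikit-learn. The labels and scores below are hypothetical illustrative values, not the study's data: labels mark whether a pair was judged redundant, and scores are a model's similarity outputs.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical binary redundancy labels for expert-tagged question pairs
# and hypothetical model similarity scores (illustrative values only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.97, 0.95, 0.91, 0.88, 0.90, 0.42, 0.30, 0.15]

auroc = roc_auc_score(labels, scores)          # area under the ROC curve
auprc = average_precision_score(labels, scores)  # area under the precision-recall curve
print(f"AUROC: {auroc:.3f}, AUPRC: {auprc:.3f}")
```

Both metrics are threshold-free, which makes them a natural fit for a similarity score that is later binarized by a tunable cutoff.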
Conclusions:
This study introduces the SBERT-LaBSE algorithm for calculating semantic similarity across two languages and shows that it outperforms the BERT-based models, the GPT-4o model, and the bag-of-words approach, highlighting its potential to improve the semantic interoperability of survey-based PGHD across language barriers.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.