Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 9, 2022
Date Accepted: Jul 22, 2022
Search Term Identification Methods for Computational Health Communication: A Word Embedding and Network Approach for Health Content on YouTube
ABSTRACT
Background:
Health communication content on social media platforms often uses colloquial language. Common methods for social media data retrieval in health communication contexts typically rely only on technical language and medical vocabulary that may be unfamiliar to the public. Methods that leverage colloquial language have been use-case specific, and there is no general process for expanding standard terminology to colloquial terms.
Objective:
Motivated by this challenge, we put forward a search term identification method to improve the retrieval of health communication content on social media, using cancer screening as the subject and YouTube as the platform for a case study.
Methods:
We developed an approach that leveraged word embeddings trained on topic-specific text data to identify terms that are semantically similar to formal medical concepts. Computational textual analysis and network analysis were used to examine the newly identified videos for content novelty and connections with videos from the original concepts.
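As a minimal sketch of the neighbor-term idea described above, the following ranks vocabulary terms by cosine similarity to a seed medical concept. The vectors, the term names, and the helper `nearest_terms` are all invented for illustration; in practice the embeddings would be learned from topic-specific text with a word2vec implementation, and this sketch does not reproduce the authors' pipeline.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. Real embeddings
# would be trained on topic-specific text and have many more dimensions.
TOY_EMBEDDINGS = {
    "colonoscopy": [0.9, 0.1, 0.0],
    "colon_check": [0.8, 0.2, 0.1],
    "mammogram": [0.1, 0.9, 0.0],
    "breast_xray": [0.2, 0.8, 0.1],
    "recipe": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_terms(seed, embeddings, top_n=2):
    """Return the top_n terms most similar to `seed`, excluding the seed itself."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    for term, score in nearest_terms("colonoscopy", TOY_EMBEDDINGS):
        print(f"{term}\t{score:.3f}")
```

The top-ranked neighbors of a formal concept would then serve as candidate colloquial search terms, subject to a relevance check.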
Results:
Terms semantically similar to cancer screening tests were identified via word2vec. These neighbor terms retrieved novel and contextually diverse content beyond the original content from the medical concepts, improving recall. Precision was maintained by calculating the network degree of each video, which correlated with human judgment of whether the newly identified videos contained relevant content.
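The degree-based relevance check can be illustrated with a small sketch. The video IDs, the edge list, and the review threshold below are hypothetical, and how edges between videos are defined is an assumption of this example rather than something stated here; the sketch only shows how node degree could be used to prioritize newly retrieved videos for human review.

```python
from collections import Counter

# Hypothetical undirected edges between video IDs. What constitutes an edge
# between two videos is an assumption of this sketch.
EDGES = [
    ("seed_a", "new_1"),
    ("seed_b", "new_1"),
    ("seed_a", "seed_b"),
    ("seed_a", "new_2"),
]
# new_3 has no edges at all, so it must be listed explicitly.
NEW_VIDEOS = ["new_1", "new_2", "new_3"]


def degrees(edges, nodes):
    """Count how many edges touch each node in `nodes` (0 if none)."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return {node: counts[node] for node in nodes}


def flag_low_degree(edges, nodes, min_degree=1):
    """Return nodes whose degree falls below `min_degree` (illustrative cutoff)."""
    deg = degrees(edges, nodes)
    return [node for node in nodes if deg[node] < min_degree]


if __name__ == "__main__":
    print(degrees(EDGES, NEW_VIDEOS))
    print(flag_low_degree(EDGES, NEW_VIDEOS))
```

Under this assumption, poorly connected videos are the ones most worth routing to human coders, which is where a degree filter could save coding effort.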
Conclusions:
We discussed the benefits of the technique regarding human coding resources and outlined suggestions to improve health-related content retrieval across social media platforms.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.