Towards developing a consumer health vocabulary by mining health forum texts based on word embedding: a semi-automatic approach
Gen Gu;
Xingting Zhang;
Xingeng Zhu;
Zhe Jian;
Ken Chen;
Dong Wen;
Li Gao;
Shaodian Zhang;
Fei Wang;
Handong Ma;
Jianbo Lei
ABSTRACT
Background:
The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies (CHVs) have been developed to help informatics applications bridge this gap. This purpose is best served if the vocabulary evolves with consumers’ language.
Objective:
Our objective is to develop a method for identifying new terms and adding them to CHVs, so that they can keep up with constantly evolving medical knowledge and language use.
Methods:
We propose a framework for finding consumer health terms based on a distributed word vector space model. We first learn word vectors from a large-scale text corpus in an unsupervised manner and then apply a supervised method that uses existing CHVs to fine-tune the learned word vectors. In the fine-tuned word vector space, we identify pairs of professional terms and their consumer variants by their semantic distance. The extracted and labeled pairs of entities were then manually reviewed to validate the results generated by the proposed approach.
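To make the retrieval step concrete, the following is a minimal sketch of ranking candidate terms by their closeness to a query term in a vector space. It is an illustration only, not the authors' implementation: the abstract does not name the distance measure, so cosine similarity is assumed here, and the vectors in `TOY_VECTORS` as well as the helper names `cosine_similarity` and `rank_candidates` are hypothetical.

```python
import math

# Toy 3-dimensional vectors invented for illustration only. Real embeddings
# would be learned from a large text corpus and have far more dimensions.
TOY_VECTORS = {
    "hypertension": [0.90, 0.10, 0.05],
    "high blood pressure": [0.85, 0.15, 0.10],
    "myocardial infarction": [0.10, 0.90, 0.05],
    "heart attack": [0.15, 0.85, 0.10],
    "headache": [0.05, 0.10, 0.90],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, vectors):
    """Return every other term, sorted from most to least similar to query."""
    query_vec = vectors[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in vectors.items()
        if term != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Use a professional term as the query and list candidate variants.
    for term, score in rank_candidates("hypertension", TOY_VECTORS):
        print(f"{term}\t{score:.3f}")
```

In a workflow like the one described, the top-ranked candidates for each query would form the list handed to human reviewers rather than being accepted automatically.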
Results:
The results were evaluated using mean reciprocal rank (MRR). Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine-tuning, and that the results are more promising in the final fine-tuned word vector space. The MRR values indicate that, on average, a professional or consumer concept is about the 14th closest to its counterpart in the word vector space without fine-tuning and about the 8th closest in the final fine-tuned word vector space. Furthermore, the results demonstrate that our method is able to collect abbreviations and common typos frequently used by consumers.
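For readers unfamiliar with the metric, the sketch below computes MRR under its standard definition: the average, over all queries, of 1 divided by the rank at which the correct counterpart appears. The abstract does not give the exact evaluation procedure, so the function names and the example queries are assumptions made for illustration.

```python
def reciprocal_rank(ranked_terms, correct_term):
    """Return 1/rank of correct_term in ranked_terms, or 0.0 if it is absent."""
    for position, term in enumerate(ranked_terms, start=1):
        if term == correct_term:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_terms, correct_term) pairs."""
    if not queries:
        raise ValueError("at least one query is required")
    total = sum(reciprocal_rank(ranked, correct) for ranked, correct in queries)
    return total / len(queries)


if __name__ == "__main__":
    # Hypothetical queries: the counterpart appears at ranks 1, 2, and 4.
    example_queries = [
        (["high blood pressure", "headache"], "high blood pressure"),
        (["headache", "heart attack"], "heart attack"),
        (["a", "b", "c", "flu"], "flu"),
    ]
    print(mean_reciprocal_rank(example_queries))
```

Under this definition, the reciprocal of the MRR can be read as a typical rank: an MRR near 1/14 (about 0.071) corresponds to the counterpart sitting at roughly the 14th position, and an MRR near 1/8 (0.125) to roughly the 8th.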
Conclusions:
By integrating a large amount of text information and existing CHVs, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during CHV development.
Citation
Please cite as:
Gu G, Zhang X, Zhu X, Jian Z, Chen K, Wen D, Gao L, Zhang S, Wang F, Ma H, Lei J
Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach