JMIR Preprints #12704: Towards developing a consumer health vocabulary by mining health forum texts based on word embedding: a semi-automatic approach

Towards developing a consumer health vocabulary by mining health forum texts based on word embedding: a semi-automatic approach

Gen Gu;
Xingting Zhang;
Xingeng Zhu;
Zhe Jian;
Ken Chen;
Dong Wen;
Li Gao;
Shaodian Zhang;
Fei Wang;
Handong Ma;
Jianbo Lei

ABSTRACT

Background:

The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies (CHVs) have been developed to help informatics applications bridge this gap. A CHV serves this purpose best when it evolves with consumers’ language.

Objective:

Our objective is to develop a method for identifying and adding new terms to a CHV so that it can keep up with constantly evolving medical knowledge and language use.

Methods:

We propose a framework for finding consumer health terms based on a distributed word vector space model. We first learn word vectors from a large-scale text corpus in an unsupervised manner, and then apply a supervised method that uses existing CHVs to fine-tune the learned vector representations. In the fine-tuned word vector space, we identify pairs of professional terms and their consumer variants by their semantic distance. The extracted and labeled pairs of entities were then manually reviewed to validate the results generated by the proposed approach.
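The retrieval step described above, ranking candidate terms by their distance to a query term in the vector space, can be illustrated with a minimal sketch. It assumes cosine similarity as the measure (the abstract says only "semantic distance"), and the function names and 3-dimensional toy vectors are hypothetical, invented for illustration rather than taken from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, vectors):
    """Return every other term, sorted from most to least similar to the query."""
    query_vec = vectors[query]
    scored = [
        (term, cosine_similarity(query_vec, vec))
        for term, vec in vectors.items()
        if term != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, not learned from any corpus).
toy_vectors = {
    "myocardial infarction": [0.9, 0.1, 0.0],
    "heart attack": [0.8, 0.2, 0.1],
    "headache": [0.0, 0.9, 0.3],
    "migraine": [0.1, 0.8, 0.4],
}

# Use a professional term as the query and list candidate consumer variants.
ranked = rank_candidates("myocardial infarction", toy_vectors)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

In a setting like the one described, the top-ranked candidates would form the list passed to human reviewers; the sketch does not cover the embedding training or the supervised fine-tuning.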

Results:

The results are evaluated using mean reciprocal rank (MRR). Manual evaluation shows that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine-tuning, and that the results are more promising in the final fine-tuned word vector space. The MRR values indicate that, on average, a professional or consumer concept is about the 14th closest to its counterpart in the word vector space without fine-tuning, improving to about the 8th closest in the final fine-tuned word vector space. Furthermore, the results demonstrate that our method is able to collect abbreviations and common typos frequently used by consumers.
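For readers unfamiliar with the metric, the following minimal sketch shows how MRR is conventionally computed: each query contributes the reciprocal of the rank at which its correct counterpart appears, and these values are averaged. The function names and example rankings are hypothetical. Reading 1/MRR as an effective average rank is one common interpretation; whether the authors derived their rank figures this way is an assumption.

```python
def reciprocal_rank(ranked_terms, gold_term):
    """Return 1/rank of the gold term in the ranked list, or 0.0 if it is absent."""
    for position, term in enumerate(ranked_terms, start=1):
        if term == gold_term:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, gold_terms):
    """Average the reciprocal ranks over all queries."""
    scores = [reciprocal_rank(r, g) for r, g in zip(rankings, gold_terms)]
    return sum(scores) / len(scores)


# Invented rankings: the gold counterpart sits at rank 1, 2, and 4 respectively.
rankings = [
    ["heart attack", "chest pain", "stroke"],
    ["flu", "high blood pressure", "fever"],
    ["rash", "itch", "hives", "shingles"],
]
gold_terms = ["heart attack", "high blood pressure", "shingles"]

mrr = mean_reciprocal_rank(rankings, gold_terms)
effective_rank = 1.0 / mrr  # harmonic-mean style "average rank" (assumed reading)
print(f"MRR = {mrr:.3f}, effective rank = {effective_rank:.2f}")
```

Because MRR always lies between 0 and 1, a higher value means the counterpart tends to appear nearer the top of the candidate list.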

Conclusions:

By integrating a large amount of text information and existing CHVs, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during CHV development.

Citation

Please cite as:

Gu G, Zhang X, Zhu X, Jian Z, Chen K, Wen D, Gao L, Zhang S, Wang F, Ma H, Lei J

Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach

JMIR Med Inform 2019;7(2):e12704

DOI: 10.2196/12704

PMID: 31124461

PMCID: 6552449

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 6, 2018

Date Accepted: Apr 5, 2019
