Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 6, 2018
Date Accepted: Apr 5, 2019

The final, peer-reviewed published version of this preprint can be found here:

Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach

Gu G, Zhang X, Zhu X, Jian Z, Chen K, Wen D, Gao L, Zhang S, Wang F, Ma H, Lei J

JMIR Med Inform 2019;7(2):e12704

DOI: 10.2196/12704

PMID: 31124461

PMCID: 6552449

Towards developing a consumer health vocabulary by mining health forum texts based on word embedding: a semi-automatic approach

  • Gen Gu
  • Xingting Zhang
  • Xingeng Zhu
  • Zhe Jian
  • Ken Chen
  • Dong Wen
  • Li Gao
  • Shaodian Zhang
  • Fei Wang
  • Handong Ma
  • Jianbo Lei

ABSTRACT

Background:

The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies (CHVs) have been developed to support informatics applications that bridge this gap. A CHV serves this purpose best when it evolves with consumers’ language.

Objective:

Our objective is to develop a method for identifying new terms and adding them to a CHV so that it can keep up with constantly evolving medical knowledge and language use.

Methods:

We propose a framework for finding consumer health terms based on a distributed word vector space model. We first learn word vectors from a large-scale text corpus in an unsupervised manner, and then use existing CHVs as supervision to fine-tune the learned vector representations. In the fine-tuned word vector space, we identify pairs of professional terms and their consumer variants by their semantic distance. The extracted and labeled entity pairs were then manually reviewed to validate the results generated by the proposed approach.
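The retrieval step described above, ranking candidate terms by their distance to a query term in the vector space, can be illustrated with a minimal sketch. This is not the authors' implementation: cosine similarity is assumed as the distance measure, the fine-tuning step is omitted, and the three-dimensional vectors and the names `cosine_similarity`, `rank_neighbors`, and `TOY_VECTORS` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbors(query, vectors):
    """Return every other term, sorted from most to least similar to the query."""
    q = vectors[query]
    others = [term for term in vectors if term != query]
    return sorted(
        others,
        key=lambda term: cosine_similarity(q, vectors[term]),
        reverse=True,
    )


# Hypothetical toy embeddings (assumption): real word vectors would be learned
# from a large corpus and have hundreds of dimensions.
TOY_VECTORS = {
    "myocardial infarction": [0.9, 0.1, 0.0],
    "heart attack": [0.8, 0.2, 0.1],
    "headache": [0.0, 0.9, 0.3],
    "migraine": [0.1, 0.8, 0.4],
}

# Query with a professional term; consumer variants should rank near the top.
ranked = rank_neighbors("myocardial infarction", TOY_VECTORS)
print(ranked)
```

In this toy space, "heart attack" is the nearest neighbor of "myocardial infarction". In practice, the top of such a ranked list would serve as the candidate list handed to human reviewers.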

Results:

The results were evaluated using mean reciprocal rank (MRR). Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine-tuning, and that the results are more promising in the final fine-tuned word vector space. The MRR values indicate that, on average, a professional or consumer concept is about the 14th closest to its counterpart in the word vector space without fine-tuning, and about the 8th closest in the final fine-tuned word vector space. Furthermore, the results demonstrate that our method is able to collect abbreviations and common typos frequently used by consumers.
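For readers unfamiliar with the metric, MRR averages the reciprocal of the rank at which the correct counterpart appears in each query's result list. The sketch below uses the standard definition; the query data and the names `reciprocal_rank` and `mean_reciprocal_rank` are hypothetical and are not taken from the study.

```python
def reciprocal_rank(ranked_terms, target):
    """Return 1/rank of the target in the ranked list, or 0.0 if it is absent."""
    for position, term in enumerate(ranked_terms, start=1):
        if term == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_terms, target) pairs."""
    return sum(reciprocal_rank(ranked, target) for ranked, target in queries) / len(queries)


# Hypothetical query results (assumption): each pair is a ranked candidate list
# and the known counterpart term for that query.
example_queries = [
    (["heart attack", "chest pain", "stroke"], "heart attack"),      # rank 1
    (["tummy ache", "stomach ache", "nausea"], "stomach ache"),      # rank 2
    (["flu", "cold", "fever", "high temperature"], "high temperature"),  # rank 4
]

mrr = mean_reciprocal_rank(example_queries)
print(round(mrr, 4))  # (1 + 1/2 + 1/4) / 3
```

An MRR closer to 1 means counterparts tend to appear near the top of the list. Because the metric averages reciprocals, its inverse gives only a rough sense of a typical rank position.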

Conclusions:

By integrating a large amount of text information and existing CHVs, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during CHV development.


