Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jan 20, 2020
Date Accepted: Apr 19, 2020

The final, peer-reviewed published version of this preprint can be found here:

Discovering the Context of People With Disabilities: Semantic Categorization Test and Environmental Factors Mapping of Word Embeddings from Reddit

Garcia-Rudolph A, Saurí J, Cegarra B, Bernabeu Guitart M

Discovering the Context of People With Disabilities: Semantic Categorization Test and Environmental Factors Mapping of Word Embeddings from Reddit

JMIR Med Inform 2020;8(11):e17903

DOI: 10.2196/17903

PMID: 33216006

PMCID: 7718084

Discovering the Context of People With Disabilities: Semantic Categorization Test and Environmental Factors Mapping of Word Embeddings from Reddit

  • Alejandro Garcia-Rudolph; 
  • Joan Saurí; 
  • Blanca Cegarra; 
  • Montserrat Bernabeu Guitart

ABSTRACT

Background:

The World Health Organization’s International Classification of Functioning, Disability and Health (ICF) conceptualizes disability not solely as a problem that resides in the individual, but as a health experience that occurs in a context. Word embeddings build on the idea that words that occur in similar contexts tend to have similar meanings. Although both share context as a key component, word embeddings have scarcely been applied to disability. In this work, we propose using social media (Reddit) to link the two.

Objective:

The objectives of this study were to (1) extract all comments and submissions from the disability subreddit; (2) train a word2vec model on the disability subreddit comments and perform a preliminary validation using a subset of Mikolov’s analogies; (3) perform the semantic categorization test using an updated version of the Battig and Montague norms (60 categories), computing the silhouette coefficient (s) of the model for each category; (4) for each of the 5 ICF Environmental Factors (EF), select representative subcategories addressing different aspects of daily living (ADLs), identify specific terms extracted from the formal ICF definition of each subcategory, run the word2vec model to generate their nearest semantic terms, and validate the obtained terms using public evidence; and (5) apply the model to a specific EF subcategory involved in a relevant use case in the field of rehabilitation.

Methods:

Reddit data were collected from pushshift.io using the pushshiftr R package as a wrapper. The word2vec model was trained with the wordVectors R package. We used Van Overschelde’s updated and expanded version of the Battig and Montague norms for the semantic categorization test. Silhouette coefficients were calculated using the cosine distance from the wordVectors R package.
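To make the silhouette step concrete, the following is a minimal sketch in Python, not the wordVectors R code used in the study. The toy 3-dimensional vectors, the two example categories, and the function names are all hypothetical. The sketch also assumes that the silhouette of a category is the mean of the silhouette values of its member words, which the abstract does not specify.

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_distance(vec, others):
    """Average cosine distance from vec to every vector in others."""
    return sum(cosine_distance(vec, o) for o in others) / len(others)

def category_silhouettes(categories, vectors):
    """Return {category: mean silhouette of its member words}.

    Requires at least two categories. ASSUMPTION: averaging per-word
    silhouettes within a category is our choice, not the paper's stated method.
    """
    scores = {}
    for name, words in categories.items():
        per_word = []
        for w in words:
            same = [vectors[x] for x in words if x != w]
            if not same:
                per_word.append(0.0)  # usual convention for a singleton cluster
                continue
            # a: cohesion, mean distance to the other words of the same category
            a = mean_distance(vectors[w], same)
            # b: separation, mean distance to the nearest other category
            b = min(
                mean_distance(vectors[w], [vectors[x] for x in other_words])
                for other, other_words in categories.items()
                if other != name
            )
            denom = max(a, b)
            per_word.append((b - a) / denom if denom > 0 else 0.0)
        scores[name] = sum(per_word) / len(per_word)
    return scores

# Hand-made toy vectors (hypothetical, not taken from the trained model).
vectors = {
    "mother": [0.90, 0.10, 0.00],
    "father": [0.80, 0.20, 0.00],
    "sister": [0.85, 0.15, 0.05],
    "soccer": [0.10, 0.90, 0.10],
    "tennis": [0.00, 0.80, 0.20],
    "boccia": [0.10, 0.85, 0.00],
}
categories = {
    "A relative": ["mother", "father", "sister"],
    "A sport": ["soccer", "tennis", "boccia"],
}

scores = category_silhouettes(categories, vectors)
for name, s in scores.items():
    print(f"s({name}) = {s:.3f}")
```

Values close to 1 indicate that the words of a category sit near each other and far from the other categories in the embedding space; values near 0 or below indicate overlapping categories.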

Results:

We analyzed 96,314 comments posted between February 2009 and December 2019 by 10,411 Redditors. We trained a word2vec model and identified more than 30 analogies (e.g., breakfast – 8am + 8pm = dinner). The semantic categorization test over 60 categories showed promising results (e.g., s(A relative)=0.562, s(A sport)=0.475), and we found explanations for the categories with low s values. We mapped representative subcategories of all EF chapters and obtained the closest terms for each, which we confirmed with publications. The model gives immediate access (≤ 2 seconds) to terms related to ADLs, ranging from apps "to know accessibility before you go" to adapted sports (boccia). As a use case, for the Support and relationships EF subcategory, the closest term discovered by our model was resilience, which has recently been regarded as a key feature of rehabilitation but does not yet have a unified definition. Our model discovered 10 closest terms, which we validated with publications, contributing to the definition of resilience.
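The analogy reported above (breakfast – 8am + 8pm = dinner) relies on vector arithmetic followed by a nearest-neighbor search under cosine similarity. The sketch below illustrates that mechanism in Python with hand-made 3-dimensional toy vectors. The vectors, the vocabulary, and the `analogy` helper are hypothetical and are not taken from the trained model or from the wordVectors R package.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive_a, negative, positive_b, vectors, topn=3):
    """Rank words by similarity to vec(positive_a) - vec(negative) + vec(positive_b).

    The three query words are excluded from the candidates, as is common
    practice when evaluating word analogies.
    """
    target = [
        a - n + b
        for a, n, b in zip(vectors[positive_a], vectors[negative], vectors[positive_b])
    ]
    query = {positive_a, negative, positive_b}
    ranked = sorted(
        ((w, cosine_similarity(target, v)) for w, v in vectors.items() if w not in query),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]

# Toy vectors (hypothetical); dimensions loosely mean [meal, clock time, evening].
vectors = {
    "breakfast": [1.0, 0.0, -1.0],
    "lunch":     [1.0, 0.0,  0.0],
    "dinner":    [1.0, 0.0,  1.0],
    "8am":       [0.0, 1.0, -1.0],
    "noon":      [0.0, 1.0,  0.0],
    "8pm":       [0.0, 1.0,  1.0],
}

result = analogy("breakfast", "8am", "8pm", vectors)
print(result[0][0])  # → dinner
```

In a real model the match is approximate rather than exact, so the top-ranked candidates are usually inspected rather than only the first one.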

Conclusions:

This study opens up interesting opportunities for exploration and discovery by using, for the first time, a word2vec model trained on a small disability dataset. The model gives immediate access to accurate terms related to ADLs within the ICF framework, many of which were previously unknown to the authors.

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.