Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 3, 2020
Date Accepted: Apr 19, 2021
Date Submitted to PubMed: Apr 23, 2021
Semantic Linkages of Obsessions: Clustering and Frequencies of Obsessional Symptoms from a Large International Obsessive-Compulsive Disorder Mobile Application Dataset
ABSTRACT
Background:
Obsessive-compulsive disorder (OCD) is characterized by recurrent intrusive thoughts, urges, or images (obsessions) and repetitive physical or mental behaviors (compulsions). While specific obsessions and compulsions can manifest in vastly different ways, previous factor analytic and clustering studies suggest the presence of three or four “subtypes” of OCD symptoms. Yet, these studies have relied on predefined symptom checklists, which are limited in breadth and may be biased towards researchers’ prior conceptualizations of OCD.
Objective:
As an alternative approach to uncovering potential OCD subtypes, we examined a large data set of freely reported obsession symptoms obtained from an OCD mobile app. From these data, we derived data-driven clusters of obsessions based on their latent semantic relationships in the English language, using word embedding, a natural language processing technique.
Methods:
We extracted free-text entry words describing obsessions in a large sample of users of the mobile application “NOCD” who self-identified as having OCD. Semantic vector space modeling was applied using the Global Vectors for Word Representation (GloVe) algorithm, an unsupervised learning algorithm for obtaining vector representations for words based on word-word co-occurrence statistics from a 6 billion word corpus. A domain-specific extension, “Mittens,” was also applied to enhance the corpus with OCD-specific words. After cleaning the obsession words, we created a word co-occurrence matrix. The resulting representations provided linear substructures of the word vector space in 100 dimensions. We applied principal components analysis to the 100-dimensional vector representations of the most frequent words, followed by k-means clustering to obtain clusters of related words.
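The co-occurrence step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `cooccurrence_counts`, the window size of 2, the unweighted counting, and the toy entries are all assumptions made for illustration, since the abstract does not specify these details.

```python
from collections import Counter


def cooccurrence_counts(entries, window=2):
    """Count symmetric word-word co-occurrences within a fixed window.

    entries: list of token lists, one per cleaned free-text entry.
    window:  hypothetical context size; the abstract does not state one.
    """
    counts = Counter()
    for tokens in entries:
        for i, word in enumerate(tokens):
            # Look ahead only, then record the pair in both directions
            # so the resulting matrix is symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                counts[(word, other)] += 1
                counts[(other, word)] += 1
    return counts


# Made-up toy entries, not data from the study.
entries = [
    ["fear", "of", "germs"],
    ["fear", "of", "harm"],
    ["germs", "on", "hands"],
]
counts = cooccurrence_counts(entries)
print(counts[("fear", "of")], counts[("fear", "germs")])
```

A sparse `Counter` keyed by word pairs is used here because most word pairs never co-occur; an embedding method such as GloVe would then be fitted to counts of this kind.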
Results:
We obtained 7,001 unique words representing obsessions from 25,369 individuals. Heuristics for determining the optimal number of clusters pointed to a three-cluster solution, with themes relating to doubt/checking, contamination/somatic/physical harm/sexual harm, and relationship/just-right. All three clusters showed relatively close semantic relationships to each other in a central area of convergence, with themes relating to harm. An equal-sized split-sample analysis across individuals and a split-sample analysis over time both showed overall stable cluster solutions. Words in the contamination/somatic/physical harm/sexual harm cluster were the most frequently occurring, followed by words in the relationship/just-right cluster.
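As a rough illustration of how a cluster count might be chosen, the sketch below runs a plain k-means (Lloyd's algorithm) for several values of k and reports the within-cluster sum of squares, which is the quantity inspected in the common "elbow" heuristic. The abstract does not say which heuristics were used, so the elbow criterion, the deterministic farthest-point seeding, and the toy 2-D points (standing in for dimension-reduced word vectors) are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Seed with the point farthest from its nearest existing center.
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the center to the mean of its members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


# Toy points forming three well-separated groups (illustrative only).
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

inertia_by_k = {k: kmeans(points, k)[1] for k in range(1, 5)}
labels3, _ = kmeans(points, 3)
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

On data like this, the within-cluster sum of squares falls steeply up to k = 3 and only slightly afterwards, which is the pattern the elbow heuristic looks for. Real word vectors rarely separate this cleanly, so such a heuristic is usually read alongside others.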
Conclusions:
Clustering of naturalistically acquired obsessional words resulted in three major groupings of semantic themes, which partially overlap with previous studies’ results using predefined checklists. Further, the closeness of the overall embedded relationships across clusters and their central convergence on harm suggest that, at least at the level of self-reported obsessional thoughts, the majority of obsessions have close semantic relationships. Harm to self or others may be an underlying organizing theme across many obsessions. Notably, “relationship”-themed words, not previously included in factor analytic studies, clustered with “just-right” words. These novel insights have potential implications for understanding how an apparent multitude of obsessional symptoms are connected by underlying themes. This could aid in exposure-based treatment approaches and could be used as a conceptual framework for future research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.