Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Feb 28, 2024
Date Accepted: Jul 1, 2024
Digital Epidemiology of Prescription Drug References on X/Twitter: Neural Network Topic Modeling and Sentiment Analysis
ABSTRACT
Background:
Data from the social media platform X (formerly Twitter) can provide insights into the types of language that are used when discussing drug use. In past research using latent Dirichlet allocation (LDA), we found that tweets containing “street names” of prescription drugs were difficult to classify due to the similarity to other colloquialisms and lack of clarity over how the terms were used. Conversely, “brand name” references were more amenable to machine-driven categorization.
Objective:
This study sought to use next-generation techniques (beyond LDA) from natural language processing to reprocess X data and automatically cluster groups of tweets into topics to differentiate between street and brand name datasets. We also aimed to analyze differences in emotional valence between the two datasets to study the relationship between engagement on social media and sentiment.
Methods:
We used the Twitter API (application programming interface) to collect tweets that contained street or brand names of prescription drugs. Using BERTopic in combination with UMAP and k-nearest neighbors, we generated topics for the street name corpus (n=170,618) and brand name corpus (n=245,145). VADER scores were used to classify whether tweets within topics had positive, negative, or neutral sentiment. Two different logistic regression classifiers were used to predict the sentiment label within each corpus. The first model used a tweet’s engagement metrics and topic ID to predict the label, while the second model used those features in addition to the top 5,000 tweets with the largest term frequency-inverse document frequency (TF-IDF) scores.
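As a minimal sketch of the sentiment-labeling step, the snippet below maps a VADER compound score to a positive, negative, or neutral label. The cutoffs of ±0.05 are the conventional VADER thresholds and are an assumption here, since the abstract does not state which cutoffs were used; the function names and example scores are hypothetical.

```python
from collections import Counter

# Assumed cutoffs: the conventional VADER thresholds, not confirmed by the abstract.
POSITIVE_CUTOFF = 0.05
NEGATIVE_CUTOFF = -0.05


def label_from_compound(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= POSITIVE_CUTOFF:
        return "positive"
    if compound <= NEGATIVE_CUTOFF:
        return "negative"
    return "neutral"


def label_counts(compound_scores):
    """Count how many scores fall under each sentiment label."""
    return Counter(label_from_compound(score) for score in compound_scores)


# Hypothetical compound scores for illustration only (not study data).
example_scores = [0.62, -0.40, 0.0, 0.05, -0.05, 0.01]
print(label_counts(example_scores))
```

In practice the compound scores would come from a VADER analyzer run over each tweet, and the resulting labels would serve as the target for the logistic regression classifiers.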
Results:
Using BERTopic, we identified 40 topics for the street name dataset and 5 topics for the brand name dataset, which we generalized into 8 and 5 topics of discussion, respectively. Four of the general themes of discussion in the brand name corpus referenced drug use, while 2 themes of discussion in the street name corpus referenced drug use. From the VADER scores, we found both corpora were inclined toward positive sentiment. In both corpora, adding the vectorized tweet text increased the accuracy of our models by around 40% compared to the models that did not incorporate the tweet text. This resulted in models that were accurate approximately 86% of the time across both corpora.
Conclusions:
BERTopic was able to classify tweets well. As with LDA, discussion using brand names was more similar between tweets than street name discussion. VADER scores could only be logically applied to the brand name corpus owing to the high prevalence of non-drug-related topics in the street name data. Brand name tweets discussed drugs either positively or negatively, with few posts having a neutral emotionality. From our machine learning models, engagement alone was not enough to predict the sentiment label; the added context from the tweets was needed to understand the emotionality of a tweet.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.