Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Feb 28, 2024
Date Accepted: Jul 1, 2024

The final, peer-reviewed published version of this preprint can be found here:

Rao VK, Valdez D, Muralidharan R, Agley J, Eddens KS, Dendukuri A, Panth V, Parker MA

Digital Epidemiology of Prescription Drug References on X (Formerly Twitter): Neural Network Topic Modeling and Sentiment Analysis

J Med Internet Res 2024;26:e57885

DOI: 10.2196/57885

PMID: 39178036

PMCID: 11380061

Digital Epidemiology of Prescription Drug References on X/Twitter: Neural Network Topic Modeling and Sentiment Analysis

  • Varun K Rao
  • Danny Valdez
  • Rasika Muralidharan
  • Jon Agley
  • Kate S Eddens
  • Aravind Dendukuri
  • Vandana Panth
  • Maria A Parker

ABSTRACT

Background:

Data from the social media platform X (formerly Twitter) can provide insights into the types of language that are used when discussing drug use. In past research using latent Dirichlet allocation (LDA), we found that tweets containing “street names” of prescription drugs were difficult to classify because the terms resemble other colloquialisms and it was often unclear how they were being used. Conversely, “brand name” references were more amenable to machine-driven categorization.

Objective:

This study sought to use next-generation natural language processing techniques (beyond LDA) to reprocess X data and automatically cluster tweets into topics in order to differentiate between the street name and brand name datasets. We also aimed to analyze differences in emotional valence between the two datasets to study the relationship between engagement on social media and sentiment.

Methods:

We used the Twitter API (application programming interface) to collect tweets that contained street or brand names of prescription drugs. Using BERTopic in combination with UMAP (Uniform Manifold Approximation and Projection) and k-nearest neighbors, we generated topics for the street name corpus (n=170,618) and brand name corpus (n=245,145). VADER (Valence Aware Dictionary and sEntiment Reasoner) scores were used to classify tweets within topics as having positive, negative, or neutral sentiment. Two logistic regression classifiers were used to predict the sentiment label within each corpus. The first model used a tweet’s engagement metrics and topic ID to predict the label, while the second model used those features in addition to the top 5,000 tweets with the largest term frequency-inverse document frequency (TF-IDF) score.
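As a rough illustration of two steps named above, the following Python sketch maps a VADER-style compound score to a sentiment label and computes simple TF-IDF weights. It is not the authors' code. The 0.05 cutoff is the convention commonly documented for VADER and is assumed here, since the abstract does not state the thresholds used; the function names, the whitespace tokenizer, the unsmoothed TF-IDF formula, and the toy texts are all hypothetical.

```python
import math
from collections import Counter


def sentiment_label(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a label.

    Assumption: the +/-0.05 cutoff follows the usual VADER convention;
    the abstract does not say which cutoff the study used.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This is a plain, unsmoothed variant chosen for readability.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(doc_weights, k):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(doc_weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Invented example texts, not data from the study.
toy_docs = [
    "picked up my prescription today",
    "my prescription refill was delayed today",
]
for doc, doc_weights in zip(toy_docs, tfidf(toy_docs)):
    print(doc, "->", top_terms(doc_weights, 2))
```

In this variant, a term that appears in every document gets a weight of zero, which is why shared words drop out of the top-ranked terms. Production vectorizers typically add smoothing and normalization, so their weights will differ from this sketch.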

Results:

Using BERTopic, we identified 40 topics for the street name dataset and 5 topics for the brand name dataset, which we generalized into 8 and 5 topics of discussion, respectively. Four of the general themes of discussion in the brand name corpus referenced drug use, while 2 themes of discussion in the street name corpus referenced drug use. From the VADER scores, we found that both corpora were inclined toward positive sentiment. In both corpora, adding the vectorized tweet text increased the accuracy of our models by around 40% compared to the models that did not incorporate the tweet text. This resulted in models that were accurate approximately 86% of the time across the two corpora.
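The accuracy comparison above can be read in two ways, and the abstract does not say which is meant: a gain in percentage points (absolute) or a gain relative to the baseline. The following Python sketch shows how both would be computed. The function names and the label lists are invented for illustration and are not the study's data or results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_gains(baseline, improved):
    """Return (absolute gain, relative gain) between two accuracies."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    absolute = improved - baseline
    return absolute, absolute / baseline


# Invented labels: one model without text features, one with them.
true_labels = ["positive", "negative", "neutral", "positive", "negative"]
without_text = ["positive", "positive", "positive", "positive", "positive"]
with_text = ["positive", "negative", "positive", "positive", "negative"]

base = accuracy(true_labels, without_text)
better = accuracy(true_labels, with_text)
absolute, relative = accuracy_gains(base, better)
print(f"baseline={base:.2f} improved={better:.2f} "
      f"absolute={absolute:.2f} relative={relative:.2f}")
```

The two readings can differ substantially for the same pair of models, so a reported gain is easier to interpret when the baseline accuracy is stated alongside it.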

Conclusions:

BERTopic was able to classify tweets well. As with LDA, discussion using brand names was more similar between tweets than street name discussion. VADER scores could only be logically applied to the brand name corpus owing to the high prevalence of non-drug-related topics in the street name data. Brand name tweets discussed drugs either positively or negatively, with few posts having neutral emotionality. In our machine learning models, engagement alone was not enough to predict the sentiment label; the added context from the tweet text was needed to understand the emotionality of a tweet.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.