Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 13, 2024
Date Accepted: Mar 20, 2025
Linguistic Markers of Pain Communication on X (Formerly Twitter) in U.S. States With High and Low Opioid Mortality: Machine Learning and Semantic Network Analysis
ABSTRACT
Background:
The opioid epidemic in the United States remains a major public health concern, with opioid-related deaths increasing more than eightfold since 1999. Chronic pain, affecting one in five U.S. adults, is a key contributor to opioid use and misuse. While prior research has explored clinical and behavioral predictors of opioid risk, less attention has been given to large-scale linguistic patterns in public discussions of pain. Social media platforms like X (formerly Twitter) offer real-time, population-level insights into how individuals express pain, distress, and coping strategies. Understanding these linguistic markers matters because they can reveal underlying psychological states, perceptions of healthcare access, and community-level opioid risk factors, offering new opportunities for early detection and targeted public health response.
Objective:
We collected 1,438,644 pain-related tweets posted between January and December 2021 using Tweepy and Snscrape. Tweets from two states with high opioid mortality (Ohio, Florida) and two states with low opioid mortality (South Dakota, North Dakota) were selected, yielding 31,994 tweets from high-death states (HDS) and 750 tweets from low-death states (LDS). Six machine learning algorithms (random forest, K-nearest neighbor, decision tree, naive Bayes, logistic regression, and support vector machine) were applied to predict state-level opioid mortality risk from linguistic features derived from Linguistic Inquiry and Word Count (LIWC). The synthetic minority oversampling technique (SMOTE) was used to address class imbalance. Evaluation metrics included accuracy, balanced accuracy, Kappa, sensitivity, specificity, precision, F1-score, and area under the receiver operating characteristic curve (AUC). Semantic network analysis was conducted to visualize co-occurrence patterns and conceptual clustering.
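As a rough illustration of how most of the listed evaluation metrics relate to one another, the sketch below derives them from the four cells of a binary confusion matrix. It assumes that "Kappa" refers to Cohen's kappa, which the abstract does not state explicitly. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Guard against empty rows or columns in the confusion matrix.
        return num / den if den else 0.0

    total = tp + fp + fn + tn
    accuracy = ratio(tp + tn, total)
    sensitivity = ratio(tp, tp + fn)   # recall on the positive class
    specificity = ratio(tn, tn + fp)   # recall on the negative class
    precision = ratio(tp, tp + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)

    # Cohen's kappa (assumed): observed agreement corrected for the agreement
    # expected by chance from the row and column marginals.
    expected = ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), total * total)
    kappa = ratio(accuracy - expected, 1 - expected)

    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
```

Balanced accuracy averages the per-class recall, so unlike plain accuracy it is not inflated when one class (here, HDS tweets) greatly outnumbers the other.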
Methods:
We collected 1,438,644 pain-related tweets from January to December 2021 using Tweepy and Snscrape. Machine learning classification models and semantic network analysis were applied to identify linguistic variations between states with high and low opioid mortality. To address class imbalance, SMOTE was applied to the training data only, while model evaluation was conducted on a held-out test set that preserved the original data distribution. A downsampled test set was also used to compare classification outcomes across different sampling strategies.
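The order of operations described above (split first, oversample the training portion only, leave the test set at its original class ratio) can be sketched as follows. This is a minimal, dependency-free approximation of SMOTE-style interpolation under stated assumptions, not the authors' pipeline: the helper names `stratified_split` and `smote`, the 80/20 split, `k = 5`, and the toy two-dimensional features are all hypothetical.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Split indices per class so the test set keeps the original class ratio."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (needs >= 2 minority samples)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


rng = random.Random(0)

# Toy data: 200 majority rows (label 1) and 20 minority rows (label 0).
X = [[rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)] for _ in range(200)]
X += [[rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(20)]
y = [1] * 200 + [0] * 20

# 1. Split before any resampling so no synthetic point leaks into the test set.
train_idx, test_idx = stratified_split(y, test_fraction=0.2, rng=rng)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

# 2. Oversample the minority class in the training split only.
minority_train = [x for x, label in zip(X_train, y_train) if label == 0]
n_needed = y_train.count(1) - y_train.count(0)
X_train_bal = X_train + smote(minority_train, n_needed, k=5, rng=rng)
y_train_bal = y_train + [0] * n_needed
```

Resampling before the split would let interpolated copies of test-set tweets influence training, which tends to overstate performance; keeping the test set untouched avoids that.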
Results:
The random forest model demonstrated the strongest predictive performance, with 94.69% accuracy, balanced accuracy of 94.69%, Kappa of 0.89, and an AUC of 0.95 (p < .001). Tweets from HDS contained significantly more affective pain words (t(31992) = 10.84, p < .001, d = 0.12), healthcare access references, and expressions of distress. LDS tweets showed greater use of authenticity markers (t(31992) = -10.04, p < .001) and proactive health-seeking language. Semantic network analysis revealed denser discourse in HDS (density = 0.28) focused on suffering and barriers to care, while LDS discourse emphasized recovery and optimism.
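For readers unfamiliar with the density statistic reported above, the following sketch shows one common way to compute it for a word co-occurrence network. It assumes an undirected, unweighted graph in which two words are linked when they appear in the same tweet, so that density is 2E / (N(N - 1)) for E edges and N nodes. The abstract does not specify how the network was built, so the tokenized examples, the `min_count` threshold, and both function names are hypothetical.

```python
from itertools import combinations


def cooccurrence_edges(documents, min_count=1):
    """Count how often each unordered word pair appears in the same document
    and keep the pairs seen at least min_count times."""
    counts = {}
    for tokens in documents:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


def network_density(edges):
    """Density of an undirected graph: observed edges / possible edges.
    Nodes are taken from the edge list, so isolated words are not counted."""
    nodes = {word for pair in edges for word in pair}
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))


# Hypothetical tokenized tweets for illustration only.
docs = [
    ["pain", "doctor", "cost"],
    ["pain", "doctor"],
    ["pain", "sleep"],
    ["hope", "recovery"],
]
edges = cooccurrence_edges(docs)
density = network_density(edges)
```

A density closer to 1 means most word pairs co-occur at least once, which is one way to read "denser discourse"; the value depends heavily on vocabulary size and on any frequency threshold applied to edges.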
Conclusions:
Our findings demonstrate that linguistic markers in publicly shared pain-related discourse show distinct and predictable differences across regions with varying opioid mortality risk. These linguistic patterns reflect underlying psychological, social, and structural factors that contribute to opioid vulnerability. Importantly, they offer a scalable, real-time resource for identifying at-risk communities. Harnessing social media language analytics can strengthen early detection systems, guide geographically targeted public health messaging, and inform policy efforts aimed at reducing opioid-related harm and improving pain management equity.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.