Accepted for/Published in: JMIR Infodemiology

Date Submitted: May 10, 2025
Date Accepted: Sep 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

Monitoring Opioid-Related Social Media Chatter Using Natural Language Processing and Large Language Models: Temporal Analysis

Sidorov G, Ahmad M, Basile P, Waqas M, Orji R, Batyrshin I

Monitoring Opioid-Related Social Media Chatter Using Natural Language Processing and Large Language Models: Temporal Analysis

JMIR Infodemiology 2025;5:e77279

DOI: 10.2196/77279

PMID: 41187282

PMCID: 12585000

Monitoring Opioid-Related Social Media Chatters Using NLP and Large Language Model: A Temporal Analysis

  • Grigori Sidorov
  • Muhammad Ahmad
  • Pierpaolo Basile
  • Muhammad Waqas
  • Rita Orji
  • Ildar Batyrshin

ABSTRACT

Background:

Opioid overdose has become a global public health emergency, with the United States experiencing particularly high rates of morbidity and mortality due to both prescription and illicit opioid use. Traditional public health monitoring systems often fail to provide real-time insights, limiting their capacity for early detection and intervention. Social media platforms, especially Reddit, offer a promising alternative for timely toxicovigilance due to the abundance of user-generated, real-time content.

Objective:

This study aims to explore the use of Reddit as a real-time, high-volume source for toxicovigilance and to develop an automated system that can classify and analyze opioid-related social media posts to detect behavioral patterns and monitor the evolution of public discourse on opioid use.

Methods:

To investigate how social media discourse around opioid use has evolved, we collected a large-scale dataset from Reddit spanning six years, from January 1, 2018, to December 30, 2023. Using a comprehensive opioid lexicon (formal drug names, street slang, common misspellings, and abbreviations), we filtered relevant posts for further analysis. A subset of these data was manually annotated, following well-defined annotation guidelines, into four categories: Self-abuse (the author describes their own experience with opioid use or overdose), External-abuse (use by someone close to the author, such as a friend or family member), Information (general or factual knowledge about opioids), and Unrelated (content not contextually relevant to opioid use). The distribution across categories was 37.21% Self-abuse, 27.25% External-abuse, 27.57% Information, and 7.97% Unrelated. To automate the classification of opioid-related posts, we developed an NLP pipeline built on classical machine learning algorithms, deep learning models, and transformer-based architectures, and we fine-tuned a large language model (OpenAI GPT-3.5 Turbo). In the final stage, the fine-tuned LLM was applied to an unlabeled dataset of 74,975 additional Reddit posts. This enabled a detailed temporal analysis of opioid-related discussions over the six-year period, uncovering trends and shifts in public perception, self-reported use, externally reported use, and information sharing about opioids. The methodology demonstrates the value of combining manual annotation with fine-tuned language models for real-time toxicovigilance and public health monitoring.
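The lexicon filtering and temporal aggregation steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the lexicon entries, the function names `mentions_opioid` and `yearly_label_counts`, and the sample posts are all hypothetical, and in the study the category labels come from a fine-tuned LLM rather than being supplied by hand.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical mini-lexicon for illustration only; the study's actual
# lexicon (drug names, slang, misspellings, abbreviations) is not listed
# in the abstract.
OPIOID_LEXICON = ["oxycodone", "oxy", "fentanyl", "fent", "heroin", "percs"]

# Word boundaries keep short slang such as "oxy" from matching "oxygen".
LEXICON_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in OPIOID_LEXICON) + r")\b",
    re.IGNORECASE,
)


def mentions_opioid(text: str) -> bool:
    """Return True if the text contains at least one lexicon term."""
    return LEXICON_PATTERN.search(text) is not None


def yearly_label_counts(posts):
    """Count labels per year from an iterable of (posting date, label) pairs."""
    counts = {}
    for posted_on, label in posts:
        counts.setdefault(posted_on.year, Counter())[label] += 1
    return counts


# Made-up posts, with labels assumed to come from a classifier.
sample_posts = [
    (date(2018, 3, 1), "Self-abuse"),
    (date(2018, 7, 9), "Self-abuse"),
    (date(2018, 11, 2), "Information"),
    (date(2019, 1, 15), "External-abuse"),
]
trend = yearly_label_counts(sample_posts)
```

Grouping by year keeps the example short; a finer granularity such as month would follow the same pattern with a different grouping key.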

Results:

The fine-tuned GPT-3.5 Turbo model achieved the highest classification accuracy (0.86), outperforming the transformer-based mBERT model (0.81), a relative improvement of 6.17%. The temporal analysis of the unlabeled data revealed evolving trends in opioid-related discussions, indicating shifts in user behavior and overdose-related chatter over time.
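The 6.17% figure is consistent with a relative (not absolute) gain over the mBERT accuracy. The short check below uses the two accuracies reported above; the function name `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, expressed in percent."""
    return (new - baseline) / baseline * 100.0


gpt_accuracy = 0.86    # fine-tuned GPT-3.5 Turbo, as reported in the abstract
mbert_accuracy = 0.81  # mBERT, as reported in the abstract

gain = relative_improvement(gpt_accuracy, mbert_accuracy)
print(f"{gain:.2f}%")  # prints 6.17%
```

In absolute terms the difference is 5 percentage points of accuracy.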

Conclusions:

This study demonstrates the potential of integrating advanced NLP techniques and LLMs with social media data to support real-time public health surveillance. Reddit provides a valuable platform for identifying emerging trends in opioid use and overdose risk. The proposed system offers a proactive tool for researchers, clinicians, and policymakers to better understand and respond to the opioid crisis.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.