Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 13, 2019
Date Accepted: Dec 15, 2019

The final, peer-reviewed published version of this preprint can be found here:

Promoting Reproducible Research for Characterizing Nonmedical Use of Medications Through Data Annotation: Description of a Twitter Corpus and Guidelines

O'Connor K, Sarker A, Perrone J, Gonzalez-Hernandez G

J Med Internet Res 2020;22(2):e15861

DOI: 10.2196/15861

PMID: 32130117

PMCID: 7066507

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Monitoring Prescription Medication Abuse from Twitter: An Annotated Corpus and Annotation Guidelines for Reproducible Machine Learning Research

  • Karen O'Connor
  • Abeed Sarker
  • Jeanmarie Perrone
  • Graciela Gonzalez-Hernandez

ABSTRACT

Background:

Social media data are increasingly used for population-level health research because they provide near-real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter.

Objective:

Our primary objective in this paper is to describe the creation of an annotated corpus suitable for training supervised algorithms that automatically classify medication abuse-related chatter. We also describe the annotation strategies we used to improve inter-annotator agreement, a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus.

Methods:

We employed an iterative annotation strategy, involving inter-annotator discussions and an update of the annotation guideline at each iteration, to improve inter-annotator agreement for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into four broad classes: abuse/misuse, personal consumption, mention, and unrelated. After completing the manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and to generate baseline performance metrics for automatic classification on these data.
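Agreement between two annotators on a labeling task of this kind is commonly quantified with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' code. The function name, the short class labels, and the toy annotations are illustrative assumptions and are not taken from the corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical, for illustration only).
annotator_1 = ["abuse", "abuse", "consumption", "mention",
               "unrelated", "mention", "consumption", "abuse"]
annotator_2 = ["abuse", "consumption", "consumption", "mention",
               "unrelated", "mention", "mention", "abuse"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

In an iterative workflow, a statistic like this would be recomputed on each newly double-annotated batch after the guideline is revised.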

Results:

Our final annotated set consists of 16,433 tweets mentioning at least 20 abuse-prone medications, including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and GABA analogues. Our final overall inter-annotator agreement was 0.86 (Cohen’s kappa), which represents high agreement. The manual annotation process revealed the variety of ways in which prescription medication misuse/abuse is discussed on Twitter, including expressions indicating co-ingestion, nonmedical use, non-standard route of intake, and consumption above prescribed dosages. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.0% (95% CI: 71.4-74.5) on the test set (n=3,271).
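A confidence interval for classification accuracy can be approximated by treating each test tweet as a binomial trial. The sketch below uses the Wilson score interval, which is an assumption: the abstract does not state how its interval was computed. The count of correct predictions is also not reported, so it is back-calculated here from the rounded 73.0% accuracy and n=3,271.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


n_test = 3271                       # test set size, from the abstract
n_correct = round(0.730 * n_test)   # assumption: derived from rounded accuracy
low, high = wilson_interval(n_correct, n_test)
print(f"accuracy: {n_correct / n_test:.3f}, 95% CI: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval comes out close to the reported one; small differences are expected from rounding and from the choice of interval method.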

Conclusions:

Our manual analysis and annotation of a large number of tweets revealed the types of information posted on Twitter about a set of abuse-prone prescription medications, along with their distributions. In the interest of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.