Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 27, 2020
Date Accepted: Dec 14, 2020
Date Submitted to PubMed: Jan 15, 2021

The final, peer-reviewed published version of this preprint can be found here:

Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

Klein AZ, Magge A, O'Connor K, Flores I, Weissenbacher D, Gonzalez-Hernandez G

J Med Internet Res 2021;23(1):e25314

DOI: 10.2196/25314

PMID: 33449904

PMCID: 7834613

Towards Utilizing Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

  • Ari Z Klein
  • Arjun Magge
  • Karen O'Connor
  • Ivan Flores
  • Davy Weissenbacher
  • Graciela Gonzalez-Hernandez

ABSTRACT

Background:

In the United States, the rapidly evolving outbreak of COVID-19, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

Objective:

The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the CDC.

Methods:

Beginning January 23, 2020, we collected tweets from the Twitter Streaming API that mention keywords related to COVID-19. We applied hand-written regular expressions to identify tweets potentially indicating that the user has been exposed to COVID-19. We automatically filtered out “reported speech” (e.g., quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on pre-trained transformer models. Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1, 2020 and August 21, 2020.
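As a rough illustration of the first two filtering steps described above (regular-expression matching followed by removal of reported speech), the following is a minimal sketch. The patterns and the reported-speech heuristic are hypothetical stand-ins chosen for this example; the abstract does not list the actual hand-written expressions or filtering rules used in the study.

```python
import re

# Hypothetical patterns for illustration only; the study's actual
# hand-written regular expressions are not given in the abstract.
EXPOSURE_PATTERNS = [
    re.compile(r"\bi (?:think|believe) i (?:have|had|caught) "
               r"(?:covid|coronavirus|corona)\b", re.I),
    re.compile(r"\bi (?:was|have been|got) exposed to "
               r"(?:covid|coronavirus|corona)\b", re.I),
]

# Assumed, deliberately crude signals of "reported speech":
# retweets, quotation marks, or a link (often a shared headline).
REPORTED_SPEECH = re.compile(r"^RT @|[\"\u201c\u201d]|https?://", re.I)


def matches_exposure(text):
    """True if any illustrative exposure pattern matches the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text):
    """True if the tweet looks like a quotation, retweet, or headline share."""
    return REPORTED_SPEECH.search(text) is not None


def candidate_tweets(tweets):
    """Keep tweets that match a pattern and are not reported speech."""
    return [t for t in tweets
            if matches_exposure(t) and not is_reported_speech(t)]
```

In the pipeline described above, tweets that survive this kind of filtering would then go to manual annotation or to the trained classifier; the sketch covers only the rule-based stage.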

Results:

Inter-annotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen’s kappa). A deep neural network classifier, based on a BERT model that was pre-trained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision = 0.76, recall = 0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have United States state-level geolocations.
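For readers unfamiliar with the metrics reported above, the sketch below shows how Cohen's kappa and the F1-score are computed. The toy label lists are invented for this example and are not the study's annotations; only the precision and recall values (0.76) and the tweet counts (3644 of 8976) come from the abstract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy annotations (made up): 1 = self-reports a potential case, 0 = does not.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
toy_kappa = cohen_kappa(annotator_1, annotator_2)

# Values taken from the abstract.
reported_f1 = f1_score(0.76, 0.76)
dual_annotated_share = 3644 / 8976
```

Because precision and recall are equal here, their harmonic mean equals the same value, which is consistent with the reported F1-score of 0.76. Likewise, 3644 of 8976 is roughly 40.6%, which rounds to the stated 41%.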

Conclusions:

We have made the 13,714 tweets identified in this study, along with their posting dates and inferred state-level geolocations, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.