JMIR Preprints #25314: Towards Utilizing Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

Towards Utilizing Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

Ari Z Klein;
Arjun Magge;
Karen O'Connor;
Ivan Flores;
Davy Weissenbacher;
Graciela Gonzalez-Hernandez

ABSTRACT

Background:

In the United States, the rapidly evolving outbreak of COVID-19, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

Objective:

The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline that collects user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention (CDC).

Methods:

Beginning January 23, 2020, we collected tweets that mention keywords related to COVID-19 from the Twitter Streaming API. We applied hand-written regular expressions to identify tweets potentially indicating that the user has been exposed to COVID-19. From the tweets that matched the regular expressions, we automatically filtered out “reported speech” (e.g., quotations, news headlines). Two annotators then annotated a random sample of 8976 of the remaining tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on pre-trained transformer models. Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1, 2020, and August 21, 2020.
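
The regular-expression matching and reported-speech filtering steps can be illustrated with a minimal Python sketch. The abstract does not list the study's hand-written expressions or filter rules, so every pattern and name here (`EXPOSURE_PATTERNS`, `REPORTED_SPEECH_PATTERNS`, `is_candidate`) is a hypothetical stand-in chosen for illustration, not the authors' implementation.

```python
import re

# Hypothetical patterns for tweets suggesting the user may have been exposed.
# The study's actual hand-written expressions are not given in the abstract.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi (?:think|believe|might|may|probably) (?:i )?(?:have|had|got)\b"
        r".*\b(?:covid|coronavirus)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(?:i|we) (?:was|were|got|have been) exposed to\b"
        r".*\b(?:covid|coronavirus)\b",
        re.IGNORECASE,
    ),
]

# Hypothetical heuristics for "reported speech" (quotations, news headlines).
REPORTED_SPEECH_PATTERNS = [
    re.compile(r"^\s*rt @", re.IGNORECASE),  # retweet of someone else's text
    re.compile(r'["“][^"”]{20,}["”]'),       # long quoted span
    re.compile(r"https?://\S+"),             # link, often a shared headline
]


def matches_exposure(text: str) -> bool:
    """Return True if any exposure pattern matches the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """Return True if the tweet looks like a quotation, retweet, or headline."""
    return any(p.search(text) for p in REPORTED_SPEECH_PATTERNS)


def is_candidate(text: str) -> bool:
    """Keep tweets that match an exposure pattern and are not reported speech."""
    return matches_exposure(text) and not is_reported_speech(text)
```

In a pipeline of this shape, only tweets for which `is_candidate` returns true would be passed on to manual annotation or to the classifier; the rule-based stage is a high-recall filter, and the classifier makes the final decision.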

Results:

Inter-annotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen’s kappa). A deep neural network classifier, based on a BERT model that was pre-trained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision = 0.76, recall = 0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have United States state-level geolocations.
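
For readers unfamiliar with the two metrics reported above, the following Python sketch shows how Cohen's kappa and the F1-score are conventionally computed. The function names and the toy labels in the usage note are illustrative assumptions; the study's annotation data are not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With the reported precision and recall, `f1_score(0.76, 0.76)` evaluates to 0.76, consistent with the stated F1-score, since the harmonic mean of two equal values is that value. On toy labels such as `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])`, observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5.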

Conclusions:

We have made the 13,714 tweets identified in this study, along with their posting dates and inferred state-level geolocations, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.

Citation

Please cite as:

Klein AZ, Magge A, O'Connor K, Flores I, Weissenbacher D, Gonzalez-Hernandez G

Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

J Med Internet Res 2021;23(1):e25314

DOI: 10.2196/25314

PMID: 33449904

PMCID: 7834613

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 27, 2020

Date Accepted: Dec 14, 2020

Date Submitted to PubMed: Jan 15, 2021