Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 27, 2020
Date Accepted: Dec 14, 2020
Date Submitted to PubMed: Jan 15, 2021
Towards Utilizing Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set
ABSTRACT
Background:
In the United States, the rapidly evolving outbreak of COVID-19, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.
Objective:
The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the CDC.
Methods:
Beginning January 23, 2020, we collected tweets from the Twitter Streaming API that mention keywords related to COVID-19. We applied hand-written regular expressions to identify tweets potentially indicating that the user has been exposed to COVID-19. We automatically filtered out “reported speech” (e.g., quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on pre-trained transformer models. Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1, 2020 and August 21, 2020.
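The regular-expression matching and reported-speech filtering steps described above can be illustrated with a minimal Python sketch. The patterns below are hypothetical stand-ins written for this example; the study's actual hand-written expressions and filtering rules are not given in this abstract. The function names (`matches_exposure`, `is_reported_speech`, `candidate_tweets`) are likewise assumptions.

```python
import re

# Hypothetical patterns for first-person mentions of having or being
# exposed to COVID-19. These are illustrative only, not the study's rules.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi\s+(?:think\s+i\s+)?(?:have|had|got|caught)\s+"
        r"(?:the\s+)?(?:covid|coronavirus|corona)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(?:i|we)\s+(?:was|were|got|have\s+been)\s+exposed\s+to\s+"
        r"(?:the\s+)?(?:covid|coronavirus|corona)\b",
        re.IGNORECASE,
    ),
]

# Hypothetical heuristics for "reported speech": retweets, quoted passages,
# and links (which often accompany shared news headlines).
REPORTED_SPEECH_PATTERNS = [
    re.compile(r"^\s*RT\s+@"),               # retweet prefix
    re.compile(r'["“][^"”]{10,}["”]'),       # quoted passage of 10+ characters
    re.compile(r"https?://\S+"),             # embedded URL
]


def matches_exposure(text: str) -> bool:
    """Return True if any exposure pattern matches the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """Return True if the tweet looks like a quotation, retweet, or headline."""
    return any(p.search(text) for p in REPORTED_SPEECH_PATTERNS)


def candidate_tweets(tweets: list[str]) -> list[str]:
    """Keep tweets that match an exposure pattern and are not reported speech."""
    return [t for t in tweets if matches_exposure(t) and not is_reported_speech(t)]


if __name__ == "__main__":
    sample = [
        "I think I have covid, fever and no sense of smell",
        "RT @news: I have covid, says senator",
        "Stay safe everyone, wash your hands",
        'He said "I have covid and I feel terrible" on live TV',
    ]
    for tweet in candidate_tweets(sample):
        print(tweet)
```

In a pipeline of this shape, the surviving candidates would then go to manual annotation or to the trained classifier; the rules serve only as a high-recall prefilter.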
Results:
Inter-annotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen’s kappa). A deep neural network classifier, based on a BERT model that was pre-trained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision = 0.76, recall = 0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have United States state-level geolocations.
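For readers unfamiliar with the metrics reported above, the following Python sketch shows how Cohen's kappa and the precision, recall, and F1-score are computed from binary labels. It uses the standard textbook definitions and toy labels; it is not the authors' evaluation code, and the function names are chosen for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def precision_recall_f1(gold, predicted, positive=1):
    """Precision, recall, and F1-score for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels: 1 = self-reports a potential case, 0 = does not.
    annotator_1 = [1, 1, 0, 0]
    annotator_2 = [1, 0, 0, 0]
    print(cohens_kappa(annotator_1, annotator_2))

    gold = [1, 1, 0, 0]
    predicted = [1, 0, 0, 1]
    print(precision_recall_f1(gold, predicted))
```

Because F1 is the harmonic mean of precision and recall, it equals both whenever they are equal, which is consistent with the reported values of 0.76 for all three.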
Conclusions:
We have made the 13,714 tweets identified in this study, along with their posting dates and inferred state-level geolocations, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.