Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Apr 21, 2020
Date Accepted: Jun 3, 2020
Date Submitted to PubMed: Jun 3, 2020

The final, peer-reviewed published version of this preprint can be found here:

Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study

Mackey T, Purushothaman V, Li J, Shah N, Nali M, Bardier C, Liang B, Cai M, Cuomo R

Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study

JMIR Public Health Surveill 2020;6(2):e19509

DOI: 10.2196/19509

PMID: 32490846

PMCID: 7282475

Machine Learning to Detect Self-Reporting of Symptoms, Testing Access and Recovery Associated with COVID-19 on Twitter: A Retrospective Big-Data Infoveillance Study

  • Tim Mackey
  • Vidya Purushothaman
  • Jiawei Li
  • Neal Shah
  • Matthew Nali
  • Cortni Bardier
  • Bryan Liang
  • Mingxiang Cai
  • Raphael Cuomo

ABSTRACT

Background:

The coronavirus disease (COVID-19) pandemic is a rapidly spreading global event, with close to 2.5 million cases as of mid-April 2020, representing an outbreak of historic scope with an accelerating trajectory. However, there are ongoing concerns about the accuracy of COVID-19 case counts due to issues such as lack of access to testing and difficulty in measuring recoveries.

Objective:

This study aimed to detect and characterize user-generated conversations about COVID-19-related symptoms, experiences with access to testing, and mentions of recovery using an unsupervised machine learning approach.

Methods:

Tweets were collected through the Twitter public API from March 3 to 20, 2020, filtered for general COVID-19-related keywords, and then further filtered for terms related to COVID-19 symptoms. After data cleaning and processing, tweets were analyzed using an unsupervised machine learning approach called the biterm topic model (BTM), in which groups of tweets containing the same word-related themes were separated into topic clusters related to COVID-19 symptoms, testing, and recovery conversations. Tweets in these clusters were then extracted and manually annotated for content analysis, and then analyzed for statistical and geographic characteristics.
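The two-stage keyword filter and the word-pair ("biterm") unit that BTM is built on can be illustrated with a minimal sketch. It is not the authors' pipeline: the keyword sets `GENERAL_TERMS` and `SYMPTOM_TERMS`, the tokenizer, and all function names are hypothetical, and the sketch only counts biterms rather than fitting a topic model. It also simplifies by pairing distinct words only and by skipping stop-word removal.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets; the study's actual search terms are not listed here.
GENERAL_TERMS = {"covid19", "coronavirus", "covid"}
SYMPTOM_TERMS = {"fever", "cough", "symptoms"}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase a tweet and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def passes_filters(tokens):
    """Keep a tweet only if it has a general COVID-19 term and a symptom term."""
    token_set = set(tokens)
    return bool(token_set & GENERAL_TERMS) and bool(token_set & SYMPTOM_TERMS)


def biterms(tokens):
    """Return every unordered pair of distinct words in one short text."""
    unique = sorted(set(tokens))
    return list(combinations(unique, 2))


def count_biterms(tweets):
    """Count biterms over all tweets that pass both keyword filters."""
    counts = Counter()
    for text in tweets:
        tokens = tokenize(text)
        if passes_filters(tokens):
            counts.update(biterms(tokens))
    return counts
```

A topic model such as BTM would then be fitted to the corpus-wide collection of biterms instead of to individual documents, which is what makes the approach suited to very short texts like tweets.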

Results:

A total of 4,492,954 tweets were collected that contained COVID-19-related symptom terms. After using BTM to identify relevant COVID-19 clusters and removing duplicate tweets, we identified a total of 3,465 tweets (<1%) that included user-generated conversations about experiences perceived to be related to COVID-19. These tweets were grouped into five main categories: first- and second-hand reports of COVID-19-related symptoms, symptom reporting concurrent with lack of access to testing, discussion of recovery, confirmation of a negative COVID-19 diagnosis after receiving testing, and users recalling past symptoms and questioning whether they had previously been infected with COVID-19. Co-occurrence of themes was statistically significant for users reporting symptoms with lack of testing and with discussion of recovery. Sixty-three percent (n=1112) of tweets with geospatial coordinates were from the United States.

Conclusions:

In this study, we analyzed Twitter to characterize conversations about self-reported COVID-19-related symptoms, access to testing, and experiences with purported recovery, for the purposes of digital contact tracing. It appears that many users reported COVID-19-related symptoms but were never tested because of lack of access. However, it is unclear how many of these users were actual cases, and in the absence of further testing, accurate case estimates may never be known. Future studies should continue to explore the utility of social media and other forms of electronic data to estimate COVID-19 disease severity.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.