Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: May 28, 2020
Date Accepted: Aug 3, 2020
Date Submitted to PubMed: Aug 4, 2020

The final, peer-reviewed published version of this preprint can be found here:

Big Data, Natural Language Processing, and Deep Learning to Detect and Characterize Illicit COVID-19 Product Sales: Infoveillance Study on Twitter and Instagram

Mackey T, Li J, Purushothaman V, Nali M, Shah N, Bardier C, Cai M, Liang B

JMIR Public Health Surveill 2020;6(3):e20794

DOI: 10.2196/20794

PMID: 32750006

PMCID: 7451110

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Big Data, Natural Language Processing, and Deep Learning to Detect and Characterize Illicit COVID-19 Product Sales: An Infoveillance Study on Twitter and Instagram

  • Tim Mackey
  • Jiawei Li
  • Vidya Purushothaman
  • Matthew Nali
  • Neal Shah
  • Cortni Bardier
  • Mingxiang Cai
  • Bryan Liang

ABSTRACT

Background:

The COVID-19 pandemic is perhaps the greatest global health challenge of the last century. Accompanying this pandemic is a parallel “infodemic” that includes the online marketing and sale of unapproved, illegal, and counterfeit COVID-19 health products, such as testing kits, treatments, and other questionable “cures”. The proliferation of this content is enabled by the growing ubiquity of Internet-based technologies, including popular social media platforms that now have billions of users worldwide.

Objective:

To collect, analyze, identify and enable reporting of suspected fake, counterfeit, and unapproved COVID-19-related healthcare products from Twitter and Instagram.

Methods:

The study was conducted in two phases. The first phase collected COVID-19-related Twitter and Instagram posts by web scraping Instagram and by filtering the public streaming Twitter API for keywords associated with suspect marketing and sale of COVID-19 products. The second phase analyzed the collected data using natural language processing and deep learning to identify potential sellers, which were then manually annotated for characteristics of interest. We also visualized illegal selling posts on a customized data dashboard to enable public health intelligence.
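
The keyword-filtering step in the first phase can be illustrated with a minimal sketch. This is not the authors' pipeline: the term lists, the function names (`tokenize`, `is_candidate`, `filter_candidates`), and the sample posts are hypothetical and chosen only for illustration, and the abstract does not state which keywords were actually used. The sketch keeps a post only if it mentions both a COVID-19 term and a selling term, which is one simple way to narrow a large stream before a classifier and manual annotation are applied.

```python
import re

# Hypothetical term lists for illustration only; the study's actual
# keywords are not given in the abstract.
COVID_TERMS = {"covid", "covid19", "coronavirus"}
SALE_TERMS = {"buy", "order", "sale", "price", "shipping"}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text, drop hyphens so "COVID-19" becomes "covid19",
    and return the set of alphanumeric tokens."""
    return set(TOKEN_RE.findall(text.lower().replace("-", "")))


def is_candidate(text):
    """Return True if the post contains at least one COVID-19 term
    and at least one selling term."""
    tokens = tokenize(text)
    return bool(tokens & COVID_TERMS) and bool(tokens & SALE_TERMS)


def filter_candidates(posts):
    """Keep only posts that pass the keyword co-occurrence filter."""
    return [post for post in posts if is_candidate(post)]


# Made-up example posts, not data from the study.
posts = [
    "Buy COVID-19 test kits now, fast shipping, DM to order",
    "Stay home and wash your hands during the coronavirus outbreak",
    "Big sale on running shoes this weekend",
]
candidates = filter_candidates(posts)
print(candidates)
```

A filter like this is deliberately broad: it would still pass many legitimate posts, which is why a second phase of classification and manual annotation, as described above, is needed to isolate actual sellers.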

Results:

We collected a total of 6,029,323 tweets and 204,597 Instagram posts filtered for terms associated with suspect marketing and sale of COVID-19 health products, covering March to April for Twitter and February to May for Instagram. After applying our NLP and deep learning approaches, we identified 1,271 tweets and 596 Instagram posts associated with questionable sales of COVID-19-related products. Generally, product introduction came in three waves that followed news coverage about product developments: the first consisted of questionable immunity-boosting treatments, the second involved suspect testing kits, and the third involved pharmaceuticals that have not been approved for COVID-19 treatment. Other major themes detected included accounts with descriptive COVID-19 accounts, products offered in different languages, various claims of product credibility, unsubstantiated products, unapproved testing modalities, and different payment and seller contact methods.

Conclusions:

Results from this study provide initial insight into one front of the “infodemic” fight against COVID-19 by characterizing the types of health products, selling claims, and sellers that are active on two popular social media platforms. The challenge of combating this form of cybercrime is likely to continue as the pandemic progresses and more people seek access to COVID-19 information and treatment. Visualization of detected sellers and identification of their social media communication strategies can provide needed intelligence to public health agencies, regulatory authorities, legitimate manufacturers, and technology platforms, helping them remove this content and prevent it from harming the public.

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.