Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 16, 2020
Open Peer Review Period: Apr 16, 2020 - Apr 23, 2020
Date Accepted: May 25, 2020
Date Submitted to PubMed: May 26, 2020
Using Reports of Own and Others’ Symptoms and Diagnosis on Social Media to Predict COVID-19 Case Counts: Observational Infoveillance Study in Mainland China
ABSTRACT
Background:
COVID-19 has already affected more than 200 countries and territories worldwide. It poses an extraordinary challenge for public health systems because screening and surveillance capacity is often severely limited, especially at the beginning of the outbreak, which fuels transmission as many patients unknowingly infect others.
Objective:
We present an effort to collect and analyze COVID-19 related posts on Weibo, a popular Twitter-like social media site in China. To our knowledge, this infoveillance study employs the largest, most comprehensive, and most fine-grained social media data to date to predict COVID-19 case counts in mainland China.
Methods:
We built a Weibo user pool of 250 million, approximately half of the entire monthly active Weibo user population. Using a comprehensive list of 167 keywords, we retrieved and analyzed around 15 million COVID-19 related posts from our user pool, from November 1, 2019 to March 31, 2020. We developed a machine learning classifier to identify “sick posts,” which are reports of one’s own and other people’s symptoms and diagnosis related to COVID-19. Using officially reported case counts as the outcome, we then estimated the Granger causality of sick posts and other COVID-19 posts on daily case counts. For a subset of geotagged posts (3.10% of all retrieved posts), we also ran separate predictive models for Hubei province, the epicenter of the initial outbreak, and the rest of mainland China.
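As an illustration of the Granger causality step described above, the following is a minimal sketch and not the authors' actual pipeline. It compares a regression of daily case counts on their own lags against one that also includes lagged sick-post counts, and summarizes the comparison with an F statistic. The function names (ols_rss, granger_f), the lag length, and the synthetic series are all assumptions made for the example; a real analysis would also check stationarity and convert the statistic to a p-value.

```python
import random


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit of y on the columns of X."""
    k = len(X[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        s = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - s) / M[i][i]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def granger_f(target, predictor, lag):
    """F statistic for whether lagged `predictor` improves a forecast of
    `target` beyond the target's own lags (hypothetical helper)."""
    n = len(target)
    y = target[lag:]
    # Restricted model: intercept plus the target's own lags.
    restricted = [[1.0] + [target[t - i] for i in range(1, lag + 1)]
                  for t in range(lag, n)]
    # Unrestricted model: additionally includes the predictor's lags.
    full = [row + [predictor[t - i] for i in range(1, lag + 1)]
            for row, t in zip(restricted, range(lag, n))]
    rss_r, rss_u = ols_rss(restricted, y), ols_rss(full, y)
    df_den = len(y) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / df_den)


# Synthetic data (an assumption for illustration only): "cases" respond to
# "sick posts" from two days earlier, plus noise.
rng = random.Random(0)
posts = [rng.gauss(0, 1) for _ in range(150)]
cases = [0.0, 0.0]
for t in range(2, 150):
    cases.append(0.5 * cases[t - 1] + 2.0 * posts[t - 2] + rng.gauss(0, 0.5))

f_forward = granger_f(cases, posts, lag=2)  # posts -> cases
f_reverse = granger_f(posts, cases, lag=2)  # cases -> posts
print(f_forward, f_reverse)
```

In this toy setup the forward statistic should be much larger than the reverse one, which is the asymmetry a Granger test looks for. The same comparison could be run separately on post categories (sick posts versus other COVID-19 posts) and on regional subsets.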
Results:
We found that reports of symptoms and diagnosis of COVID-19 significantly predicted daily case counts, up to 14 days ahead of official statistics. Other COVID-19 posts did not have similar predictive power. For the subset of geotagged posts, we found that the predictive pattern held true for both Hubei province and the rest of mainland China, despite the unequal distribution of health care resources and differing outbreak timelines.
Conclusions:
Public social media data can be usefully harnessed to predict infection cases and inform timely responses. Researchers and disease control agencies should pay close attention to the social media infosphere regarding COVID-19. Beyond monitoring overall search and posting activities, using machine learning and theoretical understandings of information sharing behaviors to identify true disease signals is a promising way to improve the effectiveness of infoveillance.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.