Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 16, 2020
Open Peer Review Period: Apr 16, 2020 - Apr 23, 2020
Date Accepted: May 25, 2020
Date Submitted to PubMed: May 26, 2020
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Using Reports of Own and Others’ Symptoms and Diagnosis on Social Media to Predict COVID-19 Case Counts: Observational Study in Mainland China
ABSTRACT
Background:
COVID-19 has already affected more than 200 countries and territories worldwide. It poses an extraordinary challenge for public health systems because screening and surveillance capacity is often severely limited, especially at the beginning of an outbreak, which fuels the spread as many patients unknowingly infect others.
Objective:
We present an effort to collect and analyze COVID-19 related posts on the popular Twitter-like social media site in China, Weibo. To our knowledge, this is the first study that examines comprehensive and fine-grained social media data to predict COVID-19 case counts in mainland China.
Methods:
Using a comprehensive list of 167 keywords, we retrieved and analyzed more than 12 million COVID-19 related posts, from November 20, 2019 to March 3, 2020. We developed a machine learning classifier to identify “sick posts,” which are reports of one’s own and other people’s symptoms and diagnosis related to COVID-19. Using officially reported case counts as the outcome, we then modeled the predictive power of sick posts and other COVID-19 posts on daily case counts. For a subset of geotagged posts (2.85% of all retrieved posts), we also ran separate predictive models for Hubei province, the epicenter of the initial outbreak, and the rest of mainland China.
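As a rough illustration of the aggregation step implied above, the sketch below turns labeled posts into daily counts of sick posts and other posts. It assumes each post already carries a label from some classifier; the records, the label names "sick" and "other", and the helper `daily_counts` are hypothetical and are not taken from the study.

```python
from collections import Counter
from datetime import date

# Hypothetical, hand-made records: (post date, label assigned by a classifier).
# These are illustrative values only, not data from the study.
posts = [
    (date(2020, 1, 20), "sick"),
    (date(2020, 1, 20), "other"),
    (date(2020, 1, 21), "sick"),
    (date(2020, 1, 21), "sick"),
    (date(2020, 1, 21), "other"),
]


def daily_counts(records, label):
    """Count posts per day that carry the given label."""
    return Counter(day for day, lab in records if lab == label)


sick_per_day = daily_counts(posts, "sick")
other_per_day = daily_counts(posts, "other")
```

The two resulting series can then be aligned by date with officially reported case counts and used as separate predictors.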
Results:
We found that reports of symptoms and diagnosis of COVID-19 significantly predicted daily case counts up to seven days ahead of official statistics, whereas other COVID-19 posts did not have similar predictive power. For the subset of geotagged posts, the predictive pattern held for both Hubei province and the rest of mainland China, despite the unequal distribution of health care resources and the differing outbreak timelines.
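One simple way to see what "predicting up to seven days ahead" means is a lagged correlation: post counts on day t are compared with case counts on day t + k for each lag k. The sketch below is only an illustration of that idea, not the statistical model used in the study. The series are synthetic (the case series is built as a shifted, scaled copy of the post series), and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def lagged_correlation(predictor, outcome, lag):
    """Correlate predictor[t] with outcome[t + lag] for a non-negative lag."""
    if lag < 0 or lag >= len(predictor):
        raise ValueError("lag must be in [0, len(predictor))")
    if lag == 0:
        return pearson(predictor, outcome)
    return pearson(predictor[:-lag], outcome[lag:])


# Synthetic daily sick-post counts (made-up numbers, not study data).
sick_posts = [1, 3, 2, 6, 9, 7, 12, 15, 11, 18, 20, 17]

# Synthetic case counts constructed to trail the posts by exactly 3 days.
true_lag = 3
cases = [0] * true_lag + [2 * x for x in sick_posts[:-true_lag]]

# Scan lags of 0 to 7 days and keep the one with the highest correlation.
best_lag = max(range(0, 8), key=lambda k: lagged_correlation(sick_posts, cases, k))
```

Because the synthetic cases are constructed from the posts, the scan recovers the built-in lag. With real data the correlations would be noisier, and a formal time-series test would be needed before claiming predictive power.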
Conclusions:
Public social media data can be usefully harnessed to predict infection cases and inform timely responses. Researchers and disease control agencies should pay close attention to the social media infosphere regarding COVID-19. Beyond monitoring overall search and posting activity, it is crucial to sift through the content and efficiently distinguish true signals from noise.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.