Accepted for/Published in: JMIR Public Health and Surveillance

Date Submitted: Jul 23, 2021
Open Peer Review Period: Jul 23, 2021 - Sep 17, 2021
Date Accepted: Oct 13, 2021

The final, peer-reviewed published version of this preprint can be found here:

Physical Activity, Sedentary Behavior, and Sleep on Twitter: Multicountry and Fully Labeled Public Data Set for Digital Public Health Surveillance Research

Shakeri Hossein Abad Z, Butler GP, Thompson W, Lee J

JMIR Public Health Surveill 2022;8(2):e32355

DOI: 10.2196/32355

PMID: 35156938

PMCID: 8887637

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Physical activity, sedentary behaviour, and sleep on Twitter: A multicountry and fully labelled dataset for public health surveillance research

  • Zahra Shakeri Hossein Abad; 
  • Gregory P. Butler; 
  • Wendy Thompson; 
  • Joon Lee

ABSTRACT

Background:

Advances in automated data processing and machine learning (ML) models, together with the unprecedented growth in the number of social media users who publicly share and discuss health-related information, have made public health surveillance (PHS) one of the most enduring applications of social media. However, existing PHS systems that rely on social media data have not been widely deployed in national surveillance systems, which appears to stem from practitioners’ and the public’s lack of trust in social media data. More robust and reliable datasets, on which supervised machine learning models can be trained and tested, are a significant step toward overcoming this hurdle.

Objective:

The health implications of daily behaviours, namely physical activity, sedentary behaviour, and sleep (PASS), are an evergreen topic in PHS and are widely studied through traditional data sources such as surveillance surveys and administrative databases. These sources are often several months out of date by the time they are used, are costly to collect, and are thus limited in quantity and coverage. In this paper, we present LPHEADA, a multicountry and fully Labelled digital Public HEAlth DAtaset of tweets originating in Australia, Canada, the United Kingdom (UK), or the United States (US).

Methods:

We collected the data for this study from Twitter using the Twitter livestream application programming interface (API) between 28th November 2018 and 19th June 2020. To obtain PASS-related tweets for manual annotation, we iteratively used regular expressions, unsupervised natural language processing, domain-specific ontologies, and linguistic analysis. We used Amazon Mechanical Turk (AMT) to label the collected data with self-reported PASS categories and implemented a quality control pipeline to monitor and manage the validity of crowd-generated labels. Moreover, we used ML, latent semantic analysis, linguistic analysis, and label inference analysis to validate different components of the dataset.
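As a rough illustration of the regular-expression step mentioned in the Methods, the sketch below flags tweets that may be PASS-related before any manual annotation. The keyword patterns, the category names, and the function `candidate_categories` are hypothetical and chosen only for this example; the abstract does not list the expressions the authors actually used.

```python
import re
from typing import List

# Hypothetical keyword patterns, one per PASS category.
# These are illustrative assumptions, not the expressions used to build LPHEADA.
PASS_PATTERNS = {
    "physical_activity": re.compile(
        r"\b(jogg(?:ing|ed)|went for a run|workout|gym|hik(?:e|ed|ing))\b",
        re.IGNORECASE,
    ),
    "sedentary_behaviour": re.compile(
        r"\b(binge[- ]watch(?:ed|ing)?|sitting all day|watching tv|netflix)\b",
        re.IGNORECASE,
    ),
    "sleep": re.compile(
        r"\b(insomnia|can'?t sleep|slept|nap(?:ped|ping)?|sleepless)\b",
        re.IGNORECASE,
    ),
}


def candidate_categories(text: str) -> List[str]:
    """Return the PASS categories whose pattern matches the tweet text."""
    return [name for name, pattern in PASS_PATTERNS.items() if pattern.search(text)]
```

A keyword filter of this kind only narrows the pool of candidates. Whether a tweet is a genuine self-report still has to be decided by annotators, which is the role of the crowdsourced labels in the dataset.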

Results:

LPHEADA contains 366,405 crowd-generated labels (three labels per tweet) for 122,135 PASS-related tweets, labelled by 708 unique annotators on AMT. In addition to crowd-generated labels, LPHEADA provides details about the three critical components of any PHS system: place, time, and demographics (gender, age range) associated with each tweet.
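With three crowd-generated labels per tweet, one simple way to derive a single label is a strict majority vote. The sketch below assumes this aggregation rule for illustration only; the abstract mentions label inference analysis but does not specify the method, and the label strings and the function `majority_label` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require more than half of the votes; a three-way split yields None.
    return label if count > len(labels) / 2 else None


# The reported totals are consistent with three labels per tweet.
assert 3 * 122_135 == 366_405
```

Tweets for which `majority_label` returns `None` have no agreement among annotators, so a downstream user might exclude them or treat them as ambiguous.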

Conclusions:

Publicly available datasets for digital PASS surveillance are usually isolated and only provide labels for small subsets of the data. We believe that the novelty and comprehensiveness of the dataset provided in this study will help develop, evaluate, and deploy digital PASS surveillance systems. LPHEADA will be an invaluable resource for both public health researchers and practitioners.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.