Accepted for/Published in: JMIR Formative Research
Date Submitted: Dec 17, 2021
Date Accepted: May 20, 2022
Strategies and lessons learned during data cleaning of a cross-sectional web-based health behavior survey study conducted among research panel participants
ABSTRACT
Background:
The use of web-based methods to collect population-based health behavior data has burgeoned over the past few decades. Researchers have used web-based platforms and research panels to study a myriad of topics. Data cleaning prior to web-based survey data analysis is an important step for data integrity. However, the data cleaning processes utilized by research teams are often not reported.
Objective:
The objectives of this manuscript are to describe a systematic approach to cleaning data collected from panelists via a web-based platform and to share lessons learned with other research teams to promote high-quality data cleaning processes.
Methods:
Data for this web-based survey study were collected from a research panel that is available for scientific and marketing research. Participants (N=4,000) were panelists recruited directly and/or through verified partners of the research panel, aged 18-45, living in the United States, with English-language proficiency and access to the internet. Eligible participants completed a health behavior survey via Qualtrics. Prior to conducting statistical analyses, and informed by recommendations from the literature, our interdisciplinary research team developed and implemented a systematic, sequential data cleaning plan. The plan included: 1) reviewing survey completion speed, 2) identifying consecutive responses, 3) identifying cases with contradictory responses, and 4) assessing the quality of open-ended responses. Implementation of these strategies is described in detail, and a Checklist for E-Survey Data Integrity (CESDI) is offered as a tool for other investigators.
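The four sequential checks listed above can be sketched in code. The following Python example is a minimal illustration only, not the authors' implementation. The 10-minute threshold comes from the Results; everything else (the `Record` fields, the `MAX_RUN` cutoff for identical consecutive answers, the `ever_used` / `days_used_past_30` item pair used for the contradiction check, and the text heuristic for open-ended responses) is a hypothetical assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

MIN_MINUTES = 10   # speed threshold reported in the Results (<10 minutes)
MAX_RUN = 10       # ASSUMPTION: run length of identical answers treated as suspect


@dataclass
class Record:
    """One survey response. All field names are hypothetical."""
    record_id: str
    minutes: float            # time taken to complete the survey
    likert: List[int]         # answers to a block of scale items, in order
    ever_used: bool           # ASSUMPTION: example screening item
    days_used_past_30: int    # ASSUMPTION: example follow-up item
    open_text: str            # answer to an open-ended question


def longest_run(answers: List[int]) -> int:
    """Length of the longest streak of identical consecutive answers."""
    best = run = 0
    prev = object()  # sentinel that equals no real answer
    for a in answers:
        run = run + 1 if a == prev else 1
        best = max(best, run)
        prev = a
    return best


def first_failed_check(r: Record) -> Optional[str]:
    """Apply the four checks in order; return the first one failed, else None."""
    if r.minutes < MIN_MINUTES:
        return "speed"
    if longest_run(r.likert) >= MAX_RUN:
        return "consecutive"
    # ASSUMPTION: "never used" combined with a nonzero frequency is contradictory.
    if not r.ever_used and r.days_used_past_30 > 0:
        return "contradictory"
    # ASSUMPTION: very short or letter-free text counts as poor quality.
    text = r.open_text.strip()
    if len(text) < 3 or not any(ch.isalpha() for ch in text):
        return "open_ended"
    return None


def clean(records: List[Record]) -> Tuple[List[Record], Dict[str, int]]:
    """Split records into kept ones and a count of removals per check."""
    kept: List[Record] = []
    removed: Dict[str, int] = {}
    for r in records:
        reason = first_failed_check(r)
        if reason is None:
            kept.append(r)
        else:
            removed[reason] = removed.get(reason, 0) + 1
    return kept, removed
```

Because the checks run in a fixed order, each record is counted under the first check it fails, which mirrors a sequential removal process in which later steps only see records that survived earlier ones. In practice, the open-ended review in particular is likely to need human judgment rather than a simple text heuristic.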
Results:
Data cleaning procedures resulted in the removal of 1,278 (32%) response records that failed one or more data quality checks. First, approximately one-sixth of records (n=648) were removed because the survey was completed unrealistically quickly (<10 minutes). Next, about 7% (n=292) of records were removed because they contained evidence of consecutive responses. Then, about 5% (n=187) of records were removed because they contained conflicting responses. Finally, less than 5% of records (n=151) were removed because of poor-quality open-ended responses. Thus, after these data cleaning steps, the final sample contained 2,722 responses.
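The reported counts can be cross-checked with simple arithmetic. The short Python sketch below uses only the figures stated above; the dictionary keys are illustrative labels, and computing each share against the original 4,000 records is an assumption that appears consistent with the approximate percentages reported.

```python
TOTAL = 4000  # records collected (N=4,000 in the Methods)

# Removal counts per step, as reported in the Results (labels are illustrative).
removed = {
    "speed": 648,
    "consecutive": 292,
    "contradictory": 187,
    "open_ended": 151,
}

total_removed = sum(removed.values())   # records removed across all four steps
final_n = TOTAL - total_removed         # records remaining for analysis

# ASSUMPTION: each share is expressed relative to the original 4,000 records.
shares = {step: 100 * n / TOTAL for step, n in removed.items()}

print(total_removed, final_n)
for step, pct in shares.items():
    print(f"{step}: {pct:.1f}%")
```

The four step counts sum to the 1,278 removed records, and subtracting that from 4,000 gives the final sample of 2,722.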
Conclusions:
Examining data integrity and promoting transparency in the reporting of data cleaning are imperative for web-based survey research. Ensuring high-quality data both prior to and following data collection is important. Our systematic approach helped eliminate records flagged as being of questionable quality. Data cleaning and management procedures should be reported more frequently, and systematic approaches should be adopted as standards of good practice in this type of research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.