Accepted for/Published in: Interactive Journal of Medical Research
Date Submitted: Nov 15, 2022
Date Accepted: May 9, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Normal Workflow and Key Strategies for Data Cleaning: Towards Real-World Data
ABSTRACT
Real-world research inevitably generates "dirty data", which can seriously impair data utilization and the quality of decision-making. Data cleaning is a critical method for improving data quality. However, the current literature on real-world research provides little guidance on how to set up and carry out data cleaning both efficiently and ethically. To address this issue, we propose a data cleaning framework for real-world research that focuses on the three most common types of "dirty data" (duplicate data, missing data, and outlier data), together with a normal workflow for data cleaning, to provide a reference for applying such techniques in future studies.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.