Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 23, 2025
Date Accepted: Jan 28, 2026
Beyond Missingness: Systematizing Methods for Comprehensive Data Fitness Assessment in Clinical Research
ABSTRACT
Background:
Secondary use of clinical data offers an unprecedented opportunity to rapidly conduct large-scale research and improve patient care. However, incomplete understanding of a study's data quality requirements often causes significant delays in executing analyses and validating results. While multi-institutional networks have developed general data quality programs, these are closely tied to network characteristics and do not address many study-specific requirements. Instead, most investigators conduct per-study exploratory analyses, but these efforts are typically ad hoc and only partially reported, which can hinder reproducible science and delay advances in patient care.
Objective:
The objectives of this study were to develop a comprehensive yet extensible model that guides study-specific data quality assessment (SSDQA) and to construct reusable data quality modules, based on the model, that can be widely disseminated and adopted.
Methods:
We reviewed the current literature and studies, interviewed experts, and convened a group of scientists 14 times over 24 months to develop a strategy for effective and generalizable SSDQA. From this, we built a model that provides guidance for improved SSDQA design and implementation, identifying reusable components and software development best practices. We created a set of executable data quality modules addressing multiple check requirements, packaged for dissemination using the R programming language. Modules include tabular output as well as high-density data displays to improve the feasibility of more complete SSDQA.
Results:
The data quality model integrates theoretical principles of data quality testing with pragmatic considerations of application to clinical data. Eighteen data quality modules were specified, incorporating a wide range of theory-based check categories and practice-based check types. The nine specifications covering the most common SSDQA requirements were developed as reusable software packages. Study-specific user decisions for application of a module include: (a) target data element(s), (b) single- or multi-source analyses, (c) exploratory output or statistical anomaly detection, and (d) longitudinal or cross-sectional analyses. Consistent visualizations of results are provided across multiple checks, and tabular results are available for archiving or further computation. We demonstrate that application of SSDQA modules to a cohort of pediatric patients diagnosed with sickle cell disease, drawn from a multisite clinical research network, produces informative results about data characteristics that may create analytic risk.
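To make the four study-specific decisions concrete, the sketch below shows how they might map onto module parameters. This is a minimal illustration in Python, not the published R packages: the names `DQModuleConfig` and `run_dq_module`, and all parameter names, are assumptions introduced for this example only.

```python
from dataclasses import dataclass

@dataclass
class DQModuleConfig:
    """Hypothetical configuration mirroring the four study-specific decisions."""
    target_elements: list            # (a) target data element(s) to check
    multi_site: bool = False         # (b) single- vs multi-source analysis
    anomaly_detection: bool = False  # (c) exploratory output vs statistical anomaly detection
    longitudinal: bool = False       # (d) cross-sectional vs longitudinal analysis

def run_dq_module(config: DQModuleConfig) -> str:
    """Illustrative dispatcher: returns a label describing the selected analysis mode."""
    mode = [
        "multi-site" if config.multi_site else "single-site",
        "anomaly" if config.anomaly_detection else "exploratory",
        "longitudinal" if config.longitudinal else "cross-sectional",
    ]
    return " / ".join(mode)

cfg = DQModuleConfig(target_elements=["hemoglobin"], multi_site=True)
print(run_dq_module(cfg))  # multi-site / exploratory / cross-sectional
```

The point of the sketch is that each decision is an orthogonal switch, so one module specification can serve many study designs without code changes.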
Conclusions:
The study-specific data quality model builds on current practice, providing guidance for more complete and sound assessments. Reusable modules implementing model guidance can be readily deployed in large networks or smaller consortia, and provide detailed reporting of study-specific data quality in reproducible form. This flexibility allows investigators to tailor checks to their study requirements while retaining consistency of DQA methods. The approach fosters collaboration to identify data quality issues that can inform decisions about study design, as well as providing important context bearing on adoption of results.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.