Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Apr 23, 2025
Date Accepted: Jan 28, 2026

The final, peer-reviewed published version of this preprint can be found here:

Beyond Missingness: Systematizing Methods for Comprehensive Data Fitness Assessment in Clinical Research

Razzaghi H, Wieand K, Dickinson KL, Kahn MG, Roy J, Blacketer C, Christakis DA, Forrest CB, Greenberg J, Lehmann HP, Marsolo KA, Sciolla J, Weiner MG, Weiskopf NG, Bailey LC

J Med Internet Res 2026;28:e76398

DOI: 10.2196/76398

PMID: 41980192

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Comprehensive Fitness Assessment for Clinical Data: Standards Development and Reusable Tools

  • Hanieh Razzaghi; 
  • Kaleigh Wieand; 
  • Kimberley L. Dickinson; 
  • Michael G. Kahn; 
  • Jason Roy; 
  • Clair Blacketer; 
  • Dmitri A. Christakis; 
  • Christopher B. Forrest; 
  • Jane Greenberg; 
  • Harold P. Lehmann; 
  • Keith A. Marsolo; 
  • Jennifer Sciolla; 
  • Mark G. Weiner; 
  • Nicole G. Weiskopf; 
  • L. Charles Bailey

ABSTRACT

Background:

Secondary use of clinical data offers an unprecedented opportunity to conduct large-scale research rapidly and improve patient care. However, an incomplete understanding of a study's data quality requirements often causes significant delays in executing analyses and validating results. While multi-institutional networks have developed general data quality programs, these programs are closely tied to network characteristics and do not address many study-specific requirements. As a result, most investigators conduct per-study exploratory analyses, but these efforts are typically ad hoc and only partially reported, which can hinder reproducible science and delay advances in patient care.

Objective:

The objectives of this study were to develop a comprehensive yet extensible model that guides study-specific data quality assessment (SSDQA) and to construct reusable data quality modules, based on the model, that can be widely disseminated and adopted.

Methods:

We reviewed current literature and studies, interviewed experts, and convened a group of scientists 14 times over 24 months to develop a strategy for effective and generalizable SSDQA. From this work, we built a model that provides guidance for improved SSDQA design and implementation and identifies reusable components and software development best practices. We created a set of executable data quality modules addressing multiple check requirements and packaged them for dissemination using the R programming language. Modules produce tabular output as well as high-density data displays to improve the feasibility of more complete SSDQA.

Results:

The data quality model integrates theoretical principles of data quality testing with pragmatic considerations of application to clinical data. Eighteen data quality modules were specified, incorporating a wide range of theory-based check categories and practice-based check types. The nine specifications covering the most common SSDQA requirements were developed as reusable software packages. Study-specific user decisions for application of a module include (a) target data element(s), (b) single- or multisource analyses, (c) exploratory output or statistical anomaly detection, and (d) longitudinal or cross-sectional analyses. Consistent visualizations of results are provided across multiple checks, and tabular results are available for archiving or further computation. We demonstrate that applying SSDQA modules to a cohort of pediatric patients diagnosed with sickle cell disease, drawn from a multisite clinical research network, produces informative results about data characteristics that may create analytic risk.
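
The four user decisions (a) through (d) can be pictured as a small configuration object passed to a check. The following is only an illustrative sketch in Python, not the authors' R packages or their interface: the names `CheckConfig`, `run_check`, `make_rows`, `has_lab`, and `max_deviation` are hypothetical, the input rows are synthetic, and the anomaly rule (distance from the median proportion) is a stand-in chosen for brevity rather than the statistical method used in the modules.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class CheckConfig:
    """Hypothetical container for the four study-specific decisions."""
    target_element: str          # (a) data element to assess
    multi_source: bool           # (b) compare sites, or pool them as one source
    anomaly_detection: bool      # (c) flag outliers, or only summarize
    longitudinal: bool           # (d) stratify by year, or pool all years
    max_deviation: float = 0.25  # assumed threshold for the toy anomaly rule


def run_check(rows, config):
    """Return the proportion of records with the target element per group."""
    counts = {}
    for row in rows:
        site = row["site"] if config.multi_source else "all_sites"
        period = row["year"] if config.longitudinal else "all_years"
        present, total = counts.get((site, period), (0, 0))
        counts[(site, period)] = (present + bool(row[config.target_element]),
                                  total + 1)

    results = [
        {"site": site, "period": period, "proportion": present / total}
        for (site, period), (present, total) in sorted(counts.items())
    ]
    if config.anomaly_detection:
        # Toy rule: flag groups far from the median proportion across groups.
        center = median(r["proportion"] for r in results)
        for r in results:
            r["flagged"] = abs(r["proportion"] - center) > config.max_deviation
    return results


def make_rows(site, year, flags):
    """Build synthetic records; one record per boolean in `flags`."""
    return [{"site": site, "year": year, "has_lab": f} for f in flags]


# Synthetic example data: three sites, one year.
rows = (
    make_rows("A", 2020, [True, True, True, False])
    + make_rows("B", 2020, [True, True, True, True])
    + make_rows("C", 2020, [False, False, False, True])
)

# Multisource, cross-sectional, with anomaly detection.
by_site = run_check(rows, CheckConfig("has_lab", True, True, False))
# Single-source, cross-sectional, exploratory output only.
pooled = run_check(rows, CheckConfig("has_lab", False, False, False))
```

Keeping the decisions in one immutable object means the same check logic can be rerun under different study requirements, and the returned list of records is plain tabular data that can be archived or passed to a plotting step.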

Conclusions:

The study-specific data quality model builds on current practice, providing guidance for more complete and sound assessments. Reusable modules implementing model guidance can be readily deployed in large networks or smaller consortia and provide detailed reporting of study-specific data quality in reproducible form. This flexibility allows investigators to tailor checks to their study requirements while retaining consistency of data quality assessment methods. This approach fosters collaboration to identify data quality issues that can inform decisions about study design and provide important context bearing on the adoption of results.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.