Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 31, 2025
Date Accepted: Oct 6, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessing Data Quality in Heterogeneous Healthcare Integration: The AIDAVA Framework
ABSTRACT
Background:
Integrated health data is foundational for secondary use, research, and policy making. However, data quality issues – such as missing values and inconsistencies – are common due to the heterogeneity of health data sources. Existing frameworks often apply static, one-time assessments, limiting their ability to address quality problems across evolving data pipelines.
Objective:
This study evaluates the AIDAVA data quality framework, which introduces dynamic, lifecycle-based validation of health data using knowledge graph technologies and SHACL-based rules. The framework is assessed for its ability to detect and manage data quality issues – specifically, completeness and consistency – during integration.
Methods:
Using the MIMIC-III dataset, we simulated real-world data quality challenges by introducing structured noise, including missing values and logical inconsistencies. The data was transformed into Source Knowledge Graphs (SKGs) and integrated into a unified Personal Health Knowledge Graph (PHKG). SHACL validation rules were applied iteratively during the integration process, and data quality was assessed under varying noise levels and integration orders.
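The noise-injection step described above can be illustrated with a minimal sketch. It assumes flat dictionary records rather than the knowledge graphs used in the study. The field names, the `inject_missing` and `completeness` helpers, and the 30% noise level are hypothetical and are not taken from the AIDAVA implementation.

```python
import random


def inject_missing(records, fields, rate, seed=0):
    """Return copies of the records where each listed field is blanked with probability `rate`."""
    rng = random.Random(seed)  # fixed seed so the simulated noise is reproducible
    noisy = []
    for rec in records:
        rec = dict(rec)  # copy, so the clean input stays untouched
        for field in fields:
            if rng.random() < rate:
                rec[field] = None
        noisy.append(rec)
    return noisy


def completeness(records, fields):
    """Share of (record, field) cells that hold a value."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for rec in records for f in fields if rec.get(f) is not None)
    return filled / total


# Hypothetical admission records, loosely modelled on a hospital stay table.
records = [
    {"patient_id": i, "admit_day": 10, "discharge_day": 15, "diagnosis": "I10"}
    for i in range(100)
]
checked_fields = ["discharge_day", "diagnosis"]

noisy = inject_missing(records, checked_fields, rate=0.3)
score = completeness(noisy, checked_fields)
print(f"completeness after noise injection: {score:.2f}")
```

Repeating the call with different `rate` values gives a simple analogue of assessing quality under varying noise levels.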
Results:
The AIDAVA framework effectively detected completeness and consistency issues across all scenarios. Completeness was shown to influence the interpretability of consistency scores, and domain-specific attributes (e.g., diagnoses, procedures) were more sensitive to integration order and data gaps.
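One way to read the interaction between completeness and consistency is that a consistency rule can only be evaluated on records where every field it references is present, so a consistency score computed over few evaluable records carries less information. The sketch below is illustrative only. The rule (discharge not before admission), the field names, and the `consistency_report` helper are assumptions, not the study's SHACL shapes or scoring method.

```python
def consistency_report(records):
    """Check discharge_day >= admit_day on records where both values are present."""
    evaluable = [
        r for r in records
        if r.get("admit_day") is not None and r.get("discharge_day") is not None
    ]
    violations = [r for r in evaluable if r["discharge_day"] < r["admit_day"]]
    # Coverage: share of records on which the rule could be checked at all.
    coverage = len(evaluable) / len(records) if records else 0.0
    # Consistency is undefined (None) when no record is evaluable.
    consistency = 1 - len(violations) / len(evaluable) if evaluable else None
    return {
        "coverage": coverage,
        "consistency": consistency,
        "violations": len(violations),
    }


records = [
    {"admit_day": 10, "discharge_day": 15},
    {"admit_day": 10, "discharge_day": 8},     # logical inconsistency
    {"admit_day": 10, "discharge_day": None},  # missing value: rule cannot be checked
    {"admit_day": None, "discharge_day": 12},  # missing value: rule cannot be checked
]
report = consistency_report(records)
print(report)
```

Reporting coverage next to the consistency score makes it visible when a score rests on only a fraction of the data.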
Conclusions:
AIDAVA supports dynamic, rule-based validation throughout the data lifecycle. By addressing both dimension-specific vulnerabilities and cross-dimensional effects, it lays the groundwork for scalable, high-quality health data integration. Future work should explore deployment in live clinical settings and expand to additional quality dimensions.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.