Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 23, 2025
Date Accepted: May 5, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Addressing Data Quality Challenges in OMOP CDM: A Case Study on Lung Cancer Data Mapping
ABSTRACT
Background:
The secondary use of health data is essential for advancing medical research and improving clinical practices. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) enables large-scale, multi-center studies but faces challenges in consistency, completeness, and transparency during data mapping from the original data sources.
Objective:
This study aimed to evaluate the quality of the mapping process for lung cancer data within the Federated Health Innovation Network (FHIN) project, focusing on consistency, completeness, and challenges encountered throughout the process.
Methods:
Clinical data from Ghent University Hospital was mapped to the OMOP CDM using a reference data dictionary. Consistency was assessed through Cohen’s kappa scores, while completeness was evaluated by comparing patient and record counts pre- and post-mapping. Challenges, including unstructured data and evolving reference standards, were documented and analysed.
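The two quality measures named above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline. It assumes that consistency is measured between two independent mappers who each assign a target concept to the same source values, and that completeness is the ratio of post-mapping to pre-mapping counts. The function names and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def completeness(pre_count, post_count):
    """Fraction of patients or records retained after mapping."""
    if pre_count <= 0:
        raise ValueError("pre_count must be positive")
    return post_count / pre_count


# Hypothetical labels: placeholder concept names, not real OMOP concept IDs.
mapper_1 = ["concept_A", "concept_A", "concept_B", "concept_B"]
mapper_2 = ["concept_A", "concept_A", "concept_B", "concept_C"]
kappa = cohens_kappa(mapper_1, mapper_2)

# Hypothetical counts: 200 source records, 190 retained after mapping.
retained = completeness(200, 190)
```

A kappa near 1 indicates agreement well beyond chance, while a completeness ratio below 1 flags records lost during mapping; both would be computed per variable in practice.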
Results:
High consistency was observed for structured variables, whereas some unstructured variables, such as “Smoking status”, were excluded because they were recorded as free text and lacked suitable OMOP concepts. Completeness analysis showed minimal data loss for most structured variables but substantial loss for unstructured data. Persistent issues included evolving data dictionary versions and mismatches in diagnostic code granularity between institutions, underscoring structural challenges in standardization.
Conclusions:
The transformation of lung cancer data to the OMOP CDM highlights both technical and systemic challenges, including handling unstructured data and addressing granularity discrepancies. A multidisciplinary approach involving clinical and technical expertise is crucial to ensure reliable, high-quality datasets for multi-center research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.