Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Metadata Needs Assessment for Data Reuse: Inventory and Concept Mapping Based on a Real-World Case Study
ABSTRACT
Background:
Longitudinal observational databases collect real-world data (RWD), registered by healthcare providers, and make these data available to researchers. Metadata, that is, data describing other data, are crucial to facilitate meaningful interpretation of such RWD. The FAIR Principles hold that data and metadata should be richly described with accurate, relevant attributes. Yet, data reuse is impeded by low-quality or absent metadata. Despite the existence of metadata frameworks that support data annotation, there is little insight into the actual metadata needed to interpret third-party data.
Objective:
The aim of the study was to gain such insight by exploring the metadata needs in a real-world study, reusing RWD from two organizations that collected general practitioner patient data. We started our real-world study with a specific research goal (ie, chronic kidney disease cohort identification) and identified what metadata were needed to reach this goal.
Methods:
The metadata elicitation process involved inventorying all metadata documentation available to the researchers, covering metadata fragments (eg, data dictionaries) and records of interactions (eg, email exchanges, meeting minutes). We compiled both the metadata required to understand the data and the related inquiries. After deduplicating and merging these items, we applied the stages of concept mapping to identify categories of metadata by creating cluster maps and to assess the perceived importance of the items.
Results:
A diverse group of 23 participants took part in the concept mapping. We identified 84 metadata items within 9 distinct clusters, including data collection, data processing, data quality, and data modelling. The variety of items and clusters illustrates the challenge of achieving a predefined metadata set. Most items (70/84) were rated on average as moderately important (3) to important (4) on a 5-point Likert scale. Categories concerning features that enable data interpretation were rated as more important (3.638 [SD 0.836] – 3.739 [SD 0.841]) than those focused on technical details (2.876 [SD 0.954] – 3.261 [SD 0.832]). Most items (79/84) and all categories are not domain-specific for the descriptive study. While existing frameworks offer relevant high-level metadata, they do not accommodate the detailed insights uncovered through our practice-based metadata elicitation.
Conclusions:
Our study shows that a practical use case requires an extensive set of metadata items, which is unlikely to be available upfront. However, as most of the required items are generic, they can be specified and made available on demand, resulting in an increasingly rich set of metadata. These results guide the further development of more in-depth metadata frameworks and of procedures for incremental specification of metadata.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.