Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Sep 17, 2019
Date Accepted: May 6, 2020
A Comprehensive Platform for Agile Clinical and Translational Data Warehousing
ABSTRACT
Background:
Modern data-driven medical research promises to provide new insights into the development and course of diseases and to enable novel methods of clinical decision support. Clinical and translational data warehouses are an important building block of infrastructures that provide the large datasets needed to realize this. These databases provide users with unified access to heterogeneous datasets and support use cases such as cohort selection, hypothesis generation and ad-hoc data analysis. They can also be used to implement distributed cross-institutional data analyses by representing data in common models using standard terminologies and ontologies.
Objective:
Often, different warehousing platforms are needed to support different use cases and different types of data. Moreover, to achieve an optimal data representation within the target systems, technical know-how as well as project-specific domain knowledge are needed when designing data transformation and loading processes. As a result, informaticians need to work in close cooperation with clinicians and researchers, with short feedback cycles. This is a challenging task, as the installation and maintenance of common warehousing platforms can be complex and time-consuming. Moreover, data loading typically requires significant effort in terms of data pre-processing, cleansing and restructuring. The work described in this article aimed to address these challenges.
Methods:
We have developed a (private) cloud infrastructure for managing instances of common biomedical data warehousing platforms, combined with a flexible and easy-to-use pipeline for data loading. The platform supports both i2b2 and tranSMART and it comes with built-in security and comprehensive documentation. The data loading pipeline is based on a declarative configuration paradigm, which enables the agile development of data import processes and the automation of a wide range of common data cleansing and preprocessing tasks.
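To illustrate what a declarative configuration paradigm for data loading can look like, the following is a minimal sketch in Python. It is not the platform's actual configuration format or code: the configuration keys (`source`, `steps`, `map`, `date_format`), the column names, and the `transform` helper are all hypothetical and were chosen only for illustration. The idea shown is that the configuration states what should happen to each source column, while a small generic interpreter carries out the cleansing.

```python
from datetime import datetime

# Hypothetical declarative import configuration (not the platform's real format):
# each target field names its source column and the cleansing rules to apply.
CONFIG = {
    "patient_id": {"source": "PatID", "steps": ["strip"]},
    "sex": {
        "source": "Sex",
        "steps": ["strip", "lower"],
        "map": {"m": "male", "f": "female"},
    },
    "admission": {
        "source": "AdmDate",
        "steps": ["strip"],
        "date_format": "%d.%m.%Y",
    },
}

# Registry of named cleansing steps that a configuration may refer to.
STEPS = {"strip": str.strip, "lower": str.lower}


def transform(row, config):
    """Apply the declarative rules in `config` to one raw source row (a dict of strings)."""
    out = {}
    for target, rule in config.items():
        value = row.get(rule["source"])
        # Treat missing or blank source values as missing data.
        if value is None or value.strip() == "":
            out[target] = None
            continue
        # Run the named cleansing steps in the order given.
        for name in rule.get("steps", []):
            value = STEPS[name](value)
        # Optionally recode values; unknown codes are passed through unchanged.
        if "map" in rule:
            value = rule["map"].get(value, value)
        # Optionally normalize dates to ISO 8601.
        if "date_format" in rule:
            value = datetime.strptime(value, rule["date_format"]).date().isoformat()
        out[target] = value
    return out


row = {"PatID": " 0042 ", "Sex": " M", "AdmDate": "06.05.2020"}
record = transform(row, CONFIG)
print(record)
```

In a sketch like this, changing how a column is cleansed or recoded only requires editing the configuration, which is one way such an approach can support short feedback cycles with end users.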
Results:
The described platform has successfully been used to support a wide range of projects, from which we present three in this paper: one in which we provided translational access to highly structured research data, one in which we supported clinician-scientists by providing them with an overview of longitudinal semi-structured clinical data, and one in which we loaded highly structured and standardized billing data to prepare a large-scale distributed study.
Conclusions:
Our platform significantly simplifies the management of data warehousing platforms and allows data to be loaded quickly in various representations. This enables the agile development of such solutions in close cooperation with end users. Both the cloud-based hosting infrastructure and the data loading pipeline are available to the community as open-source software.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.