Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Mar 26, 2025
Open Peer Review Period: Mar 26, 2025 - May 21, 2025
Date Accepted: May 28, 2025
Enhancing Clinical Data Infrastructure for AI Research: A Comparative Evaluation of Data Management Architectures
ABSTRACT
Background:
The rapid growth of clinical data, driven by digital technologies and high-resolution sensors, presents significant challenges for healthcare organisations aiming to support advanced AI research and improve patient care. Traditional data management approaches may struggle to handle the large, diverse and rapidly updating datasets prevalent in modern clinical environments.
Objective:
This study compares three clinical data management architectures: clinical data warehouses (cDWH), clinical data lakes (cDL) and clinical data lakehouses (cDLH). Their performance is analysed against the FAIR principles (Findable, Accessible, Interoperable, Reusable) and the Big Data 5 Vs (Volume, Variety, Velocity, Veracity, Value). The aim is to provide guidance on selecting an architecture that balances robust data governance with the flexibility required for advanced analytics.
Methods:
We developed a comprehensive analysis framework that integrates aspects of data governance with technical performance criteria. A rapid literature review was conducted to synthesise evidence from multiple studies, focusing on how each architecture manages large, heterogeneous and dynamically updating clinical data. The review assessed key dimensions such as scalability, real-time processing capabilities, metadata consistency, and the technical expertise required for implementation and maintenance.
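As an illustration only, the structure of such a framework can be sketched as a grid of architectures against criteria. The sketch below is not the authors' instrument: the class name `EvidenceGrid`, its methods, and the idea of storing free-text findings per cell are assumptions made for this example. Only the three architecture labels, the four FAIR principles and the five Vs come from the abstract.

```python
from dataclasses import dataclass, field
from itertools import product

# Labels taken from the abstract.
ARCHITECTURES = ("cDWH", "cDL", "cDLH")
FAIR = ("Findable", "Accessible", "Interoperable", "Reusable")
FIVE_VS = ("Volume", "Variety", "Velocity", "Veracity", "Value")
CRITERIA = FAIR + FIVE_VS


def _empty_cells():
    # One empty list of findings per (architecture, criterion) pair.
    return {key: [] for key in product(ARCHITECTURES, CRITERIA)}


@dataclass
class EvidenceGrid:
    """Hypothetical container for qualitative review findings (assumption)."""

    cells: dict = field(default_factory=_empty_cells)

    def add(self, architecture, criterion, note):
        # Reject anything outside the predefined grid.
        key = (architecture, criterion)
        if key not in self.cells:
            raise KeyError(f"unknown cell: {key}")
        self.cells[key].append(note)

    def gaps(self):
        # Cells for which no finding has been recorded yet.
        return [key for key, notes in self.cells.items() if not notes]


grid = EvidenceGrid()
grid.add("cDWH", "Velocity", "limited real-time processing")
print(len(grid.cells), "cells,", len(grid.gaps()), "still empty")
```

Keeping the findings as free text rather than numeric scores mirrors the qualitative nature of a rapid literature review; a scored rubric would need weights that the abstract does not specify.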
Results:
The results show that cDWHs offer strong data governance, stability and structured reporting, making them well suited for environments that require strict compliance and reliable analysis. However, they are limited in terms of real-time processing and scalability. In contrast, cDLs offer greater flexibility and cost-effective scalability for managing heterogeneous data types, although they may suffer from inconsistent metadata management and challenges in maintaining data quality. cDLHs combine the strengths of both approaches by supporting real-time data ingestion and structured querying; however, their hybrid nature requires high technical expertise and involves complex integration efforts.
Conclusions:
The optimal data management architecture for clinical applications depends on an organisation's specific needs, available resources, and strategic goals. Healthcare institutions need to weigh the trade-offs between robust data governance, operational flexibility and scalability to build future-proof infrastructures that support both clinical operations and AI research. Further research should focus on reducing the complexity of hybrid models and strengthening the integration of clinical standards to improve overall system reliability and ease of implementation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.