Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Mar 27, 2020
Date Accepted: Aug 23, 2020

The final, peer-reviewed published version of this preprint can be found here:

Brown AP, Randall SM

Secure Record Linkage of Large Health Data Sets: Evaluation of a Hybrid Cloud Model

JMIR Med Inform 2020;8(9):e18920

DOI: 10.2196/18920

PMID: 32965236

PMCID: 7542414

A hybrid cloud model for secure record linkage of large health datasets

  • Adrian P Brown
  • Sean M Randall

ABSTRACT

Background:

Linking administrative data across agencies makes it possible to investigate many health and social issues, with the potential to deliver significant public benefit. As the demand for data linkage increases, one of the main challenges will be ensuring that systems are scalable, because record-level linkage is computationally expensive. Despite the advantages of cloud computing, its use for linkage purposes is scarce, with data custodians assessing the storage of identifiable information on cloud infrastructure as high risk.

Objective:

This paper presents a model for record linkage that utilises cloud computing capabilities while assuring custodians that identifiable datasets remain secure and local. This new hybrid cloud model includes privacy-preserving record linkage techniques and container-based batch processing to satisfy its tenets.
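To make the idea of privacy-preserving record linkage concrete, the following is a minimal sketch of one commonly used technique: encoding name bigrams into a keyed Bloom filter on-premises and comparing only the encodings elsewhere. The abstract does not state which technique the authors used, so the choice of Bloom filters, the function names (`bigrams`, `encode`, `dice`), the filter size, and the number of hash functions are all assumptions made for illustration.

```python
import hashlib
import hmac


def bigrams(value: str) -> set:
    """Split a normalised, padded string into overlapping character pairs."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value: str, secret: bytes, size: int = 512, num_hashes: int = 4) -> int:
    """Encode a string as a Bloom filter, stored as an integer bit field.

    Hypothetical on-premises step: only this encoding, never the raw
    identifier or the secret key, would leave the custodian's environment.
    """
    bits = 0
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            position = int.from_bytes(digest[:8], "big") % size
            bits |= 1 << position
    return bits


def dice(a: int, b: int) -> float:
    """Dice coefficient of two encodings; needs no access to the original text."""
    total = bin(a).count("1") + bin(b).count("1")
    if total == 0:
        return 0.0
    return 2 * bin(a & b).count("1") / total


if __name__ == "__main__":
    key = b"illustrative-shared-secret"  # assumption: a key held only on-premises
    smith = encode("Smith", key)
    print(dice(smith, encode("Smyth", key)))  # similar names score higher
    print(dice(smith, encode("Jones", key)))  # dissimilar names score lower
```

In a design like this, approximate matching remains possible because similar strings share most of their bigrams and therefore most of their set bits, while the keyed hash means the encodings cannot be regenerated without the secret.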

Methods:

A model for data linkage that incorporates cloud computing was created based on a set of design principles that aim to maximise privacy while leveraging the capabilities of cloud computing. An evaluation of this model was then conducted with a prototype implementation using large synthetic datasets representative of administrative health data.

Results:

The cloud model keeps identifiers on-premises and uses privacy-preserved versions of these identifiers to run all linkage computation on cloud infrastructure. Our prototype used a managed container cluster in AWS to distribute the computation using existing linkage software. Although the cost of computation was relatively low, the use of existing software resulted in a processing overhead of approximately 35%.
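One way to picture how pairwise comparison work might be divided among container tasks is to cut each data set into fixed-size chunks and treat every pair of chunks as an independent job. The sketch below shows only that planning step. The abstract does not describe how the prototype partitioned its workload, so the `plan_jobs` function, its parameters, and the chunking strategy are hypothetical, and no AWS API is involved.

```python
from itertools import product


def ranges(n_records: int, chunk_size: int) -> list:
    """Return (start, end) index pairs covering n_records in fixed-size chunks."""
    return [
        (start, min(start + chunk_size, n_records))
        for start in range(0, n_records, chunk_size)
    ]


def plan_jobs(n_a: int, n_b: int, chunk_size: int) -> list:
    """Pair every chunk of data set A with every chunk of data set B.

    Each tuple (a_start, a_end, b_start, b_end) describes one independent
    comparison job that a single container could process.
    """
    return [
        (a0, a1, b0, b1)
        for (a0, a1), (b0, b1) in product(ranges(n_a, chunk_size), ranges(n_b, chunk_size))
    ]


if __name__ == "__main__":
    for job in plan_jobs(n_a=5, n_b=3, chunk_size=2):
        print(job)
```

Because the jobs share no state, they can run in any order and in parallel. Together they cover every record pair exactly once.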

Conclusions:

Further work is required to develop optimised algorithms for distributed matching. However, the result of our experimental evaluation shows the operational feasibility of such a model and the exciting opportunities for advancing analysis of linkage outputs.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.