Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 15, 2020
Date Accepted: Sep 7, 2020

The final, peer-reviewed published version of this preprint can be found here:

Yu YW, Weber GM

Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation

J Med Internet Res 2020;22(11):e18735

DOI: 10.2196/18735

PMID: 33141090

PMCID: 7671849

Federated queries of clinical data repositories: balancing accuracy and privacy

  • Yun William Yu
  • Griffin M Weber

ABSTRACT

Background:

Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of healthcare organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double-counts patients. Various methods, such as the use of trusted third parties or secure multi-party computation, have been proposed to “link” patient records across sites. However, they have large tradeoffs in accuracy and privacy, or they are not scalable to large networks.
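The double-counting problem described above can be illustrated with a small example. This is only a toy sketch: the site contents and the function names `naive_sum` and `distinct_count` are hypothetical, and the exact union shown here requires pooling patient identifiers, which is precisely what a privacy-preserving network cannot do.

```python
# Toy illustration (hypothetical data): three sites share some patients.
site_a = {"p1", "p2", "p3", "p4"}
site_b = {"p3", "p4", "p5"}
site_c = {"p1", "p5", "p6"}
sites = [site_a, site_b, site_c]


def naive_sum(site_sets):
    """Add the aggregate counts that each site would return on its own."""
    return sum(len(s) for s in site_sets)


def distinct_count(site_sets):
    """Count distinct patients across sites (needs the raw identifiers)."""
    return len(set().union(*site_sets))


print(naive_sum(sites))       # 10: patients seen at several sites are counted repeatedly
print(distinct_count(sites))  # 6: the true number of distinct patients
```

The gap between the two numbers grows with the amount of overlap between sites, which is why summing aggregate counts can substantially overestimate the result of a federated query.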

Objective:

To enable accurate estimates of the number of patients matching a federated query, while giving strong guarantees on the amount of protected medical information revealed.

Methods:

We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is “tunable” in that it allows networks to balance accuracy against privacy, and it is computationally efficient even for large networks. We built a user-friendly, free, open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compared the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with those of several existing techniques.
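To make the HyperLogLog component more concrete, the following is a minimal sketch of a plain HyperLogLog counter in Python. It is not the authors' implementation: it omits the masking and homomorphic encryption steps entirely, and the register count (`P = 10`), the use of SHA-256 as the hash, and the function names `hll_sketch`, `hll_merge`, and `hll_estimate` are all assumptions made for illustration. It shows only the property that makes the approach suitable for federated counting: each site can summarize its patients in a small array, and arrays from different sites can be merged without exchanging identifiers.

```python
import hashlib
import math

P = 10        # assumed number of index bits; more bits means higher accuracy
M = 1 << P    # number of registers (1024)


def _hash64(patient_id: str) -> int:
    """Map an identifier to a 64-bit integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_sketch(patient_ids):
    """Build the register array a single site would compute locally."""
    registers = [0] * M
    for pid in patient_ids:
        h = _hash64(pid)
        idx = h >> (64 - P)                      # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
        rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    return registers


def hll_merge(a, b):
    """Combine two sites' sketches: the element-wise max is the sketch of the union."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(registers):
    """Estimate the number of distinct identifiers summarized by the registers."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)           # small-range (linear counting) correction
    return raw


# Two hypothetical sites that share 2,000 of their patients.
site_a = [f"patient-{i}" for i in range(6000)]
site_b = [f"patient-{i}" for i in range(4000, 10000)]
merged = hll_merge(hll_sketch(site_a), hll_sketch(site_b))
print(round(hll_estimate(merged)))  # an estimate of the 10,000 distinct patients
```

With `M` registers, the standard error of a HyperLogLog estimate is roughly 1.04 divided by the square root of `M`, about 3% in this sketch. Changing `P` trades sketch size for accuracy, which is one way such a method could be tuned; how the published method balances this against privacy is not specified in the abstract.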

Results:

In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts, while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query run time. This was orders of magnitude better than other approaches that guarantee the exact answer.

Conclusions:

Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.

Clinical Trial:

N/A



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.