Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Oct 12, 2020
Date Accepted: Apr 11, 2021
Evaluation of Privacy Risks of Personal Health Identifier and Quasi-Identifier in Distributed Research Network: A Development and Validation Study
ABSTRACT
Background:
Privacy must be protected in medical data, which contain patient information. A distributed research network (DRN) is one approach to protecting privacy while encouraging multi-institutional clinical research. A DRN standardizes multi-institutional data into a common structure and terminology, called a common data model (CDM), and shares only analysis results. It is nevertheless necessary to measure how well a DRN protects patient privacy in practice, even though raw data are not shared.
Objective:
This study aims to quantify the privacy risk of a DRN by comparing different de-identification levels focusing on a personal health identifier (PHI) and quasi-identifier (QI).
Methods:
We identified the PHIs and QIs in an Observational Medical Outcomes Partnership (OMOP) CDM that threaten privacy, based on the 18 identifiers of the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and previous studies. The Synthetic Public Use File 5 Percent (SynPUF5PCT), a public dataset in the OMOP CDM format, was used to measure privacy risk for 16 target PHIs and 12 target QIs. To compare how privacy risk differs by privacy policy, we compared the risk between a limited dataset and a safe harbor dataset derived from the SynPUF5PCT. The privacy risk of the target PHIs and QIs was measured in both datasets using the minimum cell size and equivalence class methods.
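The two measures named above can be sketched concretely: an equivalence class is a group of records sharing the same values on the target attributes, and the minimum cell size criterion flags records in classes at or below a threshold as at risk. The following is a minimal illustrative sketch, not the authors' actual pipeline; the record fields and the `risk_at_cell_size` helper are hypothetical names chosen for illustration.

```python
from collections import Counter

def equivalence_classes(records, qi_fields):
    """Group records by their quasi-identifier values;
    each distinct combination forms one equivalence class."""
    return Counter(tuple(r[f] for f in qi_fields) for r in records)

def risk_at_cell_size(records, qi_fields, k=1):
    """Percentage of records falling in equivalence classes of size <= k.
    With k=1, a record is uniquely distinguishable by its QI values."""
    classes = equivalence_classes(records, qi_fields)
    at_risk = sum(size for size in classes.values() if size <= k)
    return 100.0 * at_risk / len(records)

# Hypothetical toy records with two QI fields
records = [
    {"year_of_birth": 1950, "gender": "F"},
    {"year_of_birth": 1950, "gender": "F"},
    {"year_of_birth": 1962, "gender": "M"},
    {"year_of_birth": 1974, "gender": "F"},
]
print(risk_at_cell_size(records, ["year_of_birth", "gender"], k=1))  # 50.0
```

Here two of the four records are unique on (year of birth, gender), so the risk at a minimum cell size of one is 50%; the maximum equivalence class size, another statistic reported in the Results, is simply the largest count in `equivalence_classes`.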
Results:
Compared with the limited dataset, the safe harbor dataset reduced privacy risk by an average of 31.448% for PHIs and 73.798% for QIs at a minimum cell size of one, that is, for records uniquely distinguishable within a set of records with common attributes. Among the PHIs, the National Provider Identifier (NPI) variable showed the largest reduction, 71.236% (from 71.244% to 0.007%), and the date of death variable the smallest, 11.428% (from 98.787% to 87.359%). Furthermore, the maximum equivalence class size, the size of the largest indistinguishable set of records with common attributes, increased by 771 on average. Among the QIs, minimum cell sizes of 1–5, each representing a group of indistinguishable records, reduced privacy risk by an average of 62.796% in the safe harbor dataset. In particular, the death scenario was reduced the most, by 99.212%, and the diagnosis scenario the least, by 29.869%, at a minimum cell size of one.
Conclusions:
In this study, we quantified and verified the privacy risk that PHIs and QIs pose to patient information in a DRN, and confirmed that PHIs and QIs that increase privacy risk exist in the DRN. Although this study used only a limited set of PHIs and QIs for verification, the privacy limitations identified here could serve as a quality measure for de-identification, thereby increasing DRN safety.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.