Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 30, 2023
Open Peer Review Period: May 29, 2023 - Jul 24, 2023
Date Accepted: Feb 13, 2024
The Costs of Anonymization: A Case Study Using Clinical Data
ABSTRACT
Background:
Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns can be addressed through the application of privacy-enhancing technologies, such as anonymization, whereby data is altered so that it is no longer reasonably related to a person. Yet such alterations can influence the dataset's statistical properties. Hence, there is a privacy-utility trade-off that must be considered.
Objective:
The goal of this study is to comprehensively evaluate the privacy-utility trade-off of anonymized data in a real-world application using data and scientific results from the German Chronic Kidney Disease (GCKD) study.
Methods:
The GCKD dataset extract for this study consists of 5,217 records and 70 variables. We followed a two-step procedure to determine variables with re-identification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. We then transformed the data via generalization and suppression, and varied the anonymization process using a generic and a use case-specific configuration. To assess the utility of the anonymized GCKD data, we applied general-purpose metrics representing data granularity and entropy, as well as the reproducibility of a previously published analysis. Reproducibility was assessed by measuring the overlap of the 95% confidence interval (CI) lengths between anonymized and original results. The 95% CI overlap was assessed at the individual estimate level and aggregated to the table and dataset levels by averaging.
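The overlap computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the exact overlap formula, so the sketch assumes a common definition in which the shared length of the two intervals is divided by the length of each interval and the two ratios are averaged, with non-overlapping intervals scored as 0. The function names `ci_overlap` and `mean_overlap` and all numbers are hypothetical.

```python
from statistics import mean


def ci_overlap(original, anonymized):
    """Overlap of two confidence intervals as a fraction in [0, 1].

    Assumed definition (not taken from the paper): the shared length is
    expressed relative to each interval's own length, and the two ratios
    are averaged. Disjoint intervals are scored as 0.
    """
    lo_o, hi_o = original
    lo_a, hi_a = anonymized
    shared = min(hi_o, hi_a) - max(lo_o, lo_a)
    if shared <= 0:
        return 0.0  # the intervals do not overlap
    return 0.5 * (shared / (hi_o - lo_o) + shared / (hi_a - lo_a))


def mean_overlap(pairs):
    """Average the estimate-level overlap over (original, anonymized) pairs."""
    return mean(ci_overlap(o, a) for o, a in pairs)


# Hypothetical 95% CIs for the estimates of two result tables.
table_1 = [((0.0, 1.0), (0.0, 1.0)), ((0.0, 2.0), (1.0, 3.0))]
table_2 = [((1.0, 2.0), (3.0, 4.0))]

# Table-level values are averages of estimate-level overlaps; the
# dataset-level value here is the average of the table-level values.
table_levels = [mean_overlap(table_1), mean_overlap(table_2)]
dataset_level = mean(table_levels)
```

Under this assumed definition, identical intervals yield an overlap of 1 and disjoint intervals yield 0, so an average above 0.9 means the anonymized estimates' intervals largely coincide with the original ones.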
Results:
We observed a higher utility in terms of the 95% CI overlap than according to general-purpose metrics. For example, granularity varied between 68.2% and 87.6% and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. At the individual estimate level, a non-overlapping 95% CI was detected six times across all analyses, but the overwhelming majority of estimates exhibited an overlap of more than 50%. The use case-specific configuration outperformed the generic configuration in terms of replicating scientific results at the same level of privacy.
Conclusions:
The benefits of use case-specific anonymization with preserved utility in the GCKD application indicate that anonymization can be highly context-specific. Anonymization processes that are tailored to specific anticipated use cases may, more generally, be a good tool to overcome the privacy-utility trade-off and can result in data from which reliable evidence is more likely to be generated.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.