JMIR Preprints #49445: The Costs of Anonymization: A Case Study Using Clinical Data

The Costs of Anonymization: A Case Study Using Clinical Data

Lisa Pilgram;
Thierry Meurers;
Bradley Malin;
GCKD Investigators;
Elke Schaeffner;
Kai-Uwe Eckardt;
Fabian Prasser

ABSTRACT

Background:

Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns can be addressed through the application of privacy-enhancing technologies, such as anonymization, whereby data is altered so that it is no longer reasonably related to a person. Yet such alterations have the potential to influence the dataset’s statistical properties, so there is a privacy-utility trade-off that must be considered.

Objective:

The goal of this study is to comprehensively evaluate the privacy-utility trade-off of anonymized data in a real-world application using data and scientific results from the German Chronic Kidney Disease (GCKD) study.

Methods:

The GCKD dataset extract for this study consists of 5,217 records and 70 variables. We followed a two-step procedure to determine variables with re-identification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. We then transformed the data via generalization and suppression, and varied the anonymization process using a generic and a use case-specific configuration. To assess the utility of the anonymized GCKD data, we applied general-purpose metrics representing data granularity and entropy, as well as the reproducibility of a previously published analysis. Reproducibility was assessed by measuring the overlap of the 95% confidence interval (CI) lengths between anonymized and original results. The 95% CI overlap was assessed at the individual estimate level and aggregated to the table and dataset levels by averaging.
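As an illustration of the reproducibility measure described above, the following is a minimal sketch of a CI overlap computation. The abstract does not state the exact formula, so the definition used here is an assumption: the shared length of the two intervals, divided by each interval's own length, with the two ratios averaged. Clamping non-overlapping intervals to 0 is also an assumption. The function names `ci_overlap` and `average_overlap` are hypothetical and do not come from the study.

```python
from statistics import mean


def ci_overlap(orig, anon):
    """Relative overlap of two confidence intervals, between 0 and 1.

    Assumed definition (not taken from the abstract): the length shared
    by both intervals is divided by each interval's own length, and the
    two ratios are averaged. Disjoint intervals are clamped to 0.
    """
    lo_o, hi_o = orig
    lo_a, hi_a = anon
    if hi_o <= lo_o or hi_a <= lo_a:
        raise ValueError("each interval needs lower < upper")
    # Length of the region covered by both intervals (<= 0 if disjoint).
    shared = min(hi_o, hi_a) - max(lo_o, lo_a)
    if shared <= 0:
        return 0.0
    return 0.5 * (shared / (hi_o - lo_o) + shared / (hi_a - lo_a))


def average_overlap(pairs):
    """Average the estimate-level overlaps, e.g. over one table."""
    return mean(ci_overlap(o, a) for o, a in pairs)
```

Under this assumed definition, identical intervals score 1, disjoint intervals score 0, and an estimate whose anonymized interval is shifted by half the interval length scores 0.5. Averaging the per-estimate values over a table, and the table values over the dataset, mirrors the aggregation described in the Methods.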

Results:

We observed higher utility in terms of the 95% CI overlap than according to general-purpose metrics. For example, granularity varied between 68.2% and 87.6% and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. At the individual estimate level, a non-overlapping 95% CI was detected six times across all analyses, but the overwhelming majority of estimates exhibited an overlap of more than 50%. The use case-specific configuration outperformed the generic configuration in terms of replicating scientific results at the same level of privacy.

Conclusions:

The benefits of use case-specific anonymization with preserved utility in the GCKD application indicate that anonymization can be highly context-specific. Anonymization processes that are tailored to specific anticipated use cases may, more generally, be a good tool to overcome the privacy-utility trade-off and can result in data from which reliable evidence is more likely to be generated.

Citation

Please cite as:

Pilgram L, Meurers T, Malin B, GCKD Investigators, Schaeffner E, Eckardt KU, Prasser F

The Costs of Anonymization: Case Study Using Clinical Data

J Med Internet Res 2024;26:e49445

DOI: 10.2196/49445

PMID: 38657232

PMCID: 11079766

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 30, 2023

Open Peer Review Period: May 29, 2023 - Jul 24, 2023

Date Accepted: Feb 13, 2024
