JMIR Preprints #23139: A Method for Evaluating Identity Disclosure Risk in Fully Synthetic Health Data

A Method for Evaluating Identity Disclosure Risk in Fully Synthetic Health Data

Khaled El Emam;
Lucy Mosquera;
Jason Bass

ABSTRACT

Background:

There is growing interest in data synthesis as a way to share data for secondary analysis, but a comprehensive privacy risk model for fully synthetic data is still needed: if the generative models have been overfit, it may be possible to identify individuals from the synthetic data and learn something new about them.

Objective:

The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.

Methods:

A full risk model is presented that evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this meaningful identity disclosure risk. The model is applied to samples from the Washington state hospital discharge database (2007) and the Canadian COVID-19 cases database. Both datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.

Results:

The meaningful identity disclosure risks for both of these synthesized samples were below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively) and were 5x and 10x lower than the risk values for the original datasets.
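
As a quick illustration of the threshold comparison reported above, the following minimal sketch checks the two reported risk values against the 0.09 threshold. It is not the authors' risk model, which is not reproduced here. The dataset labels and the helper names `below_threshold` and `margin` are hypothetical, and the mapping of each value to a dataset assumes that "respectively" follows the order in which the datasets are named in the Methods.

```python
# Values taken from the abstract; the pairing of value to dataset is an
# assumption based on the order in which the datasets are listed.
THRESHOLD = 0.09

reported_risk = {
    "Washington state hospital discharge (synthetic)": 0.0198,
    "Canadian COVID-19 cases (synthetic)": 0.0086,
}


def below_threshold(risk, threshold=THRESHOLD):
    """Return True if the risk is strictly below the threshold."""
    return risk < threshold


def margin(risk, threshold=THRESHOLD):
    """Return how many times smaller the risk is than the threshold."""
    return threshold / risk


for name, risk in reported_risk.items():
    print(
        f"{name}: risk={risk:.4f}, "
        f"below threshold={below_threshold(risk)}, "
        f"threshold/risk={margin(risk):.1f}"
    )
```

The ratio printed here compares each risk with the threshold only. It is not the reduction relative to the original datasets, whose risk values are not given in the abstract.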

Conclusions:

We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on two datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of synthetic data.

Citation

Please cite as:

El Emam K, Mosquera L, Bass J

Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

J Med Internet Res 2020;22(11):e23139

DOI: 10.2196/23139

PMID: 33196453

PMCID: 7704280

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Aug 2, 2020

Date Accepted: Oct 10, 2020
