Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jan 4, 2021
Date Accepted: Sep 6, 2021
Local Differential Privacy in the Medical Domain to Protect Sensitive Information: Algorithm Development and Real-World Validation
ABSTRACT
Background:
Privacy is of increasing interest in the present big data era, particularly regarding medical data. Specifically, differential privacy has emerged as the standard method for privacy-preserving data analysis and data publishing.
Objective:
We applied differential privacy to medical data with diverse parameters and checked (i) the feasibility of our algorithms with synthetic data and (ii) the balance between data privacy and utility, using machine learning techniques.
Methods:
All data were normalized to range between –1 and 1, and the bounded Laplacian method was applied to prevent the generation of out-of-bound values after applying the differential privacy algorithm. To preserve the cardinality of the categorical variables, we performed post-processing via discretization. The algorithm was evaluated using both synthetic and real-world data (eICU Collaborative Research Database). We evaluated the difference between the original and perturbed data using the misclassification rate for categorical data and the mean squared error for continuous data. Further, we compared the performances of classification models that predict in-hospital mortality using real-world data.
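The perturbation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bounded Laplacian is realized by resampling noise until the value stays in [–1, 1] (other bounded variants truncate or renormalize the density), assumes a sensitivity of (hi – lo) for a value normalized to that range, and uses hypothetical function names (`bounded_laplace`, `perturb_categorical`).

```python
import numpy as np

def bounded_laplace(x, epsilon, lo=-1.0, hi=1.0, rng=None):
    """Perturb a value already normalized to [lo, hi] with Laplace noise,
    resampling until the noisy value stays in bounds (one bounded variant).
    Assumed sensitivity: the width of the bounded range, (hi - lo)."""
    rng = rng or np.random.default_rng()
    scale = (hi - lo) / epsilon  # smaller epsilon -> larger noise
    while True:
        noisy = x + rng.laplace(0.0, scale)
        if lo <= noisy <= hi:
            return noisy

def perturb_categorical(x, levels, epsilon, rng=None):
    """Perturb a numerically encoded categorical value, then discretize
    back to the nearest original level so cardinality is preserved
    (post-processing, as in the Methods)."""
    noisy = bounded_laplace(x, epsilon, rng=rng)
    return min(levels, key=lambda v: abs(v - noisy))
```

As epsilon grows, the noise scale shrinks and the perturbed value approaches the original, which is consistent with the misclassification rate and mean squared error converging to 0 in the Results.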
Results:
The misclassification rate of categorical variables ranged between 0.49 and 0.85 when epsilon was 0.1, and it converged to 0 as epsilon increased. When epsilon was between 10² and 10³, the misclassification rate dropped rapidly to 0. Similarly, the mean squared error of continuous variables decreased as epsilon increased. The performance of the model developed from perturbed data converged to that of the model developed from the original data as epsilon increased. In particular, the accuracy of a random forest model developed from the original data was 0.801, and it ranged from 0.757 to 0.81 as epsilon increased from 0.1 to 10,000.
Conclusions:
We applied local differential privacy to medical domain data, which are diverse and high-dimensional. Higher noise may offer enhanced privacy, but it simultaneously hinders utility. We should choose an appropriate degree of noise for data perturbation to balance privacy and utility depending on specific situations.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.