Currently submitted to: JMIR Medical Informatics
Date Submitted: Apr 8, 2026
Open Peer Review Period: Apr 13, 2026 - Jun 8, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Quantifying the Impact of Anonymization-Induced Clinical Data Quality Loss: A Methodological Study Using Primary Diagnosis Codes and Hospital Length of Stay
ABSTRACT
Background:
Secondary use of electronic health record (EHR) data requires robust privacy protection. k-Anonymity is widely used to enable data sharing by ensuring that each quasi-identifier combination occurs in at least k records. However, its analytical impact on clinically meaningful data structures remains insufficiently characterized, particularly for complex diagnosis-outcome relationships. Existing evaluation approaches rarely integrate distributional fidelity with inferential reproducibility, leaving an important gap in evidence-based data governance.
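The k-anonymity criterion described above can be illustrated with a minimal sketch. It is not the ARX implementation. The column names (`age_band`, `sex`, `icd3`) and the toy records are hypothetical and chosen only to show the counting rule.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    return all(n >= k for n in group_sizes(records, quasi_identifiers).values())


def violating_groups(records, quasi_identifiers, k):
    """Combinations that occur fewer than k times (candidates for suppression)."""
    sizes = group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Hypothetical toy records, not taken from the study data.
records = [
    {"age_band": "60-69", "sex": "F", "icd3": "I50"},
    {"age_band": "60-69", "sex": "F", "icd3": "I50"},
    {"age_band": "60-69", "sex": "F", "icd3": "J18"},
    {"age_band": "70-79", "sex": "M", "icd3": "I21"},
]
qi = ["age_band", "sex"]  # assumed quasi-identifiers for illustration

print(is_k_anonymous(records, qi, 2))
print(violating_groups(records, qi, 2))
```

In this toy example the single record in the ("70-79", "M") group violates k=2. Suppressing it would restore k-anonymity but would also remove the only occurrence of its diagnosis code, which is the kind of vocabulary loss the study quantifies.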
Objective:
This study systematically evaluated the privacy-utility trade-off at k=5, 10, and 15. We quantified how k-anonymity affects the distributional properties of primary diagnosis codes and hospital length of stay (LOS), and assessed whether diagnosis-specific LOS patterns remain reproducible after anonymization.
Methods:
We analyzed 720,359 inpatient encounters from University Hospital Mannheim collected between January 2010 and September 2024. Anonymization was performed with the ARX Data Anonymization Tool using record suppression and microaggregation at three k thresholds. Distributional distortion was assessed using the Kolmogorov-Smirnov (KS) D statistic as a descriptive divergence measure, together with quantile shifts, interquartile range compression, and tail probability changes. Categorical fidelity was evaluated using the Jaccard similarity coefficient and Cramér’s V. Inferential reproducibility was assessed with linear mixed models by comparing the intraclass correlation coefficient (ICC), diagnosis-level best linear unbiased prediction (BLUP) concordance using Spearman rank correlation and Lin’s concordance correlation coefficient (CCC), and agreement in BLUP magnitude across datasets.
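Three of the measures named above can be sketched in plain Python using their textbook definitions. This is an illustrative sketch, not the authors' analysis code. The function names and inputs are assumptions, and Cramér's V and the mixed-model steps are omitted.

```python
from bisect import bisect_right


def ks_d(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = set(a) | set(b)  # the ECDFs only change at observed values
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def jaccard(codes_a, codes_b):
    """Jaccard similarity of two code vocabularies: |A & B| / |A | B|."""
    codes_a, codes_b = set(codes_a), set(codes_b)
    return len(codes_a & codes_b) / len(codes_a | codes_b)


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient with 1/n (population) moments."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx2 = sum((xi - mx) ** 2 for xi in x) / n
    sy2 = sum((yi - my) ** 2 for yi in y) / n
    return 2 * sxy / (sx2 + sy2 + (mx - my) ** 2)


# Hypothetical toy inputs, not study data.
print(ks_d([1, 2, 3, 4], [3, 4, 5, 6]))
print(jaccard({"I50", "J18", "I21"}, {"I50", "J18"}))
print(lins_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The last call shows why the CCC is reported alongside rank correlation: the two toy vectors have identical ordering, yet the constant shift between them pulls the CCC well below 1.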
Results:
The KS D statistic remained stable across anonymization levels (0.231-0.232), indicating a similar degree of LOS distributional divergence across all anonymized datasets. Median LOS increased from 4 to 5 days, the 95th percentile decreased from 24 to 17 days, and the standard deviation declined by approximately 4.9 days, while the mean changed by less than 0.025 days. Record suppression increased from 0.77% at k=5 to 2.61% at k=15. ICD-10-GM vocabulary overlap declined monotonically from Jaccard=0.624 at k=5 to 0.422 at k=15, with up to 250 three-character diagnosis categories lost. The ICC increased from 0.285 to 0.848, consistent with strong residual variance compression after anonymization rather than improved clinical signal. BLUP rank concordance remained high (Spearman ρ=0.936-0.940), whereas Lin’s CCC showed distortion in effect magnitude (0.870-0.885), and 10% of diagnosis categories showed sign reversals. Most quality loss occurred at the initial transition from the original data to k=5, with only incremental changes at higher thresholds.
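The interpretation of the ICC increase can be made concrete with the standard random-intercept definition, ICC = between-group variance / (between-group variance + residual variance). The sketch below uses hypothetical variance components, not estimates from the study, to show that shrinking the residual variance alone raises the ICC even when the between-diagnosis variance is unchanged.

```python
def icc(var_between, var_residual):
    """ICC for a random-intercept model: share of variance between groups."""
    return var_between / (var_between + var_residual)


# Hypothetical variance components chosen for illustration only.
var_between = 1.0

print(icc(var_between, 3.0))   # larger residual variance
print(icc(var_between, 0.25))  # compressed residual variance, same between-group variance
```

Under these assumed values the ICC moves from 0.25 to 0.8 without any change in the between-group component, which is why a higher ICC after anonymization need not reflect a stronger clinical signal.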
Conclusions:
This study presents an approach for evaluating the analytical consequences of anonymization in clinical data. Although k-anonymity preserved the relative ranking of diagnosis-specific LOS effects, it substantially altered distributional shape, effect magnitude, and diagnostic vocabulary. Anonymized datasets may therefore remain usable for ordinal or comparative analyses, but can be misleading for analyses requiring faithful variance structure or absolute effect estimates. The full evaluation concept, together with a structured reporting checklist, is provided.
Clinical Trial: no.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.