Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Mar 21, 2022
Date Accepted: Jul 31, 2022
An Efficient Method for De-identifying Protected Health Information in Chinese Electronic Health Records
ABSTRACT
Background:
With the popularization of electronic health records (EHRs) in China, digitalized data have great potential for the development of real-world medical research. However, these data usually contain a large amount of protected health information (PHI), and using them directly may cause privacy leakage. The task of PHI de-identification in EHRs can be regarded as a named entity recognition problem. Rule-based, machine learning-based, and deep learning-based methods have been proposed to solve this problem. However, these methods still face two difficulties: insufficient Chinese EHR data and the complex features of the Chinese language.
Objective:
We introduce an efficient and effective model that addresses the difficulties of PHI de-identification in Chinese EHRs.
Methods:
We propose a new model that combines Tiny Bert as a text feature extraction module with a conditional random field (CRF) as a prediction module for de-identifying PHI in Chinese EHRs. In addition, we propose a hybrid data augmentation method that integrates a sentence generation strategy and a mention replacement strategy to overcome the shortage of Chinese EHR data.
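As an illustration only, a mention replacement strategy of the kind named above could be sketched as follows. The abstract does not give the authors' procedure, so everything here is an assumption: the BIO tagging scheme, the character-level tokens, the `NAME` label, the `mention_bank` dictionary, and the function name `replace_mentions` are all hypothetical.

```python
import random


def replace_mentions(tokens, tags, mention_bank, rng):
    """Swap each BIO-tagged mention for another mention of the same type.

    tokens: list of characters; tags: BIO tags aligned with tokens.
    mention_bank: hypothetical dict mapping a label to candidate strings.
    """
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            label = tag[2:]
            # Find the end of the current mention.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + label:
                j += 1
            # Draw a replacement of the same type and re-tag it.
            replacement = rng.choice(mention_bank[label])
            new_tokens.extend(replacement)
            new_tags.extend(["B-" + label] + ["I-" + label] * (len(replacement) - 1))
            i = j
        else:
            # Non-mention characters are copied unchanged.
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags


# Made-up example sentence and mention bank (not from the authors' data).
tokens = list("患者张三入院")
tags = ["O", "O", "B-NAME", "I-NAME", "O", "O"]
bank = {"NAME": ["李四", "王小明"]}
augmented_tokens, augmented_tags = replace_mentions(tokens, tags, bank, random.Random(0))
```

Each call yields a new labeled sentence whose context is unchanged but whose PHI mentions differ, which is one plausible way to enlarge a small annotated corpus.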
Results:
We compare our method with five baselines that use different Bert models as their feature extraction modules. Experimental results on our collected Chinese EHRs demonstrate that our method achieves the best performance (Micro Precision: 98.7%, Micro Recall: 99.13%, Micro F1-score: 98.91%) and the highest efficiency (40% faster) among all the compared methods.
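For readers unfamiliar with micro-averaged metrics, the following minimal sketch shows how they are commonly computed by pooling counts over all sentences and entity types. It assumes entity-level matching on `(start, end, label)` spans, which the abstract does not specify; the function names `micro_prf` and `f1_from` and the example spans are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1.

    gold, pred: lists (one item per sentence) of sets of
    (start, end, label) spans. A predicted span counts as correct
    only if it matches a gold span exactly (an assumption).
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall, f1_from(precision, recall)


# Made-up spans: the second sentence has one boundary error.
gold = [{(2, 4, "NAME")}, {(0, 3, "DATE"), (5, 7, "NAME")}]
pred = [{(2, 4, "NAME")}, {(0, 3, "DATE"), (5, 8, "NAME")}]
precision, recall, f1 = micro_prf(gold, pred)
```

Applying `f1_from` to the reported precision (0.987) and recall (0.9913) gives approximately 0.9891, which agrees with the reported Micro F1-score.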
Conclusions:
Compared with the baselines, Tiny Bert fine-tuned on our proposed augmented dataset retains its efficiency advantage while also achieving better performance.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.