JMIR Preprints #38154: An Efficient Method for De-identifying Protected Health Information in Chinese Electronic Health Records

An Efficient Method for De-identifying Protected Health Information in Chinese Electronic Health Records

Peng Wang;
Yong Li;
Liang Yang;
Simin Li;
Linfeng Li;
Zehan Zhao;
Shaopei Long;
Fei Wang;
Hongqian Wang;
Ying Li;
Chengliang Wang

ABSTRACT

Background:

With the popularization of electronic health records (EHRs) in China, digitized data hold great potential for real-world medical research. However, these data usually contain a large amount of protected health information (PHI), and using them directly may lead to privacy leakage. PHI de-identification in EHRs can be regarded as a named entity recognition problem. Rule-based, machine learning-based, and deep learning-based methods have been proposed to solve this problem. However, these methods still struggle with the scarcity of Chinese EHR data and the complex features of the Chinese language.
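To illustrate how de-identification reduces to named entity recognition, the following is a minimal sketch that assumes a character-level BIO tagging scheme. The label names NAME and DATE, the placeholder format, and the function name `mask_phi` are hypothetical choices made for this example and are not taken from the paper.

```python
from typing import List


def mask_phi(chars: List[str], tags: List[str]) -> str:
    """Replace every tagged PHI span with a placeholder such as <NAME>.

    `tags` holds one BIO tag per character: "O", "B-<LABEL>" or "I-<LABEL>".
    """
    out = []
    i = 0
    while i < len(chars):
        tag = tags[i]
        if tag == "O":
            # Non-PHI characters are copied through unchanged.
            out.append(chars[i])
            i += 1
        else:
            # A "B-" tag (or a stray "I-" tag) opens a PHI span.
            label = tag[2:]
            i += 1
            # Consume the continuation tags that belong to the same span.
            while i < len(chars) and tags[i] == "I-" + label:
                i += 1
            out.append("<" + label + ">")
    return "".join(out)


# Hypothetical sentence: "Patient Zhang San was admitted in 2020."
chars = list("患者张三于2020年入院")
tags = ["O", "O", "B-NAME", "I-NAME", "O",
        "B-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "O", "O"]
masked = mask_phi(chars, tags)
```

Once a tagger predicts the tag sequence, masking is a single pass over the text, so the quality of de-identification depends almost entirely on the recognition step.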

Objective:

We introduce an efficient and effective model that addresses these difficulties in PHI de-identification for Chinese EHRs.

Methods:

We propose a new model that combines TinyBERT as the text feature extraction module with a conditional random field (CRF) as the prediction module for de-identifying PHI in Chinese EHRs. In addition, we propose a hybrid data augmentation method, which integrates a sentence generation strategy and a mention replacement strategy, to compensate for the scarcity of Chinese EHR data.
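The abstract names the mention replacement strategy without describing it, so the following is only one plausible reading, not the authors' implementation: each annotated PHI mention is swapped for another mention of the same type and the tags are rebuilt to match. The BIO scheme, the `lexicon` structure, and all names below are assumptions made for this sketch. The sentence generation strategy is not covered.

```python
import random
from typing import Dict, List, Tuple


def replace_mentions(
    chars: List[str],
    tags: List[str],
    lexicon: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Return a new (chars, tags) pair with each PHI mention swapped
    for a random same-type mention drawn from `lexicon`."""
    new_chars: List[str] = []
    new_tags: List[str] = []
    i = 0
    while i < len(chars):
        tag = tags[i]
        if tag.startswith("B-"):
            label = tag[2:]
            # Find the end of the current mention.
            j = i + 1
            while j < len(chars) and tags[j] == "I-" + label:
                j += 1
            # Ignore empty lexicon entries; keep the original mention
            # when no replacement is available for this label.
            candidates = [c for c in lexicon.get(label, []) if c]
            mention = list(rng.choice(candidates)) if candidates else chars[i:j]
            new_chars.extend(mention)
            # Rebuild tags so they stay aligned with the new mention length.
            new_tags.extend(["B-" + label] + ["I-" + label] * (len(mention) - 1))
            i = j
        else:
            new_chars.append(chars[i])
            new_tags.append(tag)
            i += 1
    return new_chars, new_tags


# Hypothetical example: one NAME mention replaced by a longer name.
src_chars = list("患者张三入院")
src_tags = ["O", "O", "B-NAME", "I-NAME", "O", "O"]
aug_chars, aug_tags = replace_mentions(
    src_chars, src_tags, {"NAME": ["李小明"]}, random.Random(0)
)
```

Because the replacement can differ in length from the original mention, the tags are regenerated rather than copied, which keeps every augmented sentence a valid training example.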

Results:

We compare our method with five baselines that use different BERT models as their feature extraction modules. Experimental results on our collected Chinese EHRs show that our method achieves the best performance (micro-precision: 98.7%, micro-recall: 99.13%, micro-F1 score: 98.91%) and the highest efficiency (40% faster) among all compared methods.
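For readers unfamiliar with micro-averaging, the sketch below pools true positives, false positives, and false negatives across entity types before computing precision, recall, and F1. The per-type counts are invented for illustration only; the final lines check that the reported F1 score is the harmonic mean of the reported precision and recall.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1.

    `counts` maps an entity type to (true positives, false positives,
    false negatives); the counts are summed over types before dividing.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented counts, used only to show the call shape.
toy_p, toy_r, toy_f1 = micro_prf({"NAME": (8, 2, 0), "DATE": (2, 0, 2)})

# Consistency check on the figures reported in the abstract.
reported_f1 = f1_score(0.987, 0.9913)
print(round(reported_f1 * 100, 2))
```

Micro-averaging weights every entity mention equally, so frequent PHI types dominate the score.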

Conclusions:

Compared with the baselines, TinyBERT fine-tuned on our augmented dataset retains its efficiency advantage while also improving performance.

Citation

Please cite as:

Wang P, Li Y, Yang L, Li S, Li L, Zhao Z, Long S, Wang F, Wang H, Li Y, Wang C

An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation

JMIR Med Inform 2022;10(8):e38154

DOI: 10.2196/38154

PMID: 36040774

PMCID: 9472063

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Mar 21, 2022

Date Accepted: Jul 31, 2022

Per the author's request the PDF is not available.
