Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Mar 21, 2022
Date Accepted: Jul 31, 2022

The final, peer-reviewed published version of this preprint can be found here:

An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation

Wang P, Li Y, Yang L, Li S, Li L, Zhao Z, Long S, Wang F, Wang H, Li Y, Wang C

JMIR Med Inform 2022;10(8):e38154

DOI: 10.2196/38154

PMID: 36040774

PMCID: 9472063

An Efficient Method for De-identifying Protected Health Information in Chinese Electronic Health Records

  • Peng Wang
  • Yong Li
  • Liang Yang
  • Simin Li
  • Linfeng Li
  • Zehan Zhao
  • Shaopei Long
  • Fei Wang
  • Hongqian Wang
  • Ying Li
  • Chengliang Wang

ABSTRACT

Background:

With the popularization of electronic health records (EHRs) in China, the use of digitalized data has great potential for the development of real-world medical research. However, these data usually contain a large amount of protected health information (PHI), and using them directly may cause privacy leakage. PHI de-identification in EHRs can be regarded as a named entity recognition problem. Rule-based, machine learning-based, and deep learning-based methods have been proposed to solve this problem. However, these methods still face two difficulties: insufficient Chinese EHR data and the complex features of the Chinese language.
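
To make the named entity recognition framing concrete, the following is a minimal sketch of how character-level BIO tags could be turned into de-identified text. The tag name `NAME`, the placeholder format, and the sample sentence are illustrative assumptions; they are not taken from the authors' annotation scheme.

```python
from typing import List, Tuple


def extract_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, entity_type) spans from BIO labels."""
    spans = []
    start, kind = None, None
    for i, label in enumerate(labels):
        # Close the open span when the current label does not continue it.
        if start is not None and label != f"I-{kind}":
            spans.append((start, i, kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    if start is not None:
        spans.append((start, len(labels), kind))
    return spans


def deidentify(chars: List[str], labels: List[str]) -> str:
    """Replace every tagged PHI span with a placeholder such as <NAME>."""
    out, cursor = [], 0
    for start, end, kind in extract_spans(labels):
        out.extend(chars[cursor:start])  # copy the non-PHI text before the span
        out.append(f"<{kind}>")          # hypothetical placeholder format
        cursor = end
    out.extend(chars[cursor:])
    return "".join(out)


# Hypothetical example: one label per Chinese character.
chars = list("患者张三入院")
labels = ["O", "O", "B-NAME", "I-NAME", "O", "O"]
masked = deidentify(chars, labels)
```

In this sketch the recognizer only has to produce one label per character; the masking step itself is plain span replacement.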

Objective:

We introduce an efficient and effective model to address the difficulties of PHI de-identification in Chinese EHRs.

Methods:

We propose a new model that combines TinyBERT as a text feature extraction module with a conditional random field (CRF) as a prediction module for de-identifying PHI in Chinese EHRs. In addition, we propose a hybrid data augmentation method, which integrates a sentence generation strategy and a mention replacement strategy, to overcome the scarcity of Chinese EHR data.
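
The abstract names a mention replacement strategy without giving details. One common reading, sketched below under that assumption, is to swap each PHI mention for another mention of the same entity type and regenerate the BIO tags. The function name `replace_mentions`, the `pool` dictionary, and the sample values are hypothetical, and the sentence generation strategy is not sketched here.

```python
import random
from typing import Dict, List, Tuple


def replace_mentions(
    chars: List[str],
    labels: List[str],
    pool: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Swap each PHI mention for a same-type mention drawn from `pool`.

    Assumes every string in `pool` is non-empty.
    """
    new_chars: List[str] = []
    new_labels: List[str] = []
    i = 0
    while i < len(chars):
        if labels[i].startswith("B-"):
            kind = labels[i][2:]
            j = i + 1
            # Advance to the end of the current mention.
            while j < len(chars) and labels[j] == f"I-{kind}":
                j += 1
            candidates = pool.get(kind)
            # Keep the original mention when no substitute of this type exists.
            substitute = rng.choice(candidates) if candidates else "".join(chars[i:j])
            new_chars.extend(substitute)
            new_labels.extend([f"B-{kind}"] + [f"I-{kind}"] * (len(substitute) - 1))
            i = j
        else:
            new_chars.append(chars[i])
            new_labels.append(labels[i])
            i += 1
    return new_chars, new_labels


# Hypothetical example: the two-character name is replaced by a four-character one.
pool = {"NAME": ["欧阳明月"]}
aug_chars, aug_labels = replace_mentions(
    list("患者张三入院"),
    ["O", "O", "B-NAME", "I-NAME", "O", "O"],
    pool,
    random.Random(0),
)
```

Because the tags are rebuilt from the substitute's length, the augmented sentence stays aligned with its labels even when the new mention is longer or shorter than the original.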

Results:

We compare our method with five baselines that use different BERT models as their feature extraction modules. Experimental results on our collected Chinese EHRs demonstrate that our method achieves the best performance (micro-precision: 98.7%, micro-recall: 99.13%, micro-F1 score: 98.91%) and the highest efficiency (40% faster) among all compared methods.
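
As a reminder of how micro-averaged scores are usually computed, the sketch below pools true positives, false positives, and false negatives across entity types before computing precision, recall, and F1. The per-type counts are invented for illustration and are not the paper's data; only the final consistency check uses the precision and recall reported above.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-average over entity types; each value is (TP, FP, FN)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical per-type counts, for illustration only.
p, r, f1 = micro_prf({"NAME": (8, 1, 1), "DATE": (10, 1, 0)})

# Consistency check on the reported figures: P = 98.7% and R = 99.13%.
reported_f1 = round(f1_score(0.987, 0.9913) * 100, 2)  # 98.91
```

The reported micro-F1 of 98.91% agrees with the harmonic mean of the reported micro-precision and micro-recall.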

Conclusions:

Compared with the baselines, TinyBERT fine-tuned on our augmented data set retains its efficiency advantage while also improving performance.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.