Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Apr 13, 2023
Date Accepted: Nov 22, 2023

The final, peer-reviewed published version of this preprint can be found here:

OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study

Liu J, Gupta S, Chen A, Wang CK, Mishra P, Dai HJ, Wong ZSY, Jonnagaddala J

J Med Internet Res 2023;25:e48145

DOI: 10.2196/48145

PMID: 38055317

PMCID: 10733816

OpenDeID: A hybrid de-identification pipeline for unstructured electronic health record text notes based on rules and transformers

  • Jiaxing Liu
  • Shalini Gupta
  • Aipeng Chen
  • Chen-Kai Wang
  • Pratik Mishra
  • Hong-Jie Dai
  • Zoie Shui-Yee Wong
  • Jitendra Jonnagaddala

ABSTRACT

Background:

Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must in many cases be removed to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective for de-identification. However, very few studies have investigated the combination of transformer-based language models and rules.

Objective:

The objective of this study is to develop a hybrid de-identification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pre-trained word embeddings and transformer-based language models.

Methods:

In this study, we present a hybrid de-identification pipeline called OpenDeID, which was developed using an Australian multicenter EHR-based corpus called the OpenDeID corpus. The OpenDeID corpus consists of 2,100 pathology reports with 38,414 SHI entities from 1,833 patients. The OpenDeID pipeline combines associative rules, supervised deep learning, and pre-trained language models.
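To illustrate what a hybrid of rules and a learned tagger can look like, the following is a minimal sketch and not the OpenDeID implementation. The abstract does not specify the rules or how rule output and model output are combined, so the regular expressions, the labels, the `toy_tagger` stand-in for a fine-tuned transformer, and the "model spans first, then non-overlapping rule spans" merge policy are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive

# Hypothetical rules for illustration only; the abstract does not list the real ones.
RULES = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{4} \d{3} \d{3}\b")),
]


def rule_spans(text: str) -> List[Span]:
    """Return one span per regex match, labelled with the rule's SHI category."""
    return [(m.start(), m.end(), label)
            for label, pattern in RULES
            for m in pattern.finditer(text)]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Keep every model span, then add rule spans that overlap nothing kept so far.

    Assumes the model spans do not overlap each other.
    """
    kept = list(model)
    for start, end, label in rules:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def deidentify(text: str, model_tagger: Callable[[str], List[Span]]) -> str:
    """Replace each detected span with a bracketed category placeholder."""
    spans = merge_spans(model_tagger(text), rule_spans(text))
    pieces, cursor = [], 0
    for start, end, label in spans:
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def toy_tagger(text: str) -> List[Span]:
    """Stand-in for a fine-tuned transformer tagger: finds one hard-coded name."""
    name = "Jane Citizen"
    i = text.find(name)
    return [(i, i + len(name), "NAME")] if i >= 0 else []


note = "Patient Jane Citizen seen on 03/02/2019, contact 0412 345 678."
print(deidentify(note, toy_tagger))
# → Patient [NAME] seen on [DATE], contact [PHONE].
```

In a real system the `toy_tagger` argument would be replaced by a function that runs a token-classification model and converts its predictions to character spans; the merge policy is a design choice that trades precision against recall.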

Results:

The OpenDeID pipeline achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various pre-processing and post-processing rules. The pipeline has been deployed at a large tertiary teaching hospital and has processed over 8,000 unstructured EHR text notes in real time.
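For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it over sets of entity spans, assuming strict matching (a prediction counts only if its start, end, and label all equal a gold entity). The abstract does not state which matching or averaging scheme was used, so this choice and the example spans are assumptions for illustration, not the study's evaluation code.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Strict entity-level scores: a prediction is correct only on an exact span+label match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 3 gold entities, 4 predictions, 2 exact matches.
gold = {(8, 20, "NAME"), (29, 39, "DATE"), (49, 61, "PHONE")}
pred = {(8, 20, "NAME"), (29, 39, "DATE"), (49, 60, "PHONE"), (0, 7, "NAME")}

p, r, f1 = precision_recall_f1(gold, pred)
print(round(p, 4), round(r, 4), round(f1, 4))
# → 0.5 0.6667 0.5714
```

Note that the PHONE prediction misses by one character and therefore counts as both a false positive and a false negative under strict matching; a relaxed (overlap-based) scheme would score it differently.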

Conclusions:

The OpenDeID pipeline is a hybrid pipeline for de-identifying SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.