Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 13, 2023
Date Accepted: Nov 22, 2023
OpenDeID: A hybrid de-identification pipeline for unstructured electronic health record text notes based on rules and transformers
ABSTRACT
Background:
Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must often be removed to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective in de-identification. However, very few studies have investigated the combination of transformer-based language models and rules.
Objective:
The objective of this study is to develop a hybrid de-identification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pre-trained word embeddings and transformer-based language models.
Methods:
In this study, we present a hybrid de-identification pipeline called OpenDeID, developed using an Australian multicenter EHR-based corpus called the OpenDeID corpus. The OpenDeID corpus consists of 2,100 pathology reports with 38,414 SHI entities from 1,833 patients. The OpenDeID pipeline combines associative rules, supervised deep learning, and pre-trained language models in a hybrid approach.
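To make the idea of a hybrid pipeline concrete, the following is a minimal sketch of one way rule-based matches and model predictions can be combined. It is an illustration only and not the OpenDeID implementation: the regular expressions, the `Span` type, the `toy_model` stand-in for a fine-tuned transformer tagger, and the choice to let rule matches take precedence over overlapping model spans are all assumptions made for this example.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


# Hypothetical rules for illustration; the paper's actual rules are not shown here.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "MRN": re.compile(r"\bMRN\d{6}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Collect every span matched by a rule."""
    return [
        Span(m.start(), m.end(), label)
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(rule: List[Span], model: List[Span]) -> List[Span]:
    """Keep spans in priority order (rules first), skipping any overlap.

    Assumption: rule matches win over overlapping model predictions.
    """
    kept: List[Span] = []
    for s in list(rule) + list(model):
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a bracketed label placeholder."""
    out, cursor = [], 0
    for s in sorted(spans):
        out.append(text[cursor:s.start])
        out.append(f"[{s.label}]")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


def deidentify(text: str, model_predict: Callable[[str], List[Span]]) -> str:
    """Run rules and the model, merge their spans, and redact the text."""
    return redact(text, merge_spans(rule_spans(text), model_predict(text)))


def toy_model(text: str) -> List[Span]:
    """Stand-in for a fine-tuned transformer tagger (hypothetical)."""
    name = "John Smith"
    i = text.find(name)
    return [Span(i, i + len(name), "NAME")] if i >= 0 else []


note = "Patient John Smith (MRN123456) seen on 12/03/2021."
redacted = deidentify(note, toy_model)
print(redacted)
```

In practice, `toy_model` would be replaced by a function that runs a token-classification model and converts its token labels to character spans; the merge policy is a design choice that depends on which component is more reliable for each entity type.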
Results:
The OpenDeID pipeline achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various pre-processing and post-processing rules. The pipeline has been deployed at a large tertiary teaching hospital and has processed over 8,000 unstructured EHR text notes in real time.
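For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it over sets of predicted and gold entities under strict matching (identical start, end, and label). This is an assumption made for illustration: the abstract does not state the matching criterion or averaging scheme behind the reported score, and the entity tuples here are invented toy data.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, label)


def precision_recall_f1(gold: Iterable[Entity], pred: Iterable[Entity]):
    """Strict-match precision, recall, and F1 over entity sets."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): two of three predictions match the gold entities.
gold = [(0, 5, "NAME"), (10, 20, "DATE"), (25, 30, "ID")]
pred = [(0, 5, "NAME"), (10, 20, "DATE"), (40, 45, "ID")]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```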
Conclusions:
OpenDeID is a hybrid pipeline for de-identifying SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. As future work, we will undertake external validation to further evaluate its effectiveness.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.