JMIR Preprints #48443: De-Identification of Chinese-English Code-Mixed Clinical Text Using Pre-Trained Language Models and In Context-Learning of Large Language Models

De-Identification of Chinese-English Code-Mixed Clinical Text Using Pre-Trained Language Models and In Context-Learning of Large Language Models

Ching-Tai Chen;
You-Qian Lee;
Chien-Chan Chen;
Pei-Tsz Chen;
Chi-Shin Wu;
Hong-Jie Dai

ABSTRACT

Background:

The widespread use of electronic health records in clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintain privacy. However, a significant portion of this information is recorded as unstructured text, which is challenging to de-identify. In countries such as Taiwan, medical records may be written in a mixture of more than one language, a practice referred to as code-mixing (CM). Most current clinical natural language processing techniques are designed for monolingual text, so the de-identification of CM text remains an open need.

Objective:

The aim of this study was to investigate the effectiveness and underlying mechanism of fine-tuned pre-trained language models (PLMs) in identifying PHIs in a CM context. We also aimed to evaluate the potential of prompting large language models (LLMs) to recognize PHIs in a zero-shot manner.

Methods:

We compiled the first clinical CM de-identification dataset consisting of texts written in Chinese and English. We explored the effectiveness of fine-tuning pre-trained language models (PLMs) in recognizing PHIs in CM content. In particular, we probed the outputs of the developed models to examine whether PLMs exploit naming regularity and mention coverage to achieve superior performance. Furthermore, we investigated the potential of prompt-based in-context learning of large language models (LLMs) in recognizing PHIs in CM text.
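
As an illustration of the mention coverage idea named above, the following is a minimal sketch and not the authors' implementation. It assumes mention coverage is measured as the fraction of test-set PHI mentions whose exact text and type also occur in the training set; the abstract does not give the definition used in the study. The function name, the PHI type labels, and the toy mentions are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# A mention is a (surface text, PHI type) pair. The type labels are hypothetical.
Mention = Tuple[str, str]


def mention_coverage(train: Iterable[Mention], test: Iterable[Mention]) -> float:
    """Return the fraction of test mentions that also appear in the training set.

    Assumed definition: a test mention counts as covered when the same
    (text, type) pair was seen during training.
    """
    seen: Set[Mention] = set(train)
    test_list = list(test)
    if not test_list:
        return 0.0  # avoid division by zero on an empty test set
    covered = sum(1 for mention in test_list if mention in seen)
    return covered / len(test_list)


# Toy, made-up mentions from Chinese-English mixed notes.
train_mentions = [("王小明", "PATIENT"), ("Taipei", "CITY"), ("2020/01/05", "DATE")]
test_mentions = [
    ("王小明", "PATIENT"),
    ("Kaohsiung", "CITY"),
    ("2020/01/05", "DATE"),
    ("陳醫師", "DOCTOR"),
]

coverage = mention_coverage(train_mentions, test_mentions)
print(f"mention coverage: {coverage:.2f}")
```

A low value on such a measure would suggest that a model has to rely on contextual cues or naming regularity rather than memorized mentions.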

Results:

The developed methods were evaluated on a CM de-identification corpus of 1,700 discharge summaries. We observed that different PHI types tended to occur in different types of language-mixed sentences, and that PLMs could effectively recognize PHIs by exploiting the learned name regularity. However, the models may yield suboptimal results when the regularity is weak or when mentions contain unknown words for which the models cannot generate good representations. We also found that the availability of CM training instances is essential for model performance. Furthermore, the LLM-based de-identification method is a feasible and appealing approach that can be controlled and enhanced through natural language prompts.

Conclusions:

The study contributes to understanding the underlying mechanism of PLMs in addressing the de-identification process in a CM context and highlights the significance of incorporating CM training instances into the model training phase. The LLM-based de-identification method is a feasible approach, but carefully crafted prompts are essential to avoid unwanted output. In addition, the use of such methods in the hospital setting requires careful consideration of data security and privacy concerns. Further research could explore the augmentation of PLMs and LLMs with external knowledge to improve their strength in recognizing rare PHIs.
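
The point about carefully crafted prompts can be sketched as follows. This is an assumption-laden illustration, not the prompt or pipeline used in the study: the label set, the prompt wording, the tab-separated output format, and the helper names `build_prompt` and `parse_response` are all hypothetical, and no LLM API is called. The sketch constrains the requested output format and then discards any response line that does not follow it or that names text absent from the note.

```python
import re
from typing import List, Tuple

# Hypothetical PHI label set; the study's actual categories are not listed in the abstract.
PHI_TYPES = ("NAME", "DATE", "LOCATION", "ID")


def build_prompt(note: str) -> str:
    """Build a zero-shot prompt that fixes a strict, line-based output format."""
    types = ", ".join(PHI_TYPES)
    return (
        "You are a de-identification assistant for clinical notes written in "
        "a mixture of Chinese and English.\n"
        f"List every protected health information (PHI) mention of these types: {types}.\n"
        "Output one mention per line as TYPE<TAB>exact text copied from the note. "
        "Output NONE if there is no PHI. Do not add explanations.\n\n"
        f"Note:\n{note}\n"
    )


# One response line: an uppercase type, a tab, then the mention text.
LINE_RE = re.compile(r"^(?P<type>[A-Z]+)\t(?P<text>.+)$")


def parse_response(response: str, note: str) -> List[Tuple[str, str]]:
    """Keep only well-formed lines whose type is known and whose text occurs in the note."""
    mentions: List[Tuple[str, str]] = []
    for line in response.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # chatter or malformed output
        phi_type, text = match.group("type"), match.group("text")
        if phi_type in PHI_TYPES and text in note:
            mentions.append((phi_type, text))
    return mentions


# Made-up note and a made-up model response containing unwanted output.
note = "病人王小明於2020/01/05 admitted to ward."
fake_response = "NAME\t王小明\nDATE\t2020/01/05\nSure, here is the list\nNAME\tJohn"

prompt = build_prompt(note)
parsed = parse_response(fake_response, note)
print(parsed)
```

Checking each returned span against the source note is one simple guard against mentions that the model invents; it does not address the data security concerns of sending clinical text to an external service.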

Citation

Please cite as:

Chen CT, Lee YQ, Chen CC, Chen PT, Wu CS, Dai HJ

Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study

J Med Internet Res 2024;26:e48443

DOI: 10.2196/48443

PMID: 38271060

PMCID: 10853853

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Apr 24, 2023

Date Accepted: Dec 5, 2023
