Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Feb 11, 2022
Date Accepted: May 12, 2022
Deep Phenotyping on Chinese Electronic Health Records by Recognizing Linguistic Patterns of Phenotypic Narratives with a Sequence Motif Discovery Tool: Algorithm Development and Validation
ABSTRACT
Background:
Phenotype information in electronic health records (EHRs) is mainly recorded in unstructured free text, which cannot be used directly for clinical research. EHR-based deep phenotyping methods can structure phenotype information in EHRs with high fidelity, which has made them a focus of medical informatics. However, developing a deep phenotyping method for non-English EHRs (such as Chinese EHRs) is challenging. Although numerous EHR resources exist in China, the fine-grained annotation data needed to develop deep phenotyping methods are limited, so any such method for Chinese EHRs must be developed in a low-resource scenario.
Objective:
In this study, we aimed to develop a deep phenotyping method for Chinese EHRs that generalizes well despite being based on limited fine-grained annotation data.
Methods:
The core of the methodology was to learn linguistic patterns of phenotype descriptions in Chinese EHRs with a sequence motif discovery tool and then perform deep phenotyping of Chinese EHRs by recognizing learned linguistic patterns in free text. Specifically, 1,000 Chinese EHRs were manually annotated based on a fine-grained information model, PhenoSSU (the Semantic Structured Unit of Phenotypes). The annotation dataset was randomly divided into a training set (70%) and a testing set (30%). The process for mining linguistic patterns could be divided into three steps: First, free text in the training set was encoded as a single-letter sequence (P: phenotype, A: attribute). Second, a biological sequence analysis tool named MEME motif discovery was used to identify motifs in the single-letter sequence. Finally, the identified motifs were reduced to a series of regular expressions representing linguistic patterns of PhenoSSU instances in Chinese EHRs. Based on the discovered linguistic patterns, we developed a deep phenotyping method for Chinese EHRs, including a deep learning–based model for named entity recognition and a pattern recognition-based method for attribute prediction.
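The encoding and pattern-matching idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag names, the letter "O" for tokens that are neither phenotype nor attribute, the helper functions, and the regular expression `A*PA*` are all assumptions made for illustration, and the pattern is not one of the regular expressions reported in this study. Entity tags are assumed to come from an upstream named entity recognition step.

```python
import re

# P = phenotype and A = attribute follow the abstract; "O" for every other
# token is an assumption made for this sketch.
LETTER = {"phenotype": "P", "attribute": "A"}


def encode(tagged_tokens):
    """Map (token, tag) pairs to a single-letter string, one letter per token."""
    return "".join(LETTER.get(tag, "O") for _, tag in tagged_tokens)


# Illustrative pattern only (hypothetical, not taken from the study):
# any attributes directly before a phenotype, plus any directly after it.
PATTERN = re.compile(r"A*PA*")


def find_instances(tagged_tokens):
    """Group each phenotype with the adjacent attributes matched by PATTERN."""
    seq = encode(tagged_tokens)
    instances = []
    for match in PATTERN.finditer(seq):
        # One letter per token, so match offsets index the token list directly.
        span = tagged_tokens[match.start():match.end()]
        phenotype = next(tok for tok, tag in span if tag == "phenotype")
        attributes = [tok for tok, tag in span if tag == "attribute"]
        instances.append({"phenotype": phenotype, "attributes": attributes})
    return instances


# Hypothetical tagged sentence: "patient, mild cough, fever for 3 days".
tokens = [
    ("患者", "other"),      # patient
    ("轻度", "attribute"),  # mild
    ("咳嗽", "phenotype"),  # cough
    ("，", "other"),
    ("发热", "phenotype"),  # fever
    ("3天", "attribute"),   # 3 days
]
result = find_instances(tokens)
```

Because each token maps to exactly one letter, a match position in the encoded string identifies the tokens that form one candidate phenotype instance. In the study itself, the patterns were derived from motifs found by MEME rather than written by hand as in this sketch.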
Results:
Fifty-one statistically significant sequence motifs were mined from the 700 Chinese EHRs in the training set and combined into six regular expressions. We found that these six regular expressions could be learned from 134 (±9.7) annotated EHRs in the training set. The deep phenotyping algorithm for Chinese EHRs recognized PhenoSSU instances with an overall accuracy of 0.844 on the testing set. For the subtask of entity recognition, the algorithm achieved an F1-score of 0.898 with the BERT-BiLSTM-CRF model; for the subtask of attribute prediction, it achieved a weighted accuracy of 0.940 with the linguistic pattern-based method.
Conclusions:
We developed a simple but effective strategy to perform deep phenotyping of Chinese EHRs with limited fine-grained annotation data. Our work will promote the secondary use of Chinese EHRs and may inspire similar efforts in other non-English-speaking countries.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.