Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 30, 2020
Date Accepted: Feb 19, 2021

The final, peer-reviewed published version of this preprint can be found here:

A Hybrid Model for Family History Information Identification and Relation Extraction: Development and Evaluation of an End-to-End Information Extraction System

Kim Y, Heider PM, Lally IRH, Meystre SM

JMIR Med Inform 2021;9(4):e22797

DOI: 10.2196/22797

PMID: 33885370

PMCID: 8103307

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A Hybrid Model for Entity Identification and Relation Extraction of Family History Information

  • Youngjun Kim
  • Paul M Heider
  • Isabel R H Lally
  • Stéphane M Meystre

ABSTRACT

Background:

Family history information is important to assess the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision-making. We describe the end-to-end information extraction system the Medical University of South Carolina team developed when participating in the 2019 n2c2/OHNLP shared task.

Objective:

This task involves identifying mentions of family members and observations in electronic health record text notes, and recognizing the relations between family members, observations, and living status. Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving the performance of two subtasks by exploiting additional labeled data and clinical text-based embedding models.

Methods:

We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained two relation extraction models. The first model determined the living status of each family member. The second model identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of post-challenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models with the union of these two data sets. We also pre-trained language models using clinical notes from the MIMIC-III clinical database.
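The voting ensemble described above can be illustrated with a minimal sketch. It assumes each model emits one label per token and that the ensemble takes a simple per-token majority vote. The abstract does not specify the voting scheme, the tie-breaking rule, or the label set, so the function name `majority_vote`, the BIO-style labels, and the tie-break below are hypothetical choices made for illustration only.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several models by majority vote.

    `predictions` holds one label sequence per model, all of equal length.
    Ties are broken in favor of the earliest model in the list (an
    assumption; the source does not describe tie-breaking).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        top = max(counts.values())
        # Pick the first label (in model order) that reaches the top count.
        winner = next(lbl for lbl in token_labels if counts[lbl] == top)
        voted.append(winner)
    return voted


# Hypothetical outputs of three models for a four-token sentence.
model_a = ["B-FamilyMember", "O", "B-Observation", "I-Observation"]
model_b = ["B-FamilyMember", "O", "O", "I-Observation"]
model_c = ["O", "O", "B-Observation", "I-Observation"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

In this toy input each disagreement is resolved by the two models that agree, which is the intuition behind an ensemble outperforming its individual members.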

Results:

The voting ensemble achieved better performance than the individual classifiers. In the entity identification task, the best-performing system reached a precision of 78.90% and a recall of 83.84%. Our NLP system ranked third for entity identification and fourth for relation extraction in the challenge. Our end-to-end pipeline system substantially benefited from the combination of the two data sets. Compared to our official submission, the revised system yielded significantly better performance (p < 0.05), with F1-scores of 86.02% and 72.48% for entity identification and relation extraction, respectively.
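For readers comparing the precision and recall figures with the F1-scores, the following sketch shows how an F1-score is computed as the harmonic mean of precision and recall. Applying it to the reported 78.90% precision and 83.84% recall is an illustration only: the abstract does not state the F1-score for that configuration, and the function name `f1_score` is a hypothetical helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the best entity identification system.
implied_f1 = f1_score(78.90, 83.84)
print(f"{implied_f1:.2f}%")
```

Under these inputs the harmonic mean comes out at roughly 81.3%, which sits below the 86.02% reported for the revised system.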

Conclusions:

We demonstrated that a hybrid model could be used to successfully extract family history information recorded in unstructured free-text notes. In this study, our approach of entity identification as a sequence labeling problem produced satisfactory results. Our post-challenge efforts significantly improved performance by leveraging additional labeled data and using word vector representations learned from large collections of clinical notes.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.