Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 11, 2020
Date Accepted: Jun 21, 2021
Construction of Genealogical Knowledge Graphs from Obituaries: A Multi-task Neural Network Extraction System
ABSTRACT
Background:
Genealogical information, such as family trees, is essential to many areas of biomedical research, such as disease heritability and risk prediction. Researchers have used policyholder and dependent information in medical claims data and emergency contacts in electronic health records (EHRs) to infer family relationships at large scale. We have previously demonstrated that online obituaries can be a novel data source for building more complete and accurate family trees.
Objective:
To supplement EHR data with family relationships for biomedical research, we build an end-to-end information extraction system that uses a multi-task artificial neural network model to construct genealogical knowledge graphs (GKGs) from online obituaries. GKGs are enriched family trees with detailed information including age, gender, death/birth dates, and residence.
Methods:
Building on a predefined family relationship map consisting of 4 types of entities (e.g., people’s names, residence, birth date, and death date) and 71 types of relationships, we curate a corpus containing 1,700 online obituaries from the metropolitan area of Minneapolis and St Paul in Minnesota. We also use data augmentation to generate additional synthetic data, alleviating data scarcity for rare family relationships. Then, a multi-task artificial neural network model is built to simultaneously detect the names, extract the relationships between them, and assign the attributes (e.g., birth and death dates, residence, age, and gender) to each individual. Finally, we assemble related GKGs into bigger ones by identifying people who appear in multiple obituaries.
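The final assembly step, merging GKGs that share a person, can be pictured with a small union-find sketch. This is only an illustration under stated assumptions, not the authors' implementation: each GKG is reduced to a set of person keys, and the key format (name plus death date), the function name `merge_graphs`, and the sample people are all hypothetical. How the actual system decides that two mentions refer to the same person is not described here.

```python
from collections import defaultdict


def merge_graphs(graphs):
    """Merge graphs (sets of person keys) that share at least one person.

    Assumption: two mentions denote the same person exactly when their
    keys are equal. The real matching rule is not given in the abstract.
    """
    parent = list(range(len(graphs)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Link every graph to the first graph in which each person was seen.
    first_seen = {}
    for idx, people in enumerate(graphs):
        for person in people:
            if person in first_seen:
                union(first_seen[person], idx)
            else:
                first_seen[person] = idx

    # Collect the people of each connected group of graphs.
    merged = defaultdict(set)
    for idx, people in enumerate(graphs):
        merged[find(idx)] |= people
    return list(merged.values())


# Hypothetical person keys: (name, death date or None if living).
graphs = [
    {("Ann Lee", "2019-03-02"), ("Bob Lee", None)},
    {("Bob Lee", None), ("Cara Lee", None)},
    {("Dan Roy", "2020-01-15")},
]
merged = merge_graphs(graphs)
```

In this toy input the first two graphs share one key and collapse into a single three-person graph, while the third stays separate.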
Results:
Our system achieves a precision of 94.79%, a recall of 91.45%, and an F-1 measure of 93.09% in 10-fold cross-validation. We also construct 12,407 GKGs, with the largest one made up of 4 generations and 30 people.
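As a quick consistency check, the reported F-1 measure can be recomputed from the reported precision and recall. The sketch assumes the F-1 measure is the harmonic mean of those two figures; the abstract does not say how the values were aggregated across the 10 folds, and the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the Results paragraph, in percent.
f1 = f1_score(94.79, 91.45)
print(f"{f1:.2f}")
```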
Conclusions:
In this work, we discussed the value of GKGs for biomedical research, presented a new version of the corpus with a predefined family relationship map and augmented training data, and proposed a multi-task deep neural system to construct and assemble GKGs. The results show that our system can extract GKGs from obituaries and demonstrate the potential of enriching EHR data for more genetic research. We share the source code and system with the scientific community on GitHub; the corpus is withheld for privacy protection.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.