Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 2, 2021
Date Accepted: May 6, 2021
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Constructing high-fidelity phenotype knowledge graphs with a fine-grained semantic information model
ABSTRACT
Background:
Phenotypes characterize the clinical manifestations of diseases and provide important information for diagnosis. Constructing phenotype knowledge graphs of diseases is therefore valuable for the development of artificial intelligence in medicine. However, the phenotype knowledge graphs in current knowledge bases such as WikiData and DBpedia are coarse-grained, because they consider only the core concepts of phenotypes and neglect the details (attributes) associated with them.
Objective:
To characterize details of disease phenotypes in clinical guidelines, we proposed a fine-grained semantic information model named PhenoSSU (Semantic Structured Unit of Phenotypes).
Methods:
PhenoSSU is by nature an "entity-attribute-value" model that aims to capture the full semantics underlying phenotype descriptions with a series of attributes and values. A total of 193 clinical guidelines for infectious diseases from Wikipedia were selected as the study corpus, and 12 attributes from SNOMED-CT were introduced into the PhenoSSU model based on the co-occurrences of phenotype concepts and attribute values. The expressive power of the PhenoSSU model was evaluated by analyzing whether a PhenoSSU instance could capture the full semantics underlying the corresponding phenotype description. To automatically construct fine-grained phenotype knowledge graphs, a hybrid strategy was developed that first recognized phenotype concepts with the MetaMap tool and then predicted the attribute values of phenotypes with machine learning classifiers.
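As a rough illustration of the "entity-attribute-value" idea described above, the following minimal Python sketch represents one phenotype concept together with its attribute-value pairs and flattens it into triples. The class name `PhenoSSU`, the method `to_triples`, and the attribute names `severity` and `frequency` are assumptions made for illustration only; the abstract does not list the 12 SNOMED-CT attributes, and this is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PhenoSSU:
    """One phenotype concept plus its attribute-value pairs (illustrative only)."""

    concept: str  # core phenotype concept, e.g. "fever"
    # Attribute names here are hypothetical; the real model draws 12 from SNOMED-CT.
    attributes: Dict[str, str] = field(default_factory=dict)

    def to_triples(self) -> List[Tuple[str, str, str]]:
        """Flatten to (entity, attribute, value) rows, sorted by attribute name."""
        return [
            (self.concept, name, value)
            for name, value in sorted(self.attributes.items())
        ]


# Hypothetical instance for a description such as "severe intermittent fever".
example = PhenoSSU(
    concept="fever",
    attributes={"severity": "severe", "frequency": "intermittent"},
)
triples = example.to_triples()
```

A coarse-grained graph would keep only the concept (`"fever"`), whereas the triples above also retain the details attached to it, which is the distinction the model is meant to capture.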
Results:
Fine-grained phenotype knowledge graphs of 193 infectious diseases were manually constructed with the BRAT annotation tool. It was found that the PhenoSSU model could precisely represent 89.5% (3757/4020) of phenotype descriptions in clinical guidelines. By comparison, other information models such as the Clinical Element Model and the HL7 FHIR model could only capture full semantics underlying 48.4% and 21.8% of phenotype descriptions, respectively. The hybrid strategy achieved an F1-score of 0.732 for the subtask of phenotype concept recognition and an average weighted accuracy of 0.776 for the subtask of attribute value prediction.
Conclusions:
PhenoSSU is an effective information model for the precise representation of phenotype knowledge in clinical guidelines, and machine learning can be used to improve the efficiency of constructing PhenoSSU-based knowledge graphs. Our work will potentially benefit knowledge-based systems for diagnosis.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.