Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 11, 2023
Date Accepted: Aug 25, 2023
Date Submitted to PubMed: Aug 26, 2023
Large-scale biomedical relation extraction across diverse types: model development and usability study on COVID-19
ABSTRACT
Background:
The relations between biomedical entities are complex and diverse. Biomedical relation extraction (RE) supports downstream tasks such as the automatic construction of knowledge graphs (KGs), which in turn serve the needs of knowledge discovery in the biomedical field.
Objective:
Model exploration and scenario application on large-scale data with complex relation categories remain underinvestigated, even though such work is of practical value for heavily studied topics with enormous amounts of literature, such as COVID-19. This paper aims to streamline and improve literature analysis through large-scale RE in order to optimize knowledge mining.
Methods:
We constructed datasets containing entity semantic information at different levels, based on a large-scale RE dataset and the Unified Medical Language System (UMLS), to evaluate the effect of entity information on RE. We then analyzed the performance of different model architectures and domain-specific models, and we proposed continued pre-training strategies and ensemble modeling to obtain the best RE performance and to provide functional RE tools. Finally, we applied RE to a COVID-19 corpus in several use cases to assess the applicability of our approach.
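As an illustration of the ensemble step, the following is a minimal sketch that assumes the ensemble combines single-model predictions by majority vote; the abstract does not state the combination rule, so this is one possible reading rather than the authors' method. The relation labels, the `no_relation` negative class, and the function names `majority_vote` and `micro_f1` are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one ensemble label list.

    Assumption: ties are broken in favor of the earliest model in the list.
    """
    ensemble = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote (in model order) that reaches the top count.
        winner = next(label for label in votes if counts[label] == top)
        ensemble.append(winner)
    return ensemble


def micro_f1(gold, predicted, negative_label="no_relation"):
    """Micro-averaged F1 over all relation labels except the negative class."""
    tp = sum(1 for g, p in zip(gold, predicted)
             if g == p and g != negative_label)
    pred_pos = sum(1 for p in predicted if p != negative_label)
    gold_pos = sum(1 for g in gold if g != negative_label)
    if tp == 0:
        return 0.0
    precision = tp / pred_pos
    recall = tp / gold_pos
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions from three single models on four entity pairs.
model_a = ["treats", "causes", "no_relation", "treats"]
model_b = ["treats", "no_relation", "no_relation", "causes"]
model_c = ["causes", "causes", "treats", "treats"]
gold = ["treats", "causes", "no_relation", "causes"]

ensemble = majority_vote([model_a, model_b, model_c])
score = micro_f1(gold, ensemble)
```

Majority voting is only one way to build an ensemble; averaging class probabilities across models is a common alternative when the models expose scores.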
Results:
The performance analysis revealed that RE achieves the best performance when the detailed semantic types of entities are provided. Among single models, PubMedBERT with our continued pre-training strategy performed the best, with an F1 score of 0.8998, while the ensemble model outperformed all single models with an average F1 score of 0.9002. The COVID-19 use cases demonstrated the biological significance of RE: our model constructed a KG that revealed several novel drug paths. We also retrieved drug sets for non-long COVID and long COVID separately and constructed relational triples between coronavirus-specific entities based on the RE results.
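To make the notion of "drug paths" in a KG concrete, the sketch below builds a directed graph from relational triples and enumerates paths from a drug to a disease by breadth-first search. It is an illustrative assumption about how such paths could be found, not the procedure used in the study; the entity names are placeholders, and `build_graph`, `find_paths`, and `max_hops` are hypothetical names.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Index (head, relation, tail) triples by head entity."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def find_paths(graph, source, target, max_hops=3):
    """Return all cycle-free paths from source to target, shortest first.

    Each path is a list of (head, relation, tail) triples with at most
    max_hops triples.
    """
    paths = []
    queue = deque([(source, [])])
    while queue:
        node, path = queue.popleft()
        if node == target and path:
            paths.append(path)
            continue
        if len(path) >= max_hops:
            continue
        # Entities already on this path; skipping them avoids cycles.
        visited = {source} | {tail for _, _, tail in path}
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return paths


# Placeholder triples, standing in for RE output over a corpus.
triples = [
    ("drug_A", "inhibits", "protein_X"),
    ("protein_X", "associated_with", "disease_Y"),
    ("drug_A", "treats", "disease_Y"),
    ("drug_B", "inhibits", "protein_X"),
]

graph = build_graph(triples)
paths = find_paths(graph, "drug_A", "disease_Y")
```

In this toy graph, `drug_B` reaches `disease_Y` only through `protein_X`; indirect paths of this kind are the sort of candidate link that a KG built from extracted triples can surface for further review.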
Conclusions:
We developed optimized RE models for diverse relation types based on the performance analysis. Our RE application provides a proof-of-concept demonstration of how large-scale literature mining can be leveraged to facilitate novel scientific research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.