Accepted for/Published in: JMIR Formative Research
Date Submitted: Jul 18, 2023
Date Accepted: Oct 27, 2023
Potential Schizophrenia Disease-related Genes Prediction Using Metagraph Representations Based on PPI-Keyword Network: Framework Development and Validation
ABSTRACT
Background:
Schizophrenia is a serious mental disease. With increased research funding, it has become a key focus in the medical field. Identifying associations between diseases and genes is an effective way to study this complex disease, and it may advance research on schizophrenia pathology and reveal targets for treatment.
Objective:
This study aims to identify potential schizophrenia risk genes by using machine learning methods to extract the topological characteristics of proteins and their functional roles in a protein-protein interaction (PPI)-keyword (PPIK) network, and thereby to better understand the disease-causing properties of this complex disease. To this end, we propose a PPIK-based metagraph representation approach.
Methods:
To enrich the PPI network, we integrated keywords describing protein properties into it and constructed a PPIK network. We extracted features that describe the topology of this network through metagraphs. We then transformed these metagraphs into vectors and represented each protein with a series of vectors. Finally, we trained and optimized our model using random forest (RF), XGBoost, and LightGBM.
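The feature-extraction step can be illustrated with a minimal sketch. It assumes a toy PPIK network and three hypothetical metagraph patterns (a protein-protein edge, a protein-keyword-protein path, and an interacting protein pair that shares a keyword); the abstract does not list the patterns actually used, so the pattern set, the function names, and the toy data below are illustrative assumptions only. The resulting count vectors are the kind of input a classifier such as RF, XGBoost, or LightGBM could be trained on.

```python
from collections import defaultdict


def build_ppik(ppi_edges, protein_keywords):
    """Build lookup tables for a toy PPIK network (hypothetical helper).

    Returns protein->interacting proteins, protein->keywords,
    and keyword->proteins.
    """
    pp = defaultdict(set)
    for a, b in ppi_edges:
        pp[a].add(b)
        pp[b].add(a)
    pk = {p: set(ks) for p, ks in protein_keywords.items()}
    kp = defaultdict(set)
    for p, ks in pk.items():
        for k in ks:
            kp[k].add(p)
    return pp, pk, kp


def metagraph_features(protein, pp, pk, kp):
    """Count instances of three assumed metagraph patterns around a protein."""
    neighbours = pp.get(protein, set())
    keywords = pk.get(protein, set())
    # Pattern 1 (P-P): direct interaction partners.
    m1 = len(neighbours)
    # Pattern 2 (P-K-P): other proteins reachable through a shared keyword.
    m2 = sum(len(kp[k] - {protein}) for k in keywords)
    # Pattern 3 (P-P plus shared K): interacting partners that also share
    # a keyword, counted once per shared keyword.
    m3 = sum(len(keywords & pk.get(q, set())) for q in neighbours)
    return [m1, m2, m3]


# Toy data, not taken from the paper.
ppi_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
protein_keywords = {
    "P1": ["kinase", "membrane"],
    "P2": ["kinase"],
    "P3": ["membrane", "kinase"],
    "P4": ["nucleus"],
}

pp, pk, kp = build_ppik(ppi_edges, protein_keywords)
features = {p: metagraph_features(p, pp, pk, kp) for p in protein_keywords}
print(features)
```

Counting pattern instances per protein is only one simple way to turn metagraphs into vectors; the full method may use a different encoding.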
Results:
Comprehensive experiments demonstrated the good performance of our proposed method, with an area under the receiver operating characteristic curve (AUC) between 0.7 and 0.9. It also outperformed the baselines, including random walk with restart (RWR), average commute time (ACT), and Katz, for overall disease protein prediction. After keywords were added to the PPI network, the baselines improved by 8.3% in AUC on average compared with the plain PPI network, and our experiment on the PPIK network showed that the metagraph-based method in turn improved by 8.3% in AUC on average compared with the baselines. Based on the overall performance of the three models, we chose the best one, LightGBM, for disease protein prediction; its precision, recall, F1 score, and AUC were 0.528, 0.727, 0.704, and 0.856, respectively. We then mapped the predicted proteins to the IDs of their encoding genes and identified the top 20 genes as the most probable schizophrenia risk genes, including EYA3, CNTN4, HSPA8, LRRK2, and AFP. We further validated these results through analysis of the metagraph features and used evidence from the literature to interpret the associations between the predicted genes and the disease.
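For readers unfamiliar with the RWR baseline named above, the following is a minimal sketch of random walk with restart on an undirected graph, written in pure Python. The restart probability of 0.3, the convergence tolerance, and the toy path graph are assumptions chosen for illustration; the abstract does not report the baseline settings.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    adj   : dict mapping each node to the set of its neighbours
    seeds : known disease proteins; the walker restarts uniformly among them
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Isolated node: the remaining mass stays where it is.
                nxt[u] += (1 - restart) * p[u]
                continue
            # Otherwise the walker moves to a uniformly chosen neighbour.
            share = (1 - restart) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        diff = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Toy path graph A - B - C - D with A as the only seed (not data from the paper).
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = random_walk_with_restart(adj, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Candidate proteins are then ranked by score, so nodes closer to the seed set in the network rank higher.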
Conclusions:
The metagraph representation framework based on the PPIK network proved effective in identifying potential schizophrenia risk genes, and the predictions are supported by evidence in the literature. Our approach can provide more biological insights into the pathogenesis of schizophrenia.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.