Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 22, 2021
Date Accepted: May 6, 2021
Discovery of Depression-Associated Factors from a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis
ABSTRACT
Background:
In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large.
Objective:
Our study has two aims: first, to identify essential depression-associated factors by applying the XGBoost machine learning algorithm to large-scale survey data, the Korea National Health and Nutrition Examination Survey, 2012–2016; second, to achieve a comprehensive understanding of the multifactorial features of depression using network analysis.
Methods:
An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” on a dataset of 120 variables and 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance values of XGBoost. We performed statistical tests on the model and non-model factors using survey-weighted multiple logistic regression and drew a correlation network among the factors. We also tested for confounding or interaction effects of selected risk factors where the network suggested them.
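The importance-based feature selection described above can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes that features are ranked by importance, that nested top-k subsets are scored, and that the smallest subset scoring within a tolerance of the best is kept. The names `select_sparse_subset`, `score_fn`, and `tolerance` are hypothetical, and the importances and scores are toy values, not results from the study.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def select_sparse_subset(
    importances: Dict[str, float],
    score_fn: Callable[[Sequence[str]], float],
    tolerance: float = 0.01,
) -> Tuple[List[str], float]:
    """Return the smallest top-k subset scoring within `tolerance` of the best."""
    if not importances:
        raise ValueError("importances must not be empty")
    # Rank features from most to least important.
    ranked = sorted(importances, key=importances.get, reverse=True)
    # Score every nested subset: top-1, top-2, ..., all features.
    scores = [score_fn(ranked[:k]) for k in range(1, len(ranked) + 1)]
    best = max(scores)
    # Keep the first (smallest) subset that is close enough to the best score.
    for k, score in enumerate(scores, start=1):
        if score >= best - tolerance:
            return ranked[:k], score
    raise RuntimeError("unreachable: the best-scoring subset always qualifies")


# Toy, made-up inputs for illustration only (assumption, not study data).
toy_importances = {"feat_a": 0.5, "feat_b": 0.3, "feat_c": 0.15, "feat_d": 0.05}
informative = {"feat_a", "feat_b"}


def toy_score(features: Sequence[str]) -> float:
    # Stand-in for a validation metric: only informative features raise it.
    return 0.6 + 0.12 * len(informative.intersection(features))


subset, score = select_sparse_subset(toy_importances, toy_score)
print(subset, round(score, 2))
```

In a real analysis, `importances` could come from a fitted XGBoost model's feature importance values, and `score_fn` could refit the model on the given features and return a cross-validated AUC. Both are left abstract here.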
Results:
The XGBoost-derived depression model consisted of 18 factors, with an area under the weighted receiver operating characteristic curve of 0.86. Two non-model factors were identified through the model factors, and the factors were classified as direct (P < 0.05) or indirect (P >= 0.05) according to the statistical significance of their association with depression. Perceived stress and asthma were the most prominent risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality-of-life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, smoking) were involved in confounding or interaction effects of direct factors: triglyceride level was a confounder of hypercholesterolemia and diabetes; smoking was a significant risk factor in females; weight gain occurred in depression with diabetes.
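To make the weighted area under the ROC curve concrete, the sketch below computes it as a weighted concordance probability: the weighted share of (case, non-case) pairs in which the case receives the higher score, with ties counted as half. It assumes that "weighted" refers to per-respondent weights such as survey weights; the abstract does not specify the weighting scheme. The function name `weighted_auc` and the toy inputs are hypothetical.

```python
from typing import Sequence


def weighted_auc(
    labels: Sequence[int],
    scores: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted AUC as the weighted probability that a positive outranks a negative."""
    pos = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 1]
    neg = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for s_pos, w_pos in pos:
        for s_neg, w_neg in neg:
            if s_pos > s_neg:
                concordant += w_pos * w_neg        # correctly ordered pair
            elif s_pos == s_neg:
                concordant += 0.5 * w_pos * w_neg  # tie counts as half
    total = sum(w for _, w in pos) * sum(w for _, w in neg)
    return concordant / total


# Toy inputs for illustration only (assumption, not study data).
y = [1, 1, 0, 0]
p = [0.9, 0.4, 0.5, 0.1]
print(weighted_auc(y, p, [1, 1, 1, 1]))  # unweighted case
print(weighted_auc(y, p, [1, 3, 1, 1]))  # second respondent weighted 3x
```

The pairwise loop is quadratic in the number of cases and is meant only to show the definition; a sort-based implementation would be preferable for a dataset of the size reported here.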
Conclusions:
XGBoost and network analysis were useful for discovering depression-related factors and their relationships and can be applied to epidemiologic studies using large-scale survey data.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.