Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jan 22, 2021
Date Accepted: May 6, 2021

The final, peer-reviewed published version of this preprint can be found here:

Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis

Nam S, Peterson TA, Seo KY, Han HW, Kang JI

J Med Internet Res 2021;23(6):e27344

DOI: 10.2196/27344

PMID: 34184998

PMCID: 8277318

Discovery of Depression-Associated Factors from a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis

  • Sangmin Nam; 
  • Thomas A Peterson; 
  • Kyoung Yul Seo; 
  • Hyun Wook Han; 
  • Jee In Kang

ABSTRACT

Background:

In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large.

Objective:

Our study has two aims: first, to identify essential depression-associated factors by applying the XGBoost machine learning algorithm to big survey data, the Korea National Health and Nutrition Examination Survey (2012–2016); and second, to achieve a comprehensive understanding of the multifactorial features of depression using network analysis.

Methods:

An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” on a dataset of 120 variables and 12,596 cases. The XGBoost hyperparameters were optimized with an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection based on the XGBoost feature importance values. We tested the model and non-model factors using survey-weighted multiple logistic regression and drew a correlation network among the factors. Where the network suggested a confounding or interaction effect among selected risk factors, we also tested for it statistically.
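The correlation network step can be illustrated with a minimal sketch. It assumes plain Pearson correlation and a fixed edge cutoff; the abstract does not state which correlation measure or threshold the authors used, so `pearson`, `correlation_network`, the `threshold` parameter, and the toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Unweighted Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant factor gets no edges
    return cov / sqrt(var_x * var_y)


def correlation_network(factors, threshold=0.3):
    """Return {(a, b): r} for every factor pair with |r| >= threshold.

    `factors` maps a factor name to its list of observed values.
    `threshold` is an illustrative cutoff, not a value from the study.
    """
    edges = {}
    for a, b in combinations(sorted(factors), 2):
        r = pearson(factors[a], factors[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Toy values invented for illustration; they are not survey data.
toy = {
    "education": [1, 2, 3, 4, 5, 6],
    "income": [2, 2, 3, 5, 5, 6],
    "stress": [5, 6, 3, 4, 2, 1],
    "noise": [3, 1, 2, 3, 1, 2],
}
network = correlation_network(toy, threshold=0.5)
for (a, b), r in sorted(network.items()):
    print(f"{a} -- {b}: r = {r:+.2f}")
```

In a network built this way, densely connected groups of factors show up as clusters, and a factor linked to two others that are themselves linked is a natural candidate for a follow-up confounding or interaction test.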

Results:

The XGBoost-derived depression model consisted of 18 factors and achieved an area under the weighted receiver operating characteristic curve of 0.86. Two non-model factors were found using the model factors, and the factors were classified as direct (P < 0.05) or indirect (P ≥ 0.05) according to the statistical significance of their association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality-of-life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, smoking) were involved in confounding or interaction effects of direct factors: triglyceride level was a confounder of hypercholesterolemia and diabetes; smoking was a significant risk factor in females; and weight gain occurred in depression with diabetes.
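For readers unfamiliar with a weighted area under the ROC curve, the following is a minimal sketch. It assumes that "weighted" means each case carries a sample (survey) weight, which the abstract does not spell out; the function name `weighted_auc` and the toy numbers are hypothetical. It uses the pairwise formulation, in which the AUC is the weighted share of positive-negative pairs where the positive case receives the higher score.

```python
def weighted_auc(labels, scores, weights):
    """Weighted AUC via pairwise comparison.

    Each positive-negative pair counts with weight w_pos * w_neg;
    tied scores count as half a concordant pair.
    """
    pos = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 1]
    neg = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for s_pos, w_pos in pos:
        for s_neg, w_neg in neg:
            pair_weight = w_pos * w_neg
            if s_pos > s_neg:
                concordant += pair_weight
            elif s_pos == s_neg:
                concordant += 0.5 * pair_weight
    total = sum(w for _, w in pos) * sum(w for _, w in neg)
    return concordant / total


# Toy values invented for illustration; they are not survey data.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
weights = [1, 3, 2, 1, 1]

auc_weighted = weighted_auc(labels, scores, weights)
auc_unweighted = weighted_auc(labels, scores, [1] * len(labels))
print(f"weighted AUC:   {auc_weighted:.3f}")
print(f"unweighted AUC: {auc_unweighted:.3f}")
```

The double loop is quadratic in the number of cases, which is fine for a demonstration; a production computation on a dataset the size of the survey would normally sort the scores once instead.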

Conclusions:

XGBoost and network analysis were useful for discovering depression-related factors and their relationships, and they can be applied to epidemiologic studies using big survey data.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.