Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Dec 31, 2019
Date Accepted: May 5, 2020
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Automatic construction of a depression-domain lexicon based on microblogs
ABSTRACT
Background:
According to a 2017 WHO report, almost one in every 20 people in China has depression. Clinical diagnosis of depression, however, is usually difficult because observation is slow, costs are high, and patients may resist. Meanwhile, the rapid emergence of social media is changing the situation. People frequently share their daily lives and disclose their inner feelings online, which makes it possible to detect mental health conditions effectively from rich text information.
Objective:
Most studies so far, however, lack an efficient depression-domain lexicon, which often leads to poor results. To improve online depression detection, we aim to construct a depression-domain lexicon from microblogs we collected, using methods that allow the construction to be automatic.
Methods:
We automatically construct a depression-domain lexicon, usable for further detection, with Word2Vec, a semantic relationship graph, and the Label Propagation Algorithm (LPA). Combined, these methods cover both the prior knowledge base and the corpus base of the specific corpus during construction. The lexicon is built from 111,052 microblogs posted by 1,868 users, both depressed and non-depressed. To our knowledge, other studies offer no effective lexicon for this domain, which is the gap our construction method addresses.
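The pipeline described above (word vectors, a similarity graph, then label propagation from seed words) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional vectors stand in for Word2Vec embeddings, and the similarity threshold, the seed words, the label names, and the simple weighted-vote update are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_graph(vectors, threshold=0.8):
    """Link two words when their cosine similarity reaches the threshold (assumed value)."""
    words = list(vectors)
    graph = {w: {} for w in words}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph

def propagate_labels(graph, seeds, max_iter=20):
    """Seeded label propagation: each unlabeled word adopts the label with the
    largest similarity-weighted vote among its labeled neighbours.
    Seed labels never change."""
    labels = dict(seeds)
    for _ in range(max_iter):
        updated = dict(labels)
        for word, neighbours in graph.items():
            if word in seeds:
                continue
            votes = {}
            for other, weight in neighbours.items():
                if other in labels:
                    votes[labels[other]] = votes.get(labels[other], 0.0) + weight
            if votes:
                # sorted() makes tie-breaking deterministic
                updated[word] = max(sorted(votes), key=votes.get)
        if updated == labels:
            break
        labels = updated
    return labels

# Hypothetical toy embeddings; real ones would come from a trained Word2Vec model.
toy_vectors = {
    "sad": [0.9, 0.1],
    "hopeless": [0.85, 0.2],
    "insomnia": [0.8, 0.3],
    "happy": [0.1, 0.9],
    "picnic": [0.2, 0.85],
}
seeds = {"sad": "depressed", "happy": "neutral"}  # assumed seed lexicon

graph = build_graph(toy_vectors)
labels = propagate_labels(graph, seeds)
lexicon = sorted(w for w, lab in labels.items() if lab == "depressed")
print(lexicon)
```

In this sketch the words linked to the seed "sad" inherit its label and form the candidate lexicon, while words linked only to "happy" stay out of it.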
Results:
We also establish a well-labeled benchmark dataset of depressed and non-depressed users. Experimental results show that our auto-construction method achieves an F1 value 5% higher than the baselines and is more effective and more stable. When applied to detection models such as Naive Bayes, Logistic Regression, and Random Forest, our lexicon improves the models' performance by 3-8% and can improve the final accuracy of depression diagnosis in advanced detection.
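As a rough illustration of how a lexicon can feed a classifier and how an F1 value is computed, the sketch below derives one lexicon-based feature from a tokenized post and scores a set of predictions. The function names, the example tokens, and the labels are hypothetical; the source does not specify which features or evaluation code were used.

```python
def lexicon_hit_rate(tokens, lexicon):
    """Fraction of tokens that appear in the lexicon (0.0 for an empty post)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)

def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical lexicon and post, for illustration only.
example_lexicon = {"sad", "hopeless", "insomnia"}
example_post = ["i", "feel", "hopeless", "and", "sad"]
feature = lexicon_hit_rate(example_post, example_lexicon)

# Hypothetical gold labels and predictions (1 = depressed, 0 = non-depressed).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(feature, score)
```

A feature like this could be appended to the input of a model such as Naive Bayes or Logistic Regression, and the F1 value compared with and without it.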
Conclusions:
Many studies ignore the depression-domain words on social media, which can contribute greatly to diagnosis. Our lexicon proves to be a meaningful input for classification algorithms, providing insight into the depressive status of test subjects and thereby improving the final accuracy.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.