Accepted for/Published in: JMIR Data

Date Submitted: Oct 10, 2023
Date Accepted: Jul 2, 2024

The final, peer-reviewed published version of this preprint can be found here:

Development of Depression Data Sets and a Language Model for Depression Detection: Mixed Methods Study

Tumaliuan FBC, Grepo-Jalao L, Jalao ER

JMIR Data 2024;5:e53365

DOI: 10.2196/53365

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Development of Philippine depression datasets and language model for depression detection

  • Faye Beatriz Cabiao Tumaliuan
  • Lorelie Grepo-Jalao
  • Eugene Rex Jalao

ABSTRACT

Background:

Depression detection in social media has gained attention in recent years with the help of Natural Language Processing (NLP) techniques.

Objective:

To develop solutions to identify depression patterns through NLP and machine learning, valid datasets need to be constructed.

Methods:

The proposed process applied clinical screening methods, with the help of clinical psychologists, to recruit study participants. A total of 76 participants were assessed by clinical psychologists and provided their Twitter data: 61 with depression and 15 without depression. A dataset was developed consisting of tweets annotated with depression symptoms across 13 depression categories. The annotations were created manually in a process constructed, guided, and validated by clinical psychologists.

Results:

Three annotators completed the process for a total of 86,163 tweets, resulting in a substantial inter-annotator agreement score of 0.736 (Fleiss' kappa) and a 95.71% psychologist validation score. A word2vec language model was developed using Filipino and English datasets to create a 300-feature word embedding that can be used in various machine learning techniques for NLP.
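
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of the standard Fleiss' kappa formula for a fixed number of annotators per item. It is not the authors' code: the function name, the toy tweets, and the three category labels are hypothetical, and it assumes each annotator assigns exactly one label per tweet.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings    -- list of items; each item is a list with one label per rater
    categories -- list of all allowed labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who gave item i the label categories[j]
    counts = []
    for item in ratings:
        tally = Counter(item)
        row = [tally[c] for c in categories]
        if len(item) != n_raters or sum(row) != n_raters:
            raise ValueError("each item needs one known label per rater")
        counts.append(row)

    # Observed agreement: mean over items of the share of agreeing rater pairs
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        return 1.0  # every rating fell in one category; agreement is trivially perfect
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: 4 tweets, 3 annotators, 3 made-up labels
labels = ["sad_mood", "sleep", "none"]
toy_ratings = [
    ["sad_mood", "sad_mood", "sad_mood"],
    ["sad_mood", "sad_mood", "none"],
    ["sleep", "sleep", "sleep"],
    ["none", "none", "none"],
]
print(round(fleiss_kappa(toy_ratings, labels), 3))  # 0.745
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement, which is the band the reported 0.736 falls in.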

Conclusions:

This study contributes to depression research by constructing depression datasets from social media to aid NLP in the Philippine setting.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.