Accepted for/Published in: JMIR Formative Research

Date Submitted: Dec 9, 2021
Date Accepted: Jul 21, 2022

The final, peer-reviewed published version of this preprint can be found here:

Lao C, Lane J, Suominen H

Analyzing Suicide Risk From Linguistic Features in Social Media: Evaluation Study

JMIR Form Res 2022;6(8):e35563

DOI: 10.2196/35563

PMID: 36040781

PMCID: 9472054

Analysing Suicide Risk from Linguistic Features in Social Media: An Evaluation Study

  • Cecilia Lao
  • Jo Lane
  • Hanna Suominen

ABSTRACT

Background:

Effective suicide risk assessment and intervention are vital for suicide prevention. Although such risk is best assessed by health care professionals, people experiencing suicidal ideation may not seek help. Machine learning (ML) can therefore provide another tool for risk detection, supporting clinical practice and prevention.

Objective:

This study explored, using statistical analyses and ML, whether computerized linguistic analysis could be applied to assess a person’s suicide risk on social media (Reddit).

Methods:

We used the University of Maryland Suicidality dataset, consisting of text posts written by users (n=866) of mental health-related forums on Reddit. Each user was assigned a suicide risk rating (no, low, moderate, or severe risk) by either medical experts or crowdsourced annotators, denoting their estimated likelihood of dying by suicide. Linguistic Inquiry and Word Count (LIWC) and the TextStat Python package were used for linguistic analysis: LIWC targeted sentiment, thinking styles, and parts of speech, whereas TextStat measured readability. The Mann-Whitney U test was used to assess differences between at-risk and no-risk users, and the Kruskal-Wallis test was used for more granular analysis across risk levels. Furthermore, we identified redundancy among variables through Spearman correlation analysis. Gradient boosting, random forest, and support vector machine (SVM) models were trained using 10-fold cross-validation. Evaluation was done on a hold-out test set, with the area under the receiver operating characteristic curve (AUC) as the primary measure.
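As an illustration of the statistical tests named above, the following is a minimal sketch rather than the authors' code. It assumes SciPy is available and uses synthetic numbers in place of LIWC or TextStat scores; the helper names `compare_groups` and `redundant_pairs`, the group sizes, and the 0.7 threshold are hypothetical choices made for this example.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu, spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-in for ONE linguistic variable, grouped by risk rating.
# Sizes and distributions are invented for illustration only.
feature_by_risk = {
    "no": rng.normal(50, 10, 60),
    "low": rng.normal(55, 10, 40),
    "moderate": rng.normal(56, 10, 40),
    "severe": rng.normal(57, 10, 40),
}


def compare_groups(by_risk):
    """Return (p_binary, p_levels) for one variable.

    p_binary: Mann-Whitney U test, at-risk (low/moderate/severe) vs no-risk.
    p_levels: Kruskal-Wallis test across all four risk levels.
    """
    no_risk = by_risk["no"]
    at_risk = np.concatenate([by_risk[k] for k in ("low", "moderate", "severe")])
    _, p_binary = mannwhitneyu(at_risk, no_risk, alternative="two-sided")
    _, p_levels = kruskal(*by_risk.values())
    return p_binary, p_levels


def redundant_pairs(features, threshold=0.7):
    """List variable pairs whose absolute Spearman rho exceeds the threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho, _ = spearmanr(features[a], features[b])
            if abs(rho) > threshold:
                pairs.append((a, b, round(float(rho), 2)))
    return pairs


p_binary, p_levels = compare_groups(feature_by_risk)
print(f"Mann-Whitney U p={p_binary:.4f}, Kruskal-Wallis p={p_levels:.4f}")
```

Both tests are rank based, so they make no normality assumption about the variables. In a real analysis `compare_groups` would be run once per variable, which raises a multiple-comparison question that this sketch does not address.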

Results:

Statistically significant differences (P<.05) were identified between the at-risk (low, moderate, and severe risk; n=671) and no-risk (n=195) groups in both the crowd-annotated and expert-annotated samples. Overall, at-risk users had higher median values across variables. A notable exception was “clout,” indicating that at-risk users were less likely to engage in social posturing. “Authenticity” increased linearly with risk severity. However, within the at-risk group, there were few observable differences in the distributions of variables. Positive correlations were present between readability metrics (ρ>0.7), parts-of-speech variables (ρ>0.5), and length. These correlations implied redundancy and demonstrated the utility of aggregate features. Finally, the ML models showed potential for application: all models performed similarly, with AUCs from 0.66 to 0.68.
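The evaluation protocol behind AUC figures of this kind can be sketched as follows. This is an assumption-laden illustration, not the study's pipeline: it assumes scikit-learn, a binary at-risk versus no-risk label, synthetic features from `make_classification`, default hyperparameters, an 80/20 split, and feature scaling for the SVM. The abstract does not state how the 10-fold cross-validation was used, so here it simply estimates AUC on the training portion before the hold-out check.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a user-by-feature matrix with an imbalanced binary
# label (1 = at-risk, 0 = no-risk). All values are invented for illustration.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.25, 0.75], random_state=0,
)

# Hold out a test set that is never seen during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVMs are scale sensitive, so the features are standardized first.
    "svm": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Mean AUC over 10 stratified folds of the training portion.
    cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc").mean()
    # Refit on the full training portion, then score the hold-out set.
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = (cv_auc, test_auc)
    print(f"{name}: CV AUC={cv_auc:.2f}, hold-out AUC={test_auc:.2f}")
```

The AUC values printed here reflect only the synthetic data and say nothing about the study's results. Stratified splits are used so that each fold keeps roughly the same at-risk to no-risk ratio as the full sample.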

Conclusions:

In summary, our statistical analyses found linguistic features associated with suicide risk, such as social posturing (eg, authenticity), sentiment (eg, emotional tone), and thinking styles (eg, discrepancy). These findings increased understanding of social media users’ behavioral patterns and mindsets, as well as of the mechanisms behind ML models. Moreover, the models’ high performance demonstrated their potential to assist health care professionals in assessing and managing individuals experiencing suicide risk.


Per the author's request the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.