JMIR Preprints #22397: Corpus-based analysis of general-purpose sentiment lexicons for suicide risk assessment in electronic health records


Corpus-based analysis of general-purpose sentiment lexicons for suicide risk assessment in electronic health records

André Bittar;
Sumithra Velupillai;
Angus Roberts;
Rina Dutta

ABSTRACT

Background:

Suicide is a serious public health issue, accounting for 1.4% of all deaths worldwide. Current risk assessment tools are reported to be little better than chance at predicting suicide. New methods that study dynamic features in electronic health records (EHRs) are increasingly being explored. One avenue of research uses sentiment analysis to examine clinicians’ subjective judgements when reporting on patients. Several recent studies have used general-purpose sentiment analysis tools to automatically identify negative and positive words within EHRs and to test correlations between the sentiment extracted from the texts and specific medical outcomes (e.g. risk of suicide or in-hospital mortality). However, little attention has been paid to analysing the specific words that general-purpose sentiment lexicons identify when applied to EHR corpora.

Objective:

In this study, we aimed to quantitatively and qualitatively evaluate the coverage of 6 general-purpose sentiment lexicons against a corpus of EHR texts in order to ascertain the extent to which such lexical resources are fit for use in suicide risk assessment.

Methods:

The data for this study was a corpus of EHR texts made up of two sub-corpora drawn from a case-control study comparing clinical notes written over the period leading up to a suicide attempt (cases) with those not preceding such an attempt (controls). We calculated word frequency distributions within each sub-corpus to identify representative keywords for both case and control sub-corpora. We quantified the relative coverage of the 6 lexicons with respect to this list of representative keywords in terms of weighted precision, recall and F-score.
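The two steps described above (keyword identification from sub-corpus frequency distributions, then weighted coverage scoring of a lexicon) can be illustrated with a minimal sketch. The abstract does not specify the keyness statistic or the weighting scheme, so both are assumptions here: keyness is approximated by a smoothed log-ratio of relative frequencies, and each word is weighted by its corpus frequency. The function names `keyness_scores` and `weighted_prf` are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def keyness_scores(case_tokens, control_tokens, min_count=1):
    """Score each word by a smoothed log2 ratio of its relative frequency
    in the case sub-corpus to that in the control sub-corpus.

    Assumption: the log-ratio is a stand-in for whatever keyness measure
    the study actually used. Positive scores indicate words more typical
    of cases, negative scores words more typical of controls.
    """
    case_counts = Counter(case_tokens)
    control_counts = Counter(control_tokens)
    n_case = sum(case_counts.values())
    n_control = sum(control_counts.values())

    scores = {}
    for word in set(case_counts) | set(control_counts):
        a = case_counts[word]
        b = control_counts[word]
        if a + b < min_count:
            continue
        # Simple additive smoothing so that a zero count in one
        # sub-corpus does not cause a division by zero.
        rel_case = (a + 1) / (n_case + 1)
        rel_control = (b + 1) / (n_control + 1)
        scores[word] = math.log2(rel_case / rel_control)
    return scores


def weighted_prf(lexicon, keywords, freq):
    """Frequency-weighted precision, recall and F-score of a lexicon
    against a set of representative keywords.

    Assumption: each word is weighted by its corpus frequency.
    - precision: share of the frequency mass of lexicon entries found
      in the corpus that falls on keywords
    - recall: share of the keyword frequency mass that the lexicon covers
    """
    found = {w for w in lexicon if freq.get(w, 0) > 0}
    hits = found & set(keywords)

    hit_mass = sum(freq[w] for w in hits)
    found_mass = sum(freq[w] for w in found)
    keyword_mass = sum(freq.get(w, 0) for w in keywords)

    precision = hit_mass / found_mass if found_mass else 0.0
    recall = hit_mass / keyword_mass if keyword_mass else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

Under these assumptions, a lexicon that matches only a few high-frequency keywords can still reach moderate precision while its recall stays low, because recall is bounded by the share of keyword mass the lexicon covers at all.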

Results:

The 6 lexicons achieved reasonable precision, but very low recall. Furthermore, many of the most representative keywords in the suicide-related (case) sub-corpus were not identified by any of the lexicons and the sentiment-bearing status of these keywords is debatable.

Conclusions:

Our findings indicate that these 6 lexicons are not optimal for use in suicide risk assessment. We propose a set of guidelines for the creation of more suitable lexical resources for distinguishing suicide-related from non-suicide-related EHR texts.