JMIR Preprints #67891: Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study


Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

Ryan K. McBain;
Jonathan H. Cantor;
Li Ang Zhang;
Olesya Baker;
Fang Zhang;
Alyssa Halbisen;
Aaron Kofner;
Joshua Breslau;
Bradley Stein;
Ateev Mehrotra;
Hao Yu

ABSTRACT

Background:

With suicide claiming more than 720,000 lives each year globally, large language models (LLMs) are being used more frequently to provide therapeutic guidance to individuals with suicidal ideation.

Objective:

This study assessed how well three widely used LLMs distinguish appropriate from inappropriate responses when engaging individuals who exhibit suicidal ideation.

Methods:

This cross-sectional study compared the ratings that ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro gave to 48 items on the revised Suicidal Ideation Response Inventory (SIRI-2) with ratings from expert suicidologists. On the SIRI-2, respondents rate each of the 48 responses to hypothetical scenarios of expressed suicidal ideation on a scale from -3 (highly inappropriate) to +3 (highly appropriate). Linear regression was used to test whether mean scores differed between LLMs and suicidologists. LLM ratings were converted to z-scores relative to the suicidologists’ mean ratings, and ratings with a z-score > 1.96 or < -1.96 (p<0.05) were flagged as outliers. SIRI-2 scores from LLMs were also compared to scores from health professionals in previous studies.
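The outlier rule described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes each z-score is computed as the LLM's rating minus the expert mean for that item, divided by the expert standard deviation for that item, which the abstract does not state explicitly. The function name `flag_outliers` and all numeric values are hypothetical.

```python
def flag_outliers(llm_ratings, expert_means, expert_sds, threshold=1.96):
    """Return (z, is_outlier) per item.

    Assumption (not confirmed by the source): z is the LLM rating's distance
    from the expert mean for that item, in units of the expert SD.
    """
    results = []
    for rating, mean, sd in zip(llm_ratings, expert_means, expert_sds):
        z = (rating - mean) / sd
        # Two-sided rule: |z| > 1.96 corresponds to p < 0.05.
        results.append((z, abs(z) > threshold))
    return results


# Hypothetical ratings for three items on the -3 to +3 scale.
example = flag_outliers(
    llm_ratings=[3, -2, 1],
    expert_means=[1.0, -2.5, 0.5],
    expert_sds=[0.8, 0.5, 1.0],
)
outlier_count = sum(is_outlier for _, is_outlier in example)
```

In this made-up example only the first item is flagged, because the LLM's rating sits 2.5 expert standard deviations above the expert mean.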

Results:

All three LLMs rated responses as more appropriate than expert suicidologists did (p<0.001). In terms of z-scores, 19% (9 of 48) of ChatGPT responses were outliers relative to expert suicidologists, as were 11% (5 of 48) of Claude responses and 36% (17 of 48) of Gemini responses. ChatGPT produced a final SIRI-2 score (lower is better) of 45.7, roughly equivalent to master’s level counsellors in prior studies. Claude produced a SIRI-2 score of 36.7, outperforming mental health professionals in prior studies. Gemini produced a final SIRI-2 score of 54.5, equivalent to untrained K-12 school staff.

Conclusions:

Although all three LLMs showed a leniency bias when evaluating responses to suicidal ideation, two of the three performed as well as or better than health professionals in prior studies.

Citation

Please cite as:

McBain RK, Cantor JH, Zhang LA, Baker O, Zhang F, Halbisen A, Kofner A, Breslau J, Stein B, Mehrotra A, Yu H

Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

J Med Internet Res 2025;27:e67891

DOI: 10.2196/67891

PMID: 40053817

PMCID: 11928068


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 23, 2024

Open Peer Review Period: Oct 23, 2024 - Dec 18, 2024

Date Accepted: Jan 22, 2025

