Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 23, 2024
Open Peer Review Period: Oct 23, 2024 - Dec 18, 2024
Date Accepted: Jan 22, 2025

The final, peer-reviewed published version of this preprint can be found here:

Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study


Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

  • Ryan K. McBain
  • Jonathan H. Cantor
  • Li Ang Zhang
  • Olesya Baker
  • Fang Zhang
  • Alyssa Halbisen
  • Aaron Kofner
  • Joshua Breslau
  • Bradley Stein
  • Ateev Mehrotra
  • Hao Yu

ABSTRACT

Background:

With suicide claiming more than 720,000 lives each year globally, large language models (LLMs) are being used more frequently to provide therapeutic guidance to individuals with suicidal ideation.

Objective:

The aim of this study was to assess the competency of three widely used LLMs to distinguish appropriate from inappropriate responses when engaging individuals who exhibit suicidal ideation.

Methods:

This cross-sectional study compared the responses of ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on the 48 items of the revised Suicidal Ideation Response Inventory (SIRI-2) with those of expert suicidologists. Respondents rate each of the 48 responses to hypothetical expressions of suicidal ideation on a scale from -3 (highly inappropriate) to +3 (highly appropriate). Linear regression was used to test whether mean scores differed between the LLMs and the suicidologists. LLM ratings were converted to z-scores relative to the suicidologists' mean responses to identify outliers (z-score > 1.96 or < -1.96; p<0.05). The LLMs' SIRI-2 scores were also compared with scores from health professionals in previous studies.
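To make the outlier procedure concrete, below is a minimal sketch in Python with entirely hypothetical item-level numbers (the study used the expert suicidologists' per-item means and standard deviations, which are not reported in the abstract). It also computes a SIRI-2 total score under one common scoring convention, the sum of absolute deviations from the expert means; that convention is an assumption here, not a detail taken from the abstract.

    import numpy as np

    rng = np.random.default_rng(seed=0)
    n_items = 48

    # Hypothetical gold standard: per-item expert means and SDs on the -3..+3 scale.
    expert_mean = rng.uniform(-3, 3, n_items)
    expert_sd = rng.uniform(0.5, 1.5, n_items)

    # Hypothetical LLM ratings of the same 48 responses.
    llm_rating = np.clip(expert_mean + rng.normal(0, 1.2, n_items), -3, 3)

    # z-score of each LLM rating against the expert distribution;
    # |z| > 1.96 corresponds to p < 0.05 (two-tailed) and flags an outlier.
    z = (llm_rating - expert_mean) / expert_sd
    outliers = np.abs(z) > 1.96
    print(f"outlier items: {outliers.sum()} of {n_items} ({100 * outliers.mean():.0f}%)")

    # Assumed SIRI-2 scoring convention (not stated in the abstract): the total
    # score is the sum of absolute deviations from the expert means, so lower is better.
    siri2_total = float(np.abs(llm_rating - expert_mean).sum())
    print(f"SIRI-2 total score: {siri2_total:.1f}")

With real item-level data, expert_mean, expert_sd, and llm_rating would simply be replaced by the observed values for each of the 48 SIRI-2 items.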

Results:

All three LLMs rated responses as more appropriate than expert suicidologists did (p<0.001). In terms of z-scores, 19% (9 of 48) of ChatGPT responses were outliers compared with expert suicidologists; 11% (5 of 48) of Claude responses were outliers; and 36% (17 of 48) of Gemini responses were outliers. ChatGPT produced a final SIRI-2 score (lower is better) of 45.7, roughly equivalent to that of master's-level counselors in prior studies. Claude produced a SIRI-2 score of 36.7, exceeding the prior performance of mental health professionals. Gemini produced a final SIRI-2 score of 54.5, equivalent to that of untrained K-12 school staff.

Conclusions:

Although all three LLMs demonstrated a leniency bias when evaluating appropriate responses to suicidal ideation, two of the three performed at a level equivalent to or exceeding that of health professionals in prior studies.


Citation

Please cite as:

McBain RK, Cantor JH, Zhang LA, Baker O, Zhang F, Halbisen A, Kofner A, Breslau J, Stein B, Mehrotra A, Yu H

Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study

J Med Internet Res 2025;27:e67891

DOI: 10.2196/67891

PMID: 40053817

PMCID: 11928068


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.