Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Oct 23, 2024
Open Peer Review Period: Oct 23, 2024 - Dec 18, 2024
Date Accepted: Jan 22, 2025
Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study
ABSTRACT
Background:
With suicide claiming more than 720,000 lives each year globally, large language models (LLMs) are being used more frequently to provide therapeutic guidance to individuals with suicidal ideation.
Objective:
The aim of this study was to assess the competency of three widely used LLMs to distinguish appropriate from inappropriate responses when engaging individuals who exhibit suicidal ideation.
Methods:
This cross-sectional study assessed ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro responses to 48 items on the revised Suicidal Ideation Response Inventory (SIRI-2), compared to expert suicidologists’ responses. Participants rate responses (n=48) to hypothetical scenarios of expressed suicidal ideation on a scale from -3 (highly inappropriate) to +3 (highly appropriate). Linear regression analyzed whether mean scores differed between LLMs and suicidologists. LLM responses were converted to z-scores to identify outliers (z-score > 1.96 or < -1.96, p<0.05), compared to mean responses from suicidologists. SIRI-2 scores from LLMs were also compared to scores from health professionals in previous studies.
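The outlier rule described above can be illustrated with a minimal sketch. It assumes that each LLM rating is standardized against the expert panel's mean and standard deviation for the same item; the abstract does not state which standard deviation was used, so that choice, the function name, and the sample numbers below are all hypothetical and are not taken from the study.

```python
from typing import List

# Two-sided critical value corresponding to p < 0.05, as stated in the Methods.
Z_CRIT = 1.96


def flag_outliers(
    llm_ratings: List[float],
    expert_means: List[float],
    expert_sds: List[float],
) -> List[int]:
    """Return the indices of items whose LLM rating is an outlier.

    Assumption: z = (LLM rating - expert mean) / expert SD, per item.
    """
    flagged = []
    for i, (rating, mean, sd) in enumerate(
        zip(llm_ratings, expert_means, expert_sds)
    ):
        z = (rating - mean) / sd
        if abs(z) > Z_CRIT:
            flagged.append(i)
    return flagged


# Made-up ratings on the -3 to +3 scale for three items (not study data).
example_llm = [2.0, -1.0, 3.0]
example_means = [1.5, -2.5, 0.5]
example_sds = [0.5, 0.5, 1.0]
example_flags = flag_outliers(example_llm, example_means, example_sds)
```

With these illustrative inputs the z-scores are 1.0, 3.0, and 2.5, so the second and third items would be flagged. The share of outliers for a model is then the number of flagged items divided by the 48 items in the inventory.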
Results:
All three LLMs rated responses as more appropriate than ratings provided by expert suicidologists (p<0.001). In terms of z-scores, 19% of ChatGPT responses (9 of 48) were outliers compared to expert suicidologists; 11% (5 of 48) of Claude responses were outliers; and 36% (5 of 48) of Gemini responses were outliers. ChatGPT produced a final SIRI-2 score (lower is better) of 45.7, roughly equivalent to master’s level counsellors in prior studies. Claude produced a SIRI-2 score of 36.7, exceeding prior performance of mental health professionals. Gemini produced a final SIRI-2 score of 54.5, equivalent to untrained K-12 school staff.
Conclusions:
Although all three LLMs demonstrated a leniency bias when evaluating appropriate responses to suicidal ideation, two of the three matched or exceeded the performance of health professionals in prior studies.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.