
Currently accepted at: JMIR Mental Health

Date Submitted: Dec 1, 2025
Date Accepted: Mar 1, 2026

This paper has been accepted and is currently in production.

It will appear shortly at DOI 10.2196/88435.

The version presented here is the final accepted manuscript, which has not yet been copyedited.

Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez; Miguel Baidal; Erik Derner; Jenn Layton Annable; Mark Ball; Mark Ince; Elvira Perez Vallejos; Nuria Oliver

ABSTRACT

Background:

The widespread use of chatbots powered by large language models (LLMs) has fundamentally reshaped how people seek information and advice across domains. Increasingly, these chatbots are being used in high-stakes contexts, including to provide emotional support and address mental health concerns. While LLMs can offer scalable support, their ability to safely detect and respond to acute mental health crises, including suicidal ideation, self-harm, and violent thoughts, remains poorly understood. Progress is hampered by the absence of unified mental health crisis taxonomies, robust annotated benchmarks, and empirical evaluations grounded in clinical best practices.

Objective:

We address these gaps by introducing (1) a unified taxonomy of six clinically informed mental health crisis categories; (2) a curated, diverse evaluation dataset of over 2,000 user inputs drawn from 12 publicly available conversational mental health datasets, each classified into one of the crisis categories; and (3) an expert-designed protocol for assessing response appropriateness. In addition, we use LLMs to automatically identify inputs indicative of a mental health crisis and conduct an auditing study of five LLMs to evaluate the appropriateness and safety of their responses.
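
For illustration only, the six-category taxonomy could be encoded as a simple enumeration. The abstract names three crisis types explicitly (suicidal ideation, self-harm, and violent thoughts); the remaining labels below are hypothetical placeholders, not the authors' actual categories.

    from enum import Enum

    class CrisisCategory(Enum):
        # Only the first three labels are named in the abstract; the
        # last three are placeholders, not the authors' categories.
        SUICIDAL_IDEATION = "suicidal ideation"
        SELF_HARM = "self-harm"
        VIOLENT_THOUGHTS = "violent thoughts"
        PLACEHOLDER_4 = "category 4 (not named in abstract)"
        PLACEHOLDER_5 = "category 5 (not named in abstract)"
        PLACEHOLDER_6 = "category 6 (not named in abstract)"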

Methods:

First, we developed a taxonomy of mental health crisis categories informed by mental health experts and an evaluation protocol based on established clinical literature. Second, we collected more than 239,000 mental-health-related textual user inputs from 12 Hugging Face datasets, from which we curated a dataset of 2,252 suitable examples (206 for validation and 2,046 for testing) covering all the categories defined in the taxonomy. Third, we evaluated three LLMs on their ability to automatically classify the curated inputs into their corresponding mental health crisis categories. We selected the model with the strongest agreement with human annotators as a judge to automatically label the 2,046 examples in the curated test set. Fourth, we audited five LLMs for their ability to generate safe and appropriate responses to the 2,046 examples in the test set. Our evaluation pipeline measures both detection and response quality using a clinically informed 5-point Likert scale ranging from harmful (1) to fully appropriate (5).
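
A minimal sketch of the LLM-as-judge labeling step described above, assuming an OpenAI-compatible chat API; the prompt wording, the label set, and the choice of gpt-4o-mini as the judge are illustrative assumptions, not the authors' released configuration.

    # Sketch of the LLM-as-judge classification step (assumptions noted above).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder label set; the paper's actual six categories are defined
    # in its released code, not reproduced here.
    CATEGORIES = ["suicidal ideation", "self-harm", "violent thoughts",
                  "other crisis", "ambiguous risk", "no crisis"]

    def classify_input(user_text: str, model: str = "gpt-4o-mini") -> str:
        """Ask the judge model to assign exactly one crisis category."""
        prompt = (
            "Classify the following message into exactly one of these "
            "categories: " + ", ".join(CATEGORIES) + ".\n\n"
            "Message: " + user_text + "\n\n"
            "Answer with the category name only."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic labels for reproducibility
        )
        return response.choices[0].message.content.strip().lower()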

Results:

Several LLMs exhibit high consistency and generally reliable behavior when responding to explicit crisis disclosures, but substantial risks remain. A non-negligible proportion of responses are rated as inappropriate or harmful, especially in the self-harm and suicidal ideation categories. We also observe substantial differences in performance across models: some LLMs, namely gpt-5-nano and deepseek-v3.2-exp, achieve very low harmful-response rates, whereas others, such as gpt-4o-mini, Llama-4-Scout-17B-16E-Instruct, and grok-4-fast-non-reasoning, generate markedly higher rates of unsafe outputs. All models exhibit systemic weaknesses, such as poor handling of indirect or ambiguous risk signals, heavy reliance on formulaic or inauthentic default replies, and frequent misalignment with user context.
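
For concreteness, per-model harmful-response rates on the 5-point scale could be computed as below; the toy data, column names, and thresholds (1 = harmful, <=2 = inappropriate or worse) are assumptions based on the scale described in the Methods, not the paper's actual results.

    import pandas as pd

    # Toy ratings table: one row per (model, input) pair, with a clinically
    # informed Likert rating from 1 (harmful) to 5 (fully appropriate).
    ratings = pd.DataFrame({
        "model":  ["gpt-5-nano", "gpt-5-nano", "gpt-4o-mini", "gpt-4o-mini"],
        "rating": [5, 4, 1, 2],
    })

    # Share of responses rated harmful (1) and inappropriate or worse (<= 2).
    summary = ratings.groupby("model")["rating"].agg(
        harmful_rate=lambda r: (r == 1).mean(),
        inappropriate_rate=lambda r: (r <= 2).mean(),
    )
    print(summary)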

Conclusions:

These findings underscore the urgent need for enhanced safeguards, improved mental health crisis detection, and context-aware interventions in LLM deployments. They also emphasize the central role of alignment and safety engineering practices, beyond model scale or openness, in determining crisis-response reliability. Our taxonomy, datasets, and evaluation framework lay the groundwork for ongoing research in AI-driven mental health support, helping to minimize harm and to better protect vulnerable users. We make the code and data publicly available at https://anonymous.4open.science/r/llms-mental-health-crisis-response/.


Citation

Please cite as:

Arnaiz-Rodriguez A, Baidal M, Derner E, Annable JL, Ball M, Ince M, Perez Vallejos E, Oliver N

Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

JMIR Mental Health. 01/03/2026:88435 (forthcoming/in press)

DOI: 10.2196/88435

URL: https://preprints.jmir.org/preprint/88435


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.