
Accepted for/Published in: JMIR Mental Health

Date Submitted: Jun 13, 2025
Open Peer Review Period: Jun 13, 2025 - Aug 8, 2025
Date Accepted: Sep 3, 2025

The final, peer-reviewed published version of this preprint can be found here:

Automated Safety Plan Scoring in Outpatient Mental Health Settings Using Large Language Models: Exploratory Study

Donnelly HK, Brown GK, Green KL, Vurgun U, Hwang S, Schriver E, Steinberg M, Reilly M, Mehta H, Labouliere C, Oquendo M, Mandell D, Mowery DL

JMIR Ment Health 2026;13:e79010

DOI: 10.2196/79010

PMID: 41505705

PMCID: 12782459

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Exploring the Potential of Large Language Models for Automated Safety Plan Scoring in Outpatient Mental Health Settings

  • Hayoung K Donnelly; 
  • Gregory K Brown; 
  • Kelly L Green; 
  • Ugurcan Vurgun; 
  • Sy Hwang; 
  • Emily Schriver; 
  • Michael Steinberg; 
  • Megan Reilly; 
  • Haitisha Mehta; 
  • Christa Labouliere; 
  • Maria Oquendo; 
  • David Mandell; 
  • Danielle L Mowery

ABSTRACT

The Safety Planning Intervention (SPI) produces a plan to help manage patients' suicide risk. High-quality safety plans – that is, those with greater fidelity to the original program model – are more effective in reducing suicide risk. We developed the Safety Planning Intervention Fidelity Rater (SPIFR), an automated tool that assesses SPI quality using three large language models (LLMs): GPT-4, LLaMA 3, and o3-mini. Using 266 deidentified safety plans from outpatient mental health settings in New York, the LLMs scored four key steps: warning signs, internal coping strategies, making environments safe, and reasons for living. We compared the predictive performance of the three LLMs, optimizing scoring systems, prompts, and parameters. LLaMA 3 and o3-mini outperformed GPT-4, with different step-specific scoring systems recommended on the basis of weighted F1-scores. These findings highlight the potential of LLMs to provide clinicians with timely and accurate feedback on SPI practice, strengthening this evidence-based suicide prevention strategy.
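The abstract compares the LLMs using weighted F1-scores, i.e., per-class F1 averaged by each class's support in the gold labels. A minimal sketch of that metric, using entirely hypothetical fidelity ratings (0–2) for one SPI step (the paper's actual data and scoring scales are not reproduced here):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1, averaged with weights equal to
    each class's share of the gold (y_true) labels."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (support[c] / total) * f1
    return score

# Hypothetical example: human rater vs. model scores for eight plans
gold = [2, 1, 0, 2, 2, 1, 0, 2]
pred = [2, 1, 1, 2, 0, 1, 0, 2]
print(round(weighted_f1(gold, pred), 3))  # → 0.754
```

Weighting by support keeps the metric meaningful when rating categories are imbalanced, as fidelity scores in clinical samples typically are; this matches the behavior of `sklearn.metrics.f1_score(average="weighted")`.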


Citation

Please cite as:

Donnelly HK, Brown GK, Green KL, Vurgun U, Hwang S, Schriver E, Steinberg M, Reilly M, Mehta H, Labouliere C, Oquendo M, Mandell D, Mowery DL

Automated Safety Plan Scoring in Outpatient Mental Health Settings Using Large Language Models: Exploratory Study

JMIR Ment Health 2026;13:e79010

DOI: 10.2196/79010

PMID: 41505705

PMCID: 12782459


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.