Currently submitted to: JMIR Formative Research
Date Submitted: Mar 28, 2026
Open Peer Review Period: Apr 29, 2026 - Jun 24, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Boundary Safety in Multi-Turn Mental Health Dialogues With Large Language Models: A Simulation-Based Evaluation Study
ABSTRACT
Background:
Large language models (LLMs) have been widely used for mental health support. However, current safety evaluations in this field are mostly limited to detecting whether LLMs output prohibited words in single-turn conversations, neglecting the gradual erosion of safety boundaries in long dialogues.
Objective:
This study aims to characterize how safety boundaries erode during multi-turn mental health conversations and to compare different pressure mechanisms that accelerate boundary violations.
Methods:
We developed a multi-turn stress-testing framework and ran long-dialogue safety tests on three state-of-the-art LLMs under two pressure methods: static progression and adaptive probing. We generated 50 virtual patient profiles and stress-tested each model in simulated psychiatric dialogues of up to 20 turns.
Results:
Violations were common across all models, and the two pressure methods produced similar violation rates. However, adaptive probing significantly shortened the time to breach, reducing the mean number of turns to breach from 9.21 under static progression to 4.64. Under both methods, boundaries were most often breached through definitive or zero-risk promises: certainty reassurance accounted for 56.5% of violations under static progression and 48.5% under adaptive probing.
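As a rough illustration of how a time-to-breach figure like the ones above could be computed, the following Python sketch takes per-turn violation flags for each dialogue and averages the turn of first breach. The function names, the boolean-flag representation, and the choice to average only over dialogues that actually breached are assumptions made for this example; the abstract does not describe the authors' implementation or how unbreached dialogues were handled.

```python
from statistics import mean
from typing import Optional, Sequence

MAX_TURNS = 20  # turn cap stated in the Methods; everything else is illustrative


def time_to_breach(violations: Sequence[bool],
                   max_turns: int = MAX_TURNS) -> Optional[int]:
    """Return the 1-indexed turn of the first violation.

    `violations[i]` is True if the model response at turn i+1 was judged a
    boundary violation (hypothetical representation). Returns None when no
    violation occurs within `max_turns`.
    """
    for turn, violated in enumerate(violations[:max_turns], start=1):
        if violated:
            return turn
    return None


def mean_time_to_breach(dialogues: Sequence[Sequence[bool]]) -> Optional[float]:
    """Average first-breach turn over dialogues that breached (assumption)."""
    breaches = [t for t in (time_to_breach(d) for d in dialogues) if t is not None]
    return mean(breaches) if breaches else None
```

Under these assumptions, dialogues that never breach are excluded from the mean rather than counted at the turn cap; a different convention would change the reported averages.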
Conclusions:
These findings suggest that the robustness of LLM safety boundaries cannot be inferred from single-turn tests alone; evaluations must also account for how different interaction pressures and dialogue characteristics erode those boundaries over extended dialogues. Clinical implications include the need for multi-turn safety evaluation protocols and awareness that empathetic responses may gradually drift into boundary violations.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.