Currently submitted to: JMIR Mental Health
Date Submitted: Mar 6, 2026
Open Peer Review Period: Mar 8, 2026 - May 3, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Large Language Model-Based Behavioral Activation Chatbot for Young People with Depression: Mixed-Methods Evaluation Using Artificial Users and Clinical Experts
ABSTRACT
Background:
Depression affects more than 280 million people globally, yet access to evidence-based psychotherapy remains severely limited by workforce shortages and stigma. Large language model (LLM)-based chatbots promise to overcome the rigidity of rule-based systems; however, their ability to deliver structured psychological interventions with clinical fidelity remains largely unverified. Unlike psychotherapists, who undergo rigorous fidelity assessments using validated clinical instruments, LLM-based chatbots have not been subjected to equivalent evaluation standards.
Objective:
This study aimed to evaluate the clinical fidelity of an LLM-based chatbot delivering behavioral activation for depression to young people and to identify limitations and opportunities for refinement through clinical expert assessment.
Methods:
We developed an LLM-based chatbot (GPT-4o) by implementing a seven-phase behavioral activation protocol for young people aged 14–29 years with depressive symptoms. We created 48 artificial users (GPT-4o) derived from clinical patient vignettes, systematically varied across seven characteristics, including depression severity, gender, and attitudes toward mental health chatbots. Ten licensed psychotherapists or advanced psychotherapy trainees (mean age 30.1 years, SD 4.12; 70% female) independently assessed sessions using the Quality of Behavioral Activation Scale (Q-BAS), a validated 14-item fidelity instrument (0–6 scale; ≥3 indicates satisfactory delivery), supplemented by therapeutic capability ratings and qualitative feedback.
Results:
The chatbot completed all seven intervention phases across all 48 sessions. The mean holistic session quality rating was 3.94 (SD 1.23) and the mean Q-BAS rating was 4.03 (SD 1.18). Thirteen of the 14 components exceeded the satisfactory threshold. Component adequacy rates ranged from 97.9% for mood assessment (n=47/48) to 56.2% for explaining positive reinforcement (n=27/48). The highest-rated components were mood assessment (M=5.42, SD=1.09) and planning activities (M=4.98, SD=1.41); the lowest were explaining positive reinforcement (M=2.92, SD=2.30) and encouraging observation of activity–mood connections (M=3.02, SD=2.04). Variance decomposition showed that 36.8% of the Q-BAS variance was attributable to session differences (variance=1.27, 95% CI 0.82–2.02) and 12.0% to component differences (variance=0.42, 95% CI 0.18–0.98). Message safety received the highest therapeutic capability rating (M=6.90, SD=0.37), with 92% of sessions rated at maximum. Therapeutic rapport received the lowest rating (M=5.13, SD=1.45). Artificial users with negative chatbot attitudes were rated as significantly more authentic than those with positive attitudes (W=387.50, p=.036), without significantly affecting fidelity scores (p=.275). Qualitatively, psychotherapists consistently identified insufficient clinical reasoning as the primary limitation, particularly the failure to verify whether activities and rewards were therapeutically appropriate.
Conclusions:
Although large language model-based chatbots can execute structured therapeutic protocols with satisfactory fidelity while maintaining high message safety, clinical reasoning remains a critical gap. Prompt-level refinements, including granular task breakdown, template-based content, embedded clinical decision rules, and explicit redirection mechanisms, were proposed to address the identified shortcomings.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.