Accepted for/Published in: JMIR AI
Date Submitted: Jul 9, 2025
Date Accepted: Jan 27, 2026
Date Submitted to PubMed: Jan 27, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
LLM Chatbots and Agentic AI Counselors: A Systematic Review of LLM-Based Mental Health Interventions
ABSTRACT
Background:
Large Language Models (LLMs) are increasingly powering conversational agents in digital mental health interventions (DMHI). Despite their growing use, there remains a lack of clarity regarding how these models are developed, evaluated, and deployed, and how well they align with ethical and clinical standards.
Objective:
This systematic review aims to examine the design, implementation, and evaluation of LLM-based mental health chatbots and agentic AI systems, focusing on their underlying model architectures, development methodologies, evaluation strategies, and deployment approaches.
Methods:
We conducted a systematic search of peer-reviewed publications and preprints through databases including PubMed, IEEE Xplore, ACL Anthology, and arXiv. Twenty studies were selected based on predefined eligibility criteria. Data extraction covered LLM types, training approaches, system architecture, evaluation metrics, and deployment context. Studies were assessed for methodological rigor, including whether external validation or clinical trial registration was conducted.
Results:
Among the 20 included studies, 45% (n=9) employed GPT-based models (GPT-2, GPT-3, GPT-4), while the remaining 55% (n=11) used fine-tuned or domain-specific variants (e.g., ClinicalT5, LLaMA, ChatGLM, Qwen). Chatbot deployment types included standalone applications (65%, n=13), virtual agents (25%, n=5), and embedded platforms (15%, n=3). Qualitative evaluation strategies, including thematic analysis and rubric-based scoring, were used in 65% (n=13) of studies, while 90% (n=19) used quantitative metrics such as BLEU, ROUGE, and perplexity. Only 10% (n=2) conducted any form of external validation, and none reported psychometric validation or standardized clinical outcome measurement. No included study reported trial registration or randomized controlled trial data.
Conclusions:
LLM-based mental health systems show potential to enhance user engagement and personalization in DMHI, particularly through adaptive, multi-modal agent structures. However, the current literature reflects limited methodological rigor, with gaps in external validation, standardization, and ethical compliance. To ensure safe and effective deployment, future research should prioritize clinical validation, robust evaluation frameworks, and transparent governance of AI behaviors.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.