Accepted for/Published in: JMIR AI

Date Submitted: Jul 9, 2025
Date Accepted: Jan 27, 2026
Date Submitted to PubMed: Jan 27, 2026

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model–Based Chatbots and Agentic AI for Mental Health Counseling: Systematic Review of Methodologies, Evaluation Frameworks, and Ethical Safeguards

Cho HN, Zheng K, Wang J, Hu D

JMIR AI 2026;5:e80348

DOI: 10.2196/80348

PMID: 41592221

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

LLM Chatbots and Agentic AI Counselors: A Systematic Review of LLM-Based Mental Health Interventions

  • Ha Na Cho; 
  • Kai Zheng; 
  • Jiayuan Wang; 
  • Di Hu

ABSTRACT

Background:

Large Language Models (LLMs) are increasingly powering conversational agents in digital mental health interventions (DMHI). Despite their growing use, there remains a lack of clarity regarding the models’ development, evaluation, and deployment processes, as well as their alignment with ethical and clinical standards.

Objective:

This systematic review aims to examine the design, implementation, and evaluation of LLM-based mental health chatbots and agentic AI systems, focusing on their underlying model architectures, development methodologies, evaluation strategies, and deployment approaches.

Methods:

We conducted a systematic search of peer-reviewed publications and preprints through databases including PubMed, IEEE Xplore, ACL Anthology, and arXiv. Twenty studies were selected based on predefined eligibility criteria. Data extraction covered LLM types, training approaches, system architecture, evaluation metrics, and deployment context. Studies were assessed for methodological rigor, including whether external validation or clinical trial registration was conducted.

Results:

Among the 20 included studies, 45% (n=9) employed GPT-based models (GPT-2, GPT-3, GPT-4), while 55% (n=11) used fine-tuned or domain-specific variants (e.g., ClinicalT5, LLaMA, ChatGLM, Qwen). Chatbot deployment types included standalone applications (65%, n=13), virtual agents (25%, n=5), and embedded platforms (15%, n=3). Evaluation strategies were predominantly qualitative (65%, n=13), including thematic analysis and rubric-based scoring, while 95% (n=19) also used quantitative metrics such as BLEU, ROUGE, and perplexity. Only 10% (n=2) conducted any form of external validation, and none reported psychometric validation or standardized clinical outcome measurement. No included study reported trial registration or randomized controlled trial data.
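The quantitative metrics named above (BLEU, ROUGE) score surface-level lexical overlap between a generated chatbot reply and a reference response rather than clinical quality. As a minimal illustrative sketch, not drawn from any included study, ROUGE-1 F1 can be computed from unigram counts:

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a candidate response and a reference response."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("the cat sat on the mat",
                     "the cat is on the mat"), 3))  # → 0.833
```

High overlap scores like this can coexist with unsafe or clinically inappropriate content, which is one reason the review flags the absence of psychometric validation and standardized clinical outcome measurement.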

Conclusions:

LLM-based mental health systems show potential to enhance user engagement and personalization in DMHI, particularly through adaptive, multi-modal agent structures. However, the current literature reflects limited methodological rigor, with gaps in external validation, standardization, and ethical compliance. To ensure safe and effective deployment, future research should prioritize clinical validation, robust evaluation frameworks, and transparent governance of AI behaviors.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.