
Accepted for/Published in: JMIR Mental Health

Date Submitted: Jun 29, 2025
Open Peer Review Period: Jun 29, 2025 - Aug 24, 2025
Date Accepted: Sep 29, 2025

The final, peer-reviewed published version of this preprint can be found here:

Sobowale K, Humphrey DK, Zhao SY

Evaluating Generative AI Psychotherapy Chatbots Used by Youth: Cross-Sectional Study

JMIR Ment Health 2025;12:e79838

DOI: 10.2196/79838

PMID: 41370787

PMCID: 12694945

Evaluating Generative Artificial Intelligence Psychotherapy Chatbots Used by Youth: A Cross-Sectional Study

  • Kunmi Sobowale; 
  • Daniel Kevin Humphrey; 
  • Sophia Yingruo Zhao

ABSTRACT

Background:

Many youth rely on direct-to-consumer generative artificial intelligence (GenAI) chatbots for mental health support, yet the quality of the psychotherapeutic capabilities of these chatbots is understudied.

Objective:

We sought to comprehensively evaluate and compare the quality of widely used GenAI chatbots with psychotherapeutic capabilities.

Methods:

In this cross-sectional study, trained raters used an evaluation framework to rate the quality of five chatbots from GenAI platforms widely used by youth. Raters roleplayed using personas of youth with mental health challenges to prompt the chatbots and facilitate conversations. Chatbot responses were generated from August to October 2024. The primary outcomes were rated scores in nine sections. The proportion of high-quality ratings (binary rating of 1) in each section was compared between chatbots using Bonferroni-corrected χ2 tests.

Results:

While GenAI chatbots were found to be accessible (104 high-quality ratings [87%]) and to avoid harmful statements and misinformation (71 of 80 [89%]), they performed poorly in their therapeutic approach (14 of 45 [35%]) and in their ability to monitor and assess risk (31 of 80 [39%]). Information on chatbot model training and knowledge was unavailable, resulting in low scores. Bonferroni-corrected χ2 tests showed statistically significant differences in chatbot quality in the background, therapeutic approach, and monitoring and risk evaluation sections. Qualitatively, raters perceived most chatbots as having strong conversational abilities but found them plagued by various issues, including fabricated content and poor handling of crisis situations.

Conclusions:

Overall, direct-to-consumer GenAI chatbots showed mixed results in terms of quality, suggesting potential for harm and demonstrating a greater need for transparency and oversight. These findings may enable youth and other stakeholders to make informed decisions about using chatbots for mental health support.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.