Currently submitted to: JMIR Mental Health
Date Submitted: Apr 1, 2026
Open Peer Review Period: Apr 1, 2026 - May 27, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Large language models as a new tool for iCBT therapists? A blinded clinician rating experiment
ABSTRACT
Background:
Internet-based cognitive behavioral therapy (iCBT) is an effective and scalable alternative to face-to-face psychotherapy, but its reach is constrained by the time therapists spend reviewing patient input and manually drafting written responses. Studies suggest that large language models (LLMs) may be capable of generating high-quality therapeutic text and may support therapists in delivering high-quality care while increasing the number of patients each therapist can serve. The suitability of LLMs in an iCBT setting, however, remains insufficiently studied.
Objective:
This study aims to evaluate the quality of LLM-generated iCBT responses to patient messages by comparing them to the quality of responses produced by humans.
Methods:
In a pre-registered blinded clinician rating experiment, experienced clinicians assessed the quality of human-produced versus LLM-generated therapist responses within a simulated iCBT treatment for functional somatic disorder. Raters were presented with stimulus material consisting of five fictitious patient messages, each paired with one human-written and one LLM-generated response. Raters assessed message/response pairs on five quality dimensions (overall quality, helpfulness, empathy, professionalism, and protocol adherence) and were asked to indicate the source of each response (human/LLM). Analyses were primarily descriptive, supplemented by exploratory statistical tests and descriptive thematic content analysis of open-ended text fields. The full pre-registered study protocol is available at https://osf.io/yxncv/.
Results:
A total of 61 raters provided data, of whom 54 were eligible and included in the analysis. Human- and LLM-generated responses were rated similarly across quality dimensions on a 1-5 scale: overall quality (LLM: M = 4.00 vs human: M = 3.96, d = 0.06), helpfulness (LLM: M = 3.85 vs human: M = 3.93, d = 0.13), professionalism (LLM: M = 4.25 vs human: M = 4.11, d = 0.24), and protocol adherence (LLM: M = 4.13 vs human: M = 4.13, d = 0.03). LLM-generated responses, however, received higher scores on empathy (LLM: M = 4.31 vs human: M = 4.08, d = 0.42). Raters identified the source of human-written responses more accurately (79%) than that of LLM-generated responses (63%). In all, 55% of raters responded to one or more open text fields. Qualitative analysis indicated that LLM-generated responses were perceived as polished but also generic and at times excessively empathetic.
Conclusions:
LLM-generated responses were judged to be of comparable quality to those written by human therapists, though qualitative feedback indicated they were at times generic and insufficiently challenging. These findings provide initial support for the feasibility of using LLMs as therapist-support tools in iCBT, but further research is needed to determine whether their integration yields tangible clinical and organizational benefits.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.