Accepted for/Published in: JMIR Mental Health
Date Submitted: Feb 12, 2024
Date Accepted: May 23, 2024
Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study
ABSTRACT
Background:
Comprehensive session summaries enable effective continuity in mental health counseling and facilitate informed therapy planning. Yet manual summarization presents a significant challenge, diverting experts' attention from the core counseling process. Advances in automatic summarization can address this issue, offering mental health professionals accessibility and efficiency by streamlining the summarization of lengthy therapy sessions. However, existing approaches often overlook the nuanced intricacies inherent in counseling interactions.
Objective:
This study benchmarks the effectiveness of state-of-the-art large language models (LLMs) in selectively summarizing various components of therapy sessions through aspect-based summarization.
Methods:
We introduce MentalCLOUDS, a counseling-component-guided summarization dataset. This benchmarking dataset consists of 191 counseling sessions paired with summaries focused on three distinct counseling components (also termed counseling aspects). Additionally, we assess the capabilities of 11 state-of-the-art LLMs on the task of component-guided summarization in counseling. The generated summaries are evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals.
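To make the quantitative evaluation concrete, the following is a minimal sketch of how the standard summarization metrics reported below (ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore) can be computed for a generated summary against an expert reference. It assumes the open-source rouge-score and bert-score Python packages; the paper does not specify its evaluation tooling, and the example summaries are hypothetical placeholders.

```python
# Minimal sketch of metric computation, assuming the `rouge-score` and
# `bert-score` packages (the paper's actual tooling is not specified).
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Hypothetical expert-written reference and model-generated candidate summary.
reference = "The client reports persistent anxiety; the counselor suggests breathing exercises."
candidate = "The counselor recommends breathing exercises for the client's ongoing anxiety."

# ROUGE-1, ROUGE-2, and ROUGE-L F-measures (lexical n-gram overlap).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(reference, candidate).items():
    print(f"{name}: {result.fmeasure:.3f}")

# BERTScore: semantic similarity via contextual token embeddings.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```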
Results:
Our findings demonstrate the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART in terms of standard quantitative metrics such as ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore across all counseling components. Further, expert evaluation reveals that Mistral outperforms both MentalLlama and MentalBART on six parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, these models share a common weakness, showing room for improvement on the opportunity costs and perceived effectiveness parameters.
Conclusions:
While LLMs fine-tuned specifically in the mental health domain exhibit better performance based on automatic evaluation scores, expert assessments indicate that these models are not yet reliable for clinical applications. Further refinement and validation are necessary before their implementation in practice.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.