
Accepted for/Published in: JMIR Formative Research

Date Submitted: May 27, 2025
Open Peer Review Period: May 27, 2025 - Jul 22, 2025
Date Accepted: Sep 9, 2025

The final, peer-reviewed published version of this preprint can be found here:

Naim PM, Sadeh-Sharvit S, Jefroykin S, Silber E, Morisson DP, Goldstein A

Preprocessing Large-Scale Conversational Datasets: A Framework and Its Application to Behavioral Health Transcripts

JMIR Form Res 2025;9:e78082

DOI: 10.2196/78082

PMID: 41135026

PMCID: 12551936

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

From Data to Dataset: A Framework for Working with Behavioral Treatment Transcripts

  • Paz Mor Naim; 
  • Shiri Sadeh-Sharvit; 
  • Samuel Jefroykin; 
  • Eddie Silber; 
  • Dennis P. Morisson; 
  • Ariel Goldstein

ABSTRACT

Background:

The rise of AI and accessible audio equipment has led to a proliferation of datasets of recorded conversation transcripts across various fields. However, automatic mass recording and transcription often produce noisy, unstructured data. First, these datasets naturally include unintended recordings, such as hallway conversations, background noise, and media (e.g., TV programs, radio, phone calls). Second, automatic speech recognition (ASR) and speaker diarization errors can result in misidentified words, speaker misattributions, and other transcription inaccuracies. As a result, large conversational transcript datasets require careful preprocessing and filtering to ensure their research utility. This challenge is particularly relevant in behavioral health contexts (e.g., therapy, treatment, counseling): while these transcripts offer valuable insights into patient-provider interactions, therapeutic techniques, and client progress, they must accurately represent the conversations to support meaningful research.

Objective:

We present a framework for preprocessing and filtering large datasets of conversational transcripts and apply it to a dataset of behavioral health transcripts from community mental health clinics across the United States. Within this framework, we explore tools to efficiently filter non-sessions: transcripts of recordings from these clinics that do not reflect a behavioral treatment session but instead capture unrelated conversations or background noise.

Methods:

Our framework integrates basic feature extraction, human annotation, and advanced applications of large language models (LLMs). We begin by mapping transcription errors and assessing the distribution of sessions and non-sessions. Next, we extract key features and analyze how outliers in these features help characterize the type of transcript. Notably, we use LLM perplexity as a measure of comprehensibility to assess transcript noise levels. Finally, we use zero-shot LLM prompting to classify transcripts as sessions or non-sessions, validating LLM decisions against expert annotations. Throughout, we prioritize data security by selecting tools that preserve anonymity and minimize the risk of data breaches.
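
As a rough illustration of the perplexity-based noise measure described above, the sketch below scores a transcript segment with a locally hosted causal language model. GPT-2 via the Hugging Face transformers library is used here only as a placeholder; the paper does not name the model, and a local model is assumed so that transcript text stays within a secure environment.

```python
# Minimal sketch (not the paper's implementation): transcript comprehensibility
# via LLM perplexity, computed with a locally hosted causal LM.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any local causal LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def transcript_perplexity(text: str, max_tokens: int = 1024) -> float:
    """Return the perplexity of a transcript segment under the local LM."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_tokens)
    with torch.no_grad():
        # Using the inputs as labels yields the mean cross-entropy over tokens.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


# Higher perplexity suggests fragmented or non-verbal content (e.g., ASR noise),
# so segments well above the typical range for sessions can be flagged for
# manual review or filtering.
print(transcript_perplexity("Therapist: How have you been feeling since our last session?"))
```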

Results:

Our findings demonstrated that basic statistical outliers, such as speaking rate, are associated with transcription errors and are observed more frequently in non-sessions than in sessions. Specifically, LLM perplexity can flag fragmented and non-verbal segments and is generally lower in sessions (permutation test mean difference = -258, p<0.05), so it can serve as an aid to filtering. Additionally, LLM-based classification distinguished sessions from non-sessions with high validity against expert annotations (κ=0.71), while also capturing the nature of the meeting.
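
A minimal sketch of the two validation steps reported above, assuming standard SciPy and scikit-learn routines: a permutation test on the difference in mean perplexity between sessions and non-sessions, and Cohen's κ between zero-shot LLM labels and expert annotations. The numbers below are hypothetical, and the paper's exact test settings may differ.

```python
# Minimal sketch (hypothetical data): permutation test on mean perplexity and
# Cohen's kappa for LLM-vs-expert agreement.
import numpy as np
from scipy.stats import permutation_test
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-transcript perplexity scores (replace with real values).
session_ppl = np.array([120.0, 95.0, 140.0, 110.0, 130.0])
non_session_ppl = np.array([420.0, 380.0, 510.0, 290.0, 460.0])

# Permutation test on the difference in mean perplexity (sessions - non-sessions).
res = permutation_test(
    (session_ppl, non_session_ppl),
    statistic=lambda x, y, axis: np.mean(x, axis=axis) - np.mean(y, axis=axis),
    vectorized=True,
    n_resamples=10_000,
)
print(f"mean difference = {res.statistic:.0f}, p = {res.pvalue:.3f}")

# Agreement between zero-shot LLM labels and expert annotations (Cohen's kappa).
expert_labels = ["session", "session", "non-session", "session", "non-session"]
llm_labels = ["session", "session", "non-session", "non-session", "non-session"]
print(f"kappa = {cohen_kappa_score(expert_labels, llm_labels):.2f}")
```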

Conclusions:

This study’s hybrid approach effectively characterizes errors, evaluates content, and distinguishes different text types within unstructured conversational datasets. It lays a foundation for research on conversational data, offering key methods and practical guidelines that serve as crucial first steps in ensuring data quality and usability, particularly in the context of mental health sessions. We highlight the importance of integrating clinical experts with AI tools while prioritizing data security throughout the process.


Citation

Please cite as:

Naim PM, Sadeh-Sharvit S, Jefroykin S, Silber E, Morisson DP, Goldstein A

Preprocessing Large-Scale Conversational Datasets: A Framework and Its Application to Behavioral Health Transcripts

JMIR Form Res 2025;9:e78082

DOI: 10.2196/78082

PMID: 41135026

PMCID: 12551936


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.