
Accepted for/Published in: JMIR Formative Research

Date Submitted: May 27, 2025
Open Peer Review Period: May 27, 2025 - Jul 22, 2025
Date Accepted: Sep 9, 2025

The final, peer-reviewed published version of this preprint can be found here:

Naim PM, Sadeh-Sharvit S, Jefroykin S, Silber E, Morisson DP, Goldstein A

Preprocessing Large-Scale Conversational Datasets: A Framework and Its Application to Behavioral Health Transcripts

JMIR Form Res 2025;9:e78082

DOI: 10.2196/78082

PMID: 41135026

PMCID: 12551936

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

From Data to Dataset: A Framework for Working with Behavioral Treatment Transcripts

  • Paz Mor Naim; 
  • Shiri Sadeh-Sharvit; 
  • Samuel Jefroykin; 
  • Eddie Silber; 
  • Dennis P. Morisson; 
  • Ariel Goldstein

ABSTRACT

Background:

The rise of AI and accessible audio equipment has led to a proliferation of datasets of recorded conversation transcripts across various fields. However, automatic mass recording and transcription often produce noisy, unstructured data. First, these datasets naturally include unintended recordings, such as hallway conversations, background noise, and media (e.g., TV programs, radio, phone calls). Second, automatic speech recognition (ASR) and speaker diarization errors can result in misidentified words, speaker misattributions, and other transcription inaccuracies. As a result, large conversational transcript datasets require careful preprocessing and filtering to ensure their research utility. This challenge is particularly relevant in behavioral health contexts (e.g., therapy, treatment, counseling): while these transcripts offer valuable insights into patient-provider interactions, therapeutic techniques, and client progress, they must accurately represent the conversations to support meaningful research.

Objective:

We present a framework for preprocessing and filtering large datasets of conversational transcripts and apply it to a dataset of behavioral health transcripts from community mental health clinics across the United States. Within this framework, we explore tools to efficiently filter non-sessions: transcripts of recordings from these clinics that do not reflect a behavioral treatment session but instead capture unrelated conversations or background noise.

Methods:

Our framework integrates basic feature extraction, human annotation, and advanced applications of large language models (LLMs). We begin by mapping transcription errors and assessing the distribution of sessions and non-sessions. Next, we extract key features and analyze how outliers in these features help characterize the type of transcript. Notably, we use LLM perplexity as a measure of comprehensibility to assess transcript noise levels. Finally, we use zero-shot LLM prompting to classify transcripts as sessions or non-sessions, validating LLM decisions against expert annotations. Throughout, we prioritize data security by selecting tools that preserve anonymity and minimize the risk of data breaches.
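
As a rough illustration of the perplexity-based noise measure described above, the sketch below scores a transcript segment with a locally hosted causal language model. GPT-2 via the Hugging Face transformers library is used here only as a placeholder; the paper does not name the model, and a local model is assumed so that transcript text stays within a secure environment.

```python
# Minimal sketch (not the paper's implementation): transcript comprehensibility
# via LLM perplexity, computed with a locally hosted causal LM.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any local causal LM could be substituted
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def transcript_perplexity(text: str, max_tokens: int = 1024) -> float:
    """Return the perplexity of a transcript segment under the local LM."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_tokens)
    with torch.no_grad():
        # Using the inputs as labels yields the mean cross-entropy over tokens.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


# Higher perplexity suggests fragmented or non-verbal content (e.g., ASR noise),
# so segments well above the typical range for sessions can be flagged for
# manual review or filtering.
print(transcript_perplexity("Therapist: How have you been feeling since our last session?"))
```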

Results:

Our findings demonstrated that basic statistical outliers, such as speaking rate, are associated with transcription errors and are observed more frequently in non-sessions than in sessions. Specifically, LLM perplexity can flag fragmented and non-verbal segments and is generally lower in sessions (permutation test mean difference = -258, p<0.05), so it can serve as an aid to filtering. Additionally, LLM-based classification distinguished sessions from non-sessions with high validity against expert annotations (κ=0.71), while also capturing the nature of the meeting.
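
A minimal sketch of the two validation steps reported above, assuming standard SciPy and scikit-learn routines: a permutation test on the difference in mean perplexity between sessions and non-sessions, and Cohen's κ between zero-shot LLM labels and expert annotations. The numbers below are hypothetical, and the paper's exact test settings may differ.

```python
# Minimal sketch (hypothetical data): permutation test on mean perplexity and
# Cohen's kappa for LLM-vs-expert agreement.
import numpy as np
from scipy.stats import permutation_test
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-transcript perplexity scores (replace with real values).
session_ppl = np.array([120.0, 95.0, 140.0, 110.0, 130.0])
non_session_ppl = np.array([420.0, 380.0, 510.0, 290.0, 460.0])

# Permutation test on the difference in mean perplexity (sessions - non-sessions).
res = permutation_test(
    (session_ppl, non_session_ppl),
    statistic=lambda x, y, axis: np.mean(x, axis=axis) - np.mean(y, axis=axis),
    vectorized=True,
    n_resamples=10_000,
)
print(f"mean difference = {res.statistic:.0f}, p = {res.pvalue:.3f}")

# Agreement between zero-shot LLM labels and expert annotations (Cohen's kappa).
expert_labels = ["session", "session", "non-session", "session", "non-session"]
llm_labels = ["session", "session", "non-session", "non-session", "non-session"]
print(f"kappa = {cohen_kappa_score(expert_labels, llm_labels):.2f}")
```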

Conclusions:

This study’s hybrid approach effectively characterizes errors, evaluates content, and distinguishes different text types within unstructured conversational datasets. It lays a foundation for research on conversational data, offering key methods and practical guidelines that serve as crucial first steps in ensuring data quality and usability, particularly in the context of mental health sessions. We highlight the importance of integrating clinical experts with AI tools while prioritizing data security throughout the process.


Citation

Please cite as:

Naim PM, Sadeh-Sharvit S, Jefroykin S, Silber E, Morisson DP, Goldstein A

Preprocessing Large-Scale Conversational Datasets: A Framework and Its Application to Behavioral Health Transcripts

JMIR Form Res 2025;9:e78082

DOI: 10.2196/78082

PMID: 41135026

PMCID: 12551936


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.