Accepted for/Published in: JMIR Mental Health
Date Submitted: Apr 26, 2023
Date Accepted: Sep 12, 2023
A Comparison of HIPAA-Compliant Transcription Services for Virtual Psychiatric Interviews
ABSTRACT
Background:
Automatic speech recognition (ASR) technology is increasingly being used for transcription in clinical contexts. Although numerous HIPAA-compliant transcription services use ASR, few studies have compared word error rates (WERs) across transcription services and diagnostic groups in a mental health setting. There has also been little research into the types of words that ASR transcriptions mistakenly generate or omit.
Objective:
This study compared the WER of three ASR transcription services (Amazon Transcribe, Zoom/Otter.ai, and Whisper/OpenAI) in interviews across three clinical categories (controls, participants experiencing depression, and participants experiencing a variety of other mental health conditions). These ASR transcription services were also compared to a commercial human transcription service, REV. Words that the transcripts mistakenly inserted or omitted were systematically analyzed by their Linguistic Inquiry and Word Count (LIWC) categories.
Methods:
Participants completed a one-time research psychiatric interview, which was recorded on a secure server. Transcriptions created by the research team were used as the gold standard from which WER was calculated. Using the Mini-International Neuropsychiatric Interview, interviewees were categorized into a control group (N = 19), a major depressive disorder (MDD) group (N = 22), or an "other" group (N = 24), for a total sample of 65 participants. Brunner-Munzel tests were used to compare independent samples, such as the diagnostic groups, and Wilcoxon signed-rank tests were used for paired comparisons of the total sample across transcription services.
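For context, WER is conventionally computed as (S + D + I) / N, where S, D, and I are word substitutions, deletions, and insertions relative to a gold-standard transcript of N words. The sketch below is a minimal illustration of this metric and the two statistical tests named above, not the authors' analysis code; the jiwer package and the per-participant WER arrays are assumptions for demonstration only.

```python
# Minimal sketch of the WER computation and statistical comparisons
# described in Methods. The jiwer package is one common way to compute
# WER; the WER arrays below are hypothetical, not study data.
import jiwer
from scipy.stats import brunnermunzel, wilcoxon

# WER = (S + D + I) / N against the gold-standard transcript
gold = "i have been feeling tired most days"
asr_output = "i been feeling tired most of the days"
print(f"WER: {jiwer.wer(gold, asr_output):.3f}")

# Brunner-Munzel compares independent groups (e.g., control vs. MDD)
# on per-participant WERs from one transcription service.
control_wers = [0.08, 0.11, 0.07, 0.09]
mdd_wers = [0.10, 0.12, 0.09, 0.08]
stat, p = brunnermunzel(control_wers, mdd_wers)
print(f"Brunner-Munzel: statistic={stat:.3f}, P={p:.3f}")

# Wilcoxon signed-rank compares paired samples: the same participants
# transcribed by two different services.
amazon_wers = [0.090, 0.080, 0.100, 0.070]
rev_wers = [0.070, 0.075, 0.090, 0.060]
stat, p = wilcoxon(amazon_wers, rev_wers)
print(f"Wilcoxon signed-rank: statistic={stat:.3f}, P={p:.3f}")
```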
Results:
There were significant differences in WER between each ASR transcription service (P < .001). Amazon Transcribe's output exhibited significantly lower WERs than the Zoom/Otter.ai and Whisper/OpenAI services. Within each service, ASR performance did not differ significantly across the three clinical categories (P > .05). A comparison between the human transcription service (REV) and the best-performing ASR service (Amazon Transcribe) demonstrated a significant difference, with REV having a slightly lower median WER (7.6% versus 8.9%). Heatmaps and spider plots were used to visualize the most common errors by LIWC category; these errors fell within three overarching categories: Conversation, Cognition, and Function.
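As a rough illustration of the error-categorization step, the sketch below tallies omitted and mistakenly generated words by lexical category. The tiny dictionary is a purely hypothetical stand-in for the proprietary LIWC lexicon, and the example words are invented.

```python
# Hypothetical sketch of tallying ASR omission/insertion errors by
# word category, standing in for the LIWC-based analysis. The category
# dictionary is illustrative only; LIWC's lexicon is proprietary.
from collections import Counter

CATEGORY = {  # toy stand-in for LIWC categories
    "i": "Function", "the": "Function", "of": "Function",
    "think": "Cognition", "know": "Cognition",
    "yeah": "Conversation", "um": "Conversation",
}

def tally_errors(omitted, inserted):
    """Count omitted/inserted words by category for one transcript."""
    counts = Counter()
    for word in omitted:
        counts[("omitted", CATEGORY.get(word.lower(), "Other"))] += 1
    for word in inserted:
        counts[("inserted", CATEGORY.get(word.lower(), "Other"))] += 1
    return counts

# Example: words the ASR dropped vs. words it mistakenly generated.
print(tally_errors(omitted=["um", "i", "think"], inserted=["the", "yeah"]))
```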
Conclusions:
Overall, these results indicate that the WER gap between manual and automated transcription services is narrowing as ASR services advance. These advances, coupled with the lower cost and faster turnaround of automated transcription, may make ASR transcriptions a more viable option within healthcare settings. However, more research is required to determine whether errors in specific types of words affect the analysis and utility of these transcriptions, particularly for specific applications and across populations that vary in clinical diagnosis, literacy level, accent, and cultural origin.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.