Accepted for/Published in: JMIR Biomedical Engineering
Date Submitted: Jan 16, 2024
Date Accepted: Feb 17, 2024
An Investigation of Deepfake Voice Detection using Speech Pause Patterns: Pilot Study
ABSTRACT
Background:
The digital era has seen an increase in reliance on digital platforms for news and information, along with the emergence of "deepfake" technology, a tool for generating synthetic media content that mimics the physical and vocal attributes of specific individuals. Deepfakes, produced by training deep learning models on extensive datasets of voice recordings, images, and video segments, pose significant threats to media authenticity. Deepfakes are employed across many fields, including entertainment, voice assistants, and voice-overs, promising increased efficiency but also carrying the risk of unethical misuse, such as impersonation and the spread of false information.
Objective:
To counteract this challenge, we propose using cues from innate biological processes to distinguish authentic human voices from cloned voices. We hypothesize that the presence or absence of certain perceptual features, such as pauses in speech, can effectively differentiate cloned from authentic audio.
Methods:
Forty-nine adult participants with diverse ethnic backgrounds and accents were recruited. Each participant recorded three control paragraphs and contributed voice samples used to train three distinct voice-cloning text-to-speech models. The cloning models then generated synthetic versions of the control paragraphs, yielding a dataset of up to nine cloned audio samples and three control samples per participant. We analyzed the speech pauses produced by biological actions such as respiration, swallowing, and cognitive processing in these samples, and we evaluated five machine learning models for deepfake detection. The models' generalization capability was assessed on unseen data incorporating a model-naive generator, a model-naive paragraph, and model-naive participants.
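For orientation, a pause-feature extraction step along these lines could be sketched as follows. This is a minimal sketch, assuming librosa's silence splitting; the top_db threshold, the function name pause_features, and the specific feature set are illustrative assumptions, since the abstract does not specify the authors' implementation.

    import numpy as np
    import librosa

    def pause_features(path, top_db=30):
        # Load the recording at its native sampling rate.
        y, sr = librosa.load(path, sr=None)
        # Non-silent (speech) intervals, in samples, relative to a top_db threshold.
        speech = librosa.effects.split(y, top_db=top_db)
        # Gaps between consecutive speech intervals are treated as pauses (seconds).
        pauses = (speech[1:, 0] - speech[:-1, 1]) / sr
        total = len(y) / sr
        return {
            "pause_count": int(len(pauses)),
            "mean_pause_s": float(pauses.mean()) if len(pauses) else 0.0,
            "max_pause_s": float(pauses.max()) if len(pauses) else 0.0,
            "pause_fraction": float(pauses.sum() / total),
        }

Per-recording feature dictionaries of this kind would then be assembled into a feature matrix for the classifiers described above.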
Results:
Our findings reveal significant (P<0.05) differences in the speech and pause patterns of authentic audio compared with cloned audio. Furthermore, we show that features derived from the speech pause pattern can reliably differentiate cloned from authentic audio. Among the machine learning models tested, an AdaBoost model demonstrated the highest performance, achieving a 5-fold cross-validation balanced accuracy of 0.81 ± 0.05 and an overall test accuracy of 0.79.
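In outline, the reported evaluation could be reproduced with scikit-learn as below. This is a hedged sketch: the AdaBoost hyperparameters and the placeholder feature matrix X and labels y are assumptions for illustration, not the authors' data or settings.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 4))      # placeholder pause-feature matrix
    y = rng.integers(0, 2, size=120)   # placeholder labels: 1 = cloned, 0 = authentic

    clf = AdaBoostClassifier(n_estimators=100, random_state=0)
    # 5-fold cross-validated balanced accuracy, as reported in the abstract.
    scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
    print(f"balanced accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")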
Conclusions:
The incorporation of perceptual, biological features into machine learning models provides a robust and effective means to distinguish authentic voice recordings from cloned samples. Given the growing prevalence of unethical deepfake applications, a reliable method for verifying the authenticity of audio sources is imperative.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.