Currently accepted at: Journal of Medical Internet Research
Date Submitted: Sep 9, 2025
Date Accepted: Apr 20, 2026
Date Submitted to PubMed: May 5, 2026
This paper has been accepted and is currently in production.
It will appear shortly on 10.2196/83790
The final accepted version has not yet been copyedited.
An "ahead-of-print" version has been submitted to PubMed; see PMID: 42084850.
Explainable and Interpretable AI for Voice and Speech Analysis in Clinical Care: A Systematic Review
ABSTRACT
Background:
Driven by recent advances in artificial intelligence (AI), particularly in medicine, audio-based voice and speech biomarkers are increasingly investigated across medical applications as a complementary or even alternative modality to traditional medical devices. The adoption of deep learning techniques in recent literature is motivated by their superior performance over classical machine learning (ML) methods. However, ethical and regulatory concerns about the black-box nature of these models have limited their integration into clinical workflows. Explainable AI (XAI) has therefore been employed to generate explanations for opaque model outputs. Ideally, medical XAI systems provide human-understandable, clinically grounded explanations, which are essential for improving the trustworthiness of AI and thereby easing its adoption in real-world clinical settings.
Objective:
We conduct a systematic literature review of XAI methods applied to explain deep learning techniques in audio-based voice and speech clinical applications. We present a taxonomy of the XAI methods found in the literature and discuss their limitations, particularly regarding their application to clinical audio, the evaluation of XAI outputs, and the relevance of the generated explanations to stakeholders. We then identify opportunities and recommendations for future clinical audio XAI design.
Methods:
This review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Six databases (IEEE Xplore, ACM, Scopus, PubMed, Web of Science, and Nature) were searched for articles published between January 2015 and February 2025. Included studies applied explainability and/or interpretability methods to deep learning techniques for clinical voice and speech audio.
Results:
We present a taxonomy of the XAI methods used in 30 eligible studies. These methods fall into four categories: (1) visualization-based techniques, (2) feature-importance and attribution methods, (3) attention-based explanations, and (4) concept detectors and model-intrinsic approaches. We find that current XAI methods and implementations lack rigorous evaluation and validation, are not suited to the unique nature of clinical audio, and do not align with stakeholder expectations and needs.
Conclusions:
This review presents a categorization of XAI techniques employed for voice and speech AI. We discuss gaps and considerations in current practice and identify opportunities for future clinical audio XAI design.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.