Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 15, 2024
Date Accepted: Nov 17, 2024
Date Submitted to PubMed: Nov 22, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Usefulness of Automatic Speech Recognition Assessment of Children with Speech Sound Disorders: Validation Study
ABSTRACT
Background:
Speech sound disorders (SSDs) are common communication challenges in children and are evaluated by speech-language pathologists using standardized tools. However, traditional evaluation methods are time-consuming, and their reliability varies somewhat among examiners.
Objective:
We developed and assessed the performance of an automatic speech recognition (ASR) model in detecting incorrect pronunciations among children with speech sound disorders (SSDs).
Methods:
The ASR model is an end-to-end model pretrained on 436,000 hours of adult voice data spanning 128 languages. It was additionally trained on 137 hours of speech from typically developing children, to adapt it to children’s voices, and on 93.6 minutes of speech from children with articulation errors, to enhance error detection. Two standardized SSD tests, the Assessment of Phonology and Articulation for Children (APAC) and the Urimal Test of Articulation and Phonology (U-TAP), were used, and the ASR transcriptions were compared with those of speech-language pathologists (SLPs).
Results:
This study included 30 children aged 3–7 years with suspected SSDs. Agreement between the SLPs and the ASR on the percentage of consonants correct (PCC) was excellent, with an intraclass correlation coefficient (ICC) of 0.984 for APAC (95% CI 0.953–0.994) and 0.978 for U-TAP (95% CI 0.941–0.990). The phoneme error rates (PERs) for APAC and U-TAP were 11.5% and 12.22%, respectively, reflecting phoneme-level discrepancies between the ASR and SLP transcriptions. The ASR disagreed with the SLPs on an average of 2.37 (APAC) and 2.7 (U-TAP) phonemes per child that the SLPs had transcribed as correct pronunciations, and on an average of 7.8 (APAC) and 7 (U-TAP) phonemes per child that the SLPs had transcribed as incorrect pronunciations.
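The two transcription-level metrics reported above can be illustrated with a minimal sketch. It assumes the conventional definitions: PER as the Levenshtein (edit) distance between the reference and hypothesis phoneme sequences divided by the reference length, and PCC as the share of target consonants judged correct. The abstract does not state the study's exact scoring procedure, so the function names and the toy phoneme labels below are hypothetical and for illustration only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER in percent, with ref as the SLP transcription (assumed reference)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def percent_consonants_correct(judgments: Sequence[bool]) -> float:
    """PCC in percent; each entry marks one target consonant as correct or not."""
    if not judgments:
        raise ValueError("at least one target consonant is required")
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical toy data, not taken from the study.
slp = ["s", "a", "g", "w", "a"]   # SLP (reference) transcription
asr = ["t", "a", "g", "a"]        # ASR (hypothesis) transcription
per = phoneme_error_rate(slp, asr)                              # one substitution + one deletion
pcc = percent_consonants_correct([True, False, True, True])     # 3 of 4 consonants correct
```

In this toy case the edit distance is 2 over 5 reference phonemes, so the PER is 40%, and the PCC is 75%.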
Conclusions:
This study demonstrates the effectiveness of ASR in identifying incorrect pronunciations in children with SSDs.