Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 15, 2025
Open Peer Review Period: Jun 16, 2025 - Aug 11, 2025
Date Accepted: Nov 3, 2025

The final, peer-reviewed published version of this preprint can be found here:

Automated Speech Analysis for Screening and Monitoring Bipolar Depression: Machine Learning Model Development and Interpretation Study

Min S, Yeum TS, Shin D, Rhee SJ, Lee H, Lee HS, Park S, Lee J, Ahn YM

JMIR Med Inform 2025;13:e79093

DOI: 10.2196/79093

PMID: 41343793

PMCID: 12715464

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Automated Speech Analysis for Screening and Monitoring Depression Using Acoustic and Linguistic Characteristics of Patients with Mood Disorders

  • Sooyeon Min; 
  • Tae-Sung Yeum; 
  • Daun Shin; 
  • Sang Jin Rhee; 
  • Hyunju Lee; 
  • Han-Sung Lee; 
  • Seongmin Park; 
  • Jihwa Lee; 
  • Yong Min Ahn

ABSTRACT

Background:

The diagnosis of depression relies on symptomatology, which challenges health professionals who must depend on subjective evaluations of patients’ reported experiences and observable behavior. Although novel machine learning approaches can objectively quantify speech changes, few studies have systematically compared the longitudinal performance of acoustic and linguistic speech markers in a clinically diagnosed sample.

Objective:

This study aimed to develop between- and within-person classifiers to assess the severity of depression and monitor changes to detect treatment response or recurrence. A secondary objective was to compare different speech modalities in predicting changes in depressive symptoms.

Methods:

We collected 348 voice audio recordings from 104 patients diagnosed with mood disorders over a one-year period. Depression severity was assessed using the Hamilton Depression Rating Scale (HAMD). Acoustic and linguistic features were extracted using the OpenSMILE toolkit and the Linguistic Inquiry and Word Count (LIWC) framework, following automatic speech recognition and machine translation. Mixed-effects multivariate linear regression was used to evaluate the associations between speech markers and HAMD scores, adjusting for covariates: age, sex, body mass index, diagnosis, and antipsychotic dosage for acoustic features; and age, sex, diagnosis, and years of education for linguistic features. Light Gradient Boosting Machine (LightGBM) and eXtreme Gradient Boosting (XGBoost) were used as base learner algorithms. We developed between-person classifiers to detect moderate-to-severe depression and within-person classifiers to detect treatment response or relapse, and compared their performance across speech modalities. Hyperparameter tuning and 95% confidence interval estimation were performed using a bootstrap bias-corrected cross-validation approach combined with a grid search. The models were validated using a separate held-out set and fivefold cross-validation.
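
To make the pipeline concrete, the sketch below illustrates the acoustic branch only (openSMILE functionals feeding a grid-searched, fivefold cross-validated gradient-boosting classifier). It is an illustrative approximation, not the authors' code: the eGeMAPS feature set, the file paths, the hyperparameter grid, and the use of the opensmile Python package with scikit-learn and LightGBM are assumptions not stated in the abstract.

```python
# Minimal sketch of the acoustic feature-extraction and classification steps
# described above. Feature set, paths, and hyperparameter grid are assumed.
import opensmile
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def extract_acoustic_features(wav_paths):
    """Extract utterance-level acoustic functionals for each recording."""
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,       # assumed feature set
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    return pd.concat([smile.process_file(p) for p in wav_paths])


def fit_between_person_classifier(features, labels):
    """Grid-searched LightGBM classifier scored by fivefold cross-validated AUC."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    search = GridSearchCV(
        LGBMClassifier(),
        param_grid={"num_leaves": [15, 31], "learning_rate": [0.05, 0.1]},  # assumed grid
        scoring="roc_auc",
        cv=cv,
    )
    search.fit(features.values, labels)
    return search
```

In practice, the linguistic branch would analogously map LIWC category counts from the machine-transcribed and translated recordings to the same labels before the two feature sets are combined.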

Results:

We identified significant differences in the acoustic and linguistic speech patterns of patients with depression. Patients with depression showed greater temporal instability in key spectral properties of their speech, diminished pitch variation, and more frequent use of words related to death and negative emotions than their counterparts. The between-person classifier combining acoustic and linguistic features detected moderate-to-severe depression with an area under the receiver operating characteristic curve (AUC) of 0.78 in a held-out set, compared with 0.64 for the model using demographic features. Within-person classifiers detected treatment response with AUCs above 0.80 and disease relapse with AUCs above 0.90 across all modalities, surpassing the demographic model (AUC 0.58 and 0.64, respectively).
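
For readers who want to gauge the uncertainty of held-out AUC comparisons of this kind, the sketch below shows a plain percentile bootstrap for a held-out AUC. This is an illustration only, not the bootstrap bias-corrected cross-validation procedure used in the study; y_true and y_score are assumed arrays of held-out labels and predicted probabilities.

```python
# Percentile-bootstrap 95% CI for a held-out AUC (illustrative; not the study's
# bias-corrected cross-validation procedure).
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_score, n_boot=2000, seed=42):
    """Return the held-out AUC and its percentile-bootstrap 95% confidence interval."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # need both classes for AUC
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), (lo, hi)
```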

Conclusions:

Between- and within-person comparisons of speech markers can be leveraged to detect and monitor depression. We demonstrate the feasibility of applying LIWC-based psycholinguistic analysis to machine-transcribed and translated speech, supporting the replicability of this approach across languages. Automated multimodal voice analysis can be integrated into digital health platforms, such as telemedicine and smartphone-based monitoring applications, providing a scalable and effective approach for expanding access to mental health care.


 Citation

Please cite as:

Min S, Yeum TS, Shin D, Rhee SJ, Lee H, Lee HS, Park S, Lee J, Ahn YM

Automated Speech Analysis for Screening and Monitoring Bipolar Depression: Machine Learning Model Development and Interpretation Study

JMIR Med Inform 2025;13:e79093

DOI: 10.2196/79093

PMID: 41343793

PMCID: 12715464


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.