Accepted for/Published in: JMIR AI

Date Submitted: Nov 25, 2024
Open Peer Review Period: Dec 23, 2024 - Feb 17, 2025
Date Accepted: Mar 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

Karlin B, Henry D, Anderson R, Cieri S, Aratow M, Shriberg E, Hoy M

Digital Phenotyping for Detecting Depression Severity in a Large Payor-Provider System: Retrospective Study of Speech and Language Model Performance

JMIR AI 2025;4:e69149

DOI: 10.2196/69149

PMID: 40605836

PMCID: 12223686

Digital Phenotyping for Detecting Depression Severity in a Large Payor-Provider System: Speech and Language Model Performance

  • Bradley Karlin; 
  • Doug Henry; 
  • Ryan Anderson; 
  • Salvatore Cieri; 
  • Michael Aratow; 
  • Elizabeth Shriberg; 
  • Michelle Hoy

ABSTRACT

Background:

There is considerable need to improve and increase the detection and measurement of depression. The use of voice as a digital biomarker of depression represents a considerable opportunity for transforming and accelerating depression identification and treatment; however, research to date has primarily consisted of small-sample feasibility or pilot studies incorporating highly controlled applications and settings. There has been limited examination of the technology in real-world use contexts.

Objective:

This retrospective study examined the performance of a speech and language machine learning model for detecting depression severity in real-world case management calls within a large payor-provider system.

Methods:

A total of 2086 recordings of case management calls that included a verbally administered PHQ-9 survey were analyzed using the machine learning (ML) model, after the portions of each recording containing the PHQ-9 survey had been manually redacted. The recordings were divided into a Development set (n=1336) and a Blind set (n=671). PHQ-8 scores were provided for the Development set for ML model refinement, whereas PHQ-8 scores for the Blind set were withheld until after the ML model's depression severity output had been reported.
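
The abstract does not say how PHQ-8 scores were obtained from the verbally administered PHQ-9. A minimal sketch, assuming the conventional approach of dropping the ninth (suicidal ideation) item and summing the remaining eight, is shown below. The function names and the example responses are hypothetical and are not taken from the study; the severity bands are the standard PHQ-8 bands at cutoffs of 5, 10, 15, and 20.

```python
def phq8_from_phq9(items):
    """Return a PHQ-8 total from nine PHQ-9 item responses.

    Assumption (not stated in the abstract): the PHQ-8 is the PHQ-9
    without its ninth item, so the total is the sum of items 1 to 8.
    Each response must be an integer from 0 to 3.
    """
    if len(items) != 9 or any(i not in (0, 1, 2, 3) for i in items):
        raise ValueError("expected nine responses, each scored 0 to 3")
    return sum(items[:8])


def severity_category(phq8):
    """Map a PHQ-8 total (0 to 24) to a standard severity band."""
    if not 0 <= phq8 <= 24:
        raise ValueError("PHQ-8 totals range from 0 to 24")
    bands = [
        (20, "severe"),
        (15, "moderately severe"),
        (10, "moderate"),
        (5, "mild"),
    ]
    for cutoff, label in bands:
        if phq8 >= cutoff:
            return label
    return "none"


# Hypothetical responses for one call, in questionnaire order.
responses = [1, 1, 2, 0, 3, 1, 2, 1, 3]
total = phq8_from_phq9(responses)
print(total, severity_category(total))
```

Dropping the ninth item leaves a 0 to 24 scale, which is why the highest severity band begins at 20.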

Results:

The Development and Blind sets were well matched for age, biological sex, and depression severity. Mean age (± SD) was 53.7 ± 16.3 years in the Development set and 51.7 ± 16.9 years in the Blind set; 68.1% of the Development set and 68.8% of the Blind set were female; and mean PHQ-8 scores (± SD) were 10.5 ± 6.1 and 10.9 ± 6.0, respectively. The concordance correlation coefficient (CCC) of the ML model was ρc=0.57 on the Development set and ρc=0.54 on the Blind set, and the mean absolute error (MAE) was 3.91 and 4.06, respectively, demonstrating strong model performance. This performance was maintained when each set was divided into subgroups by age bracket (<=39, 40-64, and >=65 years), biological sex, and the four categories of the Social Vulnerability Index (SVI, an index based on 16 social factors), with CCCs ranging from ρc=0.44 to 0.61. At PHQ-8 cutoff scores of 5, 10, 15, and 20, which mark the boundaries between the depression severity categories of none, mild, moderate, moderately severe, and severe (>=20), the area under the receiver operating characteristic curve (ROC-AUC) ranged from 0.79 to 0.83 in both the Development and Blind sets.
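
For readers unfamiliar with the three metrics reported above, the following is a minimal sketch of how each can be computed from paired true and predicted PHQ-8 scores. It is not the authors' evaluation code. The CCC follows Lin's standard definition; the ROC-AUC function assumes that a recording counts as positive when its true PHQ-8 score is at or above the cutoff and that the model's predicted score is used as the ranking variable, which the abstract does not specify. All function names and the toy data are hypothetical.

```python
from statistics import fmean


def concordance_ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population moments).

    Unlike Pearson's r, the CCC is reduced by any shift or scale
    difference between the two series, not only by scatter.
    """
    mx, my = fmean(y_true), fmean(y_pred)
    vx = fmean((x - mx) ** 2 for x in y_true)
    vy = fmean((y - my) ** 2 for y in y_pred)
    cov = fmean((x - mx) * (y - my) for x, y in zip(y_true, y_pred))
    return 2 * cov / (vx + vy + (mx - my) ** 2)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in PHQ-8 points."""
    return fmean(abs(x - y) for x, y in zip(y_true, y_pred))


def roc_auc_at_cutoff(y_true, y_pred, cutoff):
    """ROC-AUC for detecting true scores >= cutoff (assumed labeling).

    Computed as the probability that a randomly chosen positive case
    receives a higher predicted score than a randomly chosen negative
    case, with ties counted as one half.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t >= cutoff]
    neg = [p for t, p in zip(y_true, y_pred) if t < cutoff]
    if not pos or not neg:
        raise ValueError("both classes must be present at this cutoff")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not values from the study.
true_phq8 = [2, 4, 7, 9, 12, 14, 17, 21]
pred_phq8 = [3.5, 6.0, 5.0, 11.0, 10.0, 16.0, 15.0, 18.0]

print("CCC:", round(concordance_ccc(true_phq8, pred_phq8), 3))
print("MAE:", mean_absolute_error(true_phq8, pred_phq8))
for cutoff in (5, 10, 15, 20):
    print(cutoff, roc_auc_at_cutoff(true_phq8, pred_phq8, cutoff))
```

A prediction that is perfectly correlated with the true score but offset by a constant still scores below 1 on the CCC, which is one reason the CCC is often preferred to plain correlation when agreement with the scale itself matters.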

Conclusions:

Overall, the findings suggest that voice may have significant potential for detecting and measuring depression severity across a range of ages, genders, and socioeconomic categories. Such measurement may enhance treatment, improve clinical decision-making, and enable truly personalized treatment recommendations.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.