
Accepted for/Published in: JMIR Formative Research

Date Submitted: Sep 15, 2025
Date Accepted: Feb 27, 2026

The final, peer-reviewed published version of this preprint can be found here:

Multimodal Depression Detection Through Conversational Interactions with an Emotion-Aware Social Robot: Pilot Study

Liao PY, Su YQ, Chang YL, Lee YH, Fu LC, Qian XB

JMIR Form Res 2026;10:e84110

DOI: 10.2196/84110

PMID: 42044486

Multimodal Depression Detection through Conversational Interactions with an Emotion-Aware Social Robot: Pilot Study

  • Pu-Yu Liao; 
  • Yu-Quan Su; 
  • Yu-Ling Chang; 
  • Yun-Hsiang Lee; 
  • Li-Chen Fu; 
  • Xiao-bei Qian

ABSTRACT

Background:

Depression affects more than 300 million people worldwide and is a leading contributor to the global disease burden. Traditional diagnostic methods, such as structured clinical interviews, are reliable but impractical for frequent or large-scale screening. Self-report instruments such as the Patient Health Questionnaire-8 (PHQ-8) require disclosure and clinician oversight, limiting accessibility. Recent AI-based approaches leverage multimodal behavioral cues (linguistic, acoustic, and visual) for automated depression detection but remain constrained by limited adaptability, scarce annotated data, weak emotional expression in real-world settings, and the high computational cost of deployment on social assistant robots (SARs).

Objective:

This study introduces DEPRESAR-Fusion, a lightweight multimodal depression detection framework designed for natural interactions with emotion-aware SARs. The objective was to enhance detection accuracy in everyday conversations while addressing the challenges of data scarcity, weak emotional cues, and computational efficiency.

Methods:

DEPRESAR-Fusion integrates acoustic, linguistic, and visual features with an emotion-aware response module powered by large language models (LLMs) to adapt conversational strategies dynamically. To stimulate richer emotional expression, participants were exposed to emotionally evocative videos before SAR interactions. To overcome data scarcity, we augmented training with (1) public depression-related social media corpora and (2) synthetic samples generated via LLMs. The proposed multimodal fusion architecture was evaluated on benchmark clinical datasets for both binary depression classification and PHQ-8 regression tasks. Performance was compared against prior multimodal baselines using root mean square error (RMSE), mean absolute error (MAE), and classification accuracy.
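For reference, the two regression metrics named above have standard definitions that can be sketched in a few lines. The function names and the PHQ-8 score values below are illustrative, not taken from the study's data:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error over paired true/predicted PHQ-8 scores."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error over paired true/predicted PHQ-8 scores."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Hypothetical PHQ-8 targets (range 0-24) and model predictions
y_true = [4, 12, 7, 18]
y_pred = [5, 10, 7, 15]
print(round(rmse(y_true, y_pred), 3))  # 1.871
print(mae(y_true, y_pred))             # 1.5
```

RMSE penalizes large score errors more heavily than MAE, so reporting both (as done here) distinguishes models that are usually close but occasionally far off from models with uniformly moderate error.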

Results:

Participants who viewed emotional stimuli before interacting with SARs exhibited significantly higher emotional expressiveness, leading to improved model performance. Regression tasks showed lower RMSE and MAE, while classification tasks achieved significantly higher accuracy than the non-stimulus condition. DEPRESAR-Fusion outperformed prior multimodal baselines across multiple benchmark datasets, achieving state-of-the-art performance in both binary classification and PHQ-8 regression. The system maintained a lightweight architecture suitable for real-time deployment on SARs.

Conclusions:

DEPRESAR-Fusion demonstrates that integrating emotion induction, data augmentation, and lightweight multimodal fusion can enable accurate and scalable depression detection in naturalistic SAR interactions. By bridging the gap between structured clinical assessments and everyday conversations, this approach highlights the potential of SAR-based systems as non-intrusive, AI-driven tools for proactive mental health support.

Clinical Trial:

This study involving human participants was reviewed and approved by the NTHU Research Ethics Committee D (National Tsing Hua University, Taiwan) on June 18, 2021 (Protocol No. 202105013RINB). Because the primary outcome of this work was system performance evaluation rather than patient health outcomes, the study does not constitute a clinical trial as defined by ICMJE guidelines, and trial registration was not required.


 Citation

Please cite as:

Liao PY, Su YQ, Chang YL, Lee YH, Fu LC, Qian XB

Multimodal Depression Detection Through Conversational Interactions with an Emotion-Aware Social Robot: Pilot Study

JMIR Form Res 2026;10:e84110

DOI: 10.2196/84110

PMID: 42044486


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.