Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Aug 20, 2025
Date Accepted: Jan 10, 2026
Text-Based Depression Estimation Using Machine Learning with Standard Labels: A Systematic Review and Meta-Analysis
ABSTRACT
Background:
Depression affects people's daily lives and can even lead to suicidal behavior. Text-based depression estimation using natural language processing (NLP) has emerged as a feasible approach for early mental health screening. However, many existing reviews included studies with weak depression labels, which undermines the reliability of their findings and limits the practical application of automatic depression estimation (ADE) models.
Objective:
This review aimed to evaluate the predictive performance of text-based depression models that used standard labels, and to identify how text resource, text representation, model architecture, annotation source, and reporting quality contribute to performance heterogeneity.
Methods:
Following PRISMA guidelines, we systematically searched four main databases (PubMed, Scopus, IEEE Xplore, and Web of Science) for studies published between 2014 and 2025. Studies were eligible if they developed machine learning models based on participant-generated text and used validated scales or clinical diagnoses as depression labels. Pooled effect sizes (r) were calculated using random-effects meta-analysis in Comprehensive Meta-Analysis software (version 4.0). Subgroup and meta-regression analyses explored potential moderators.
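The random-effects pooling of correlations described above can be sketched with a standard Fisher z-transform plus DerSimonian-Laird approach. This is a minimal illustration only: the study values below are hypothetical, and the review itself used Comprehensive Meta-Analysis software rather than this code.

```python
import math

# Hypothetical per-model data: (correlation r, sample size n).
# These values are illustrative, not taken from the review.
studies = [(0.55, 120), (0.62, 200), (0.48, 90), (0.70, 150)]

# Fisher z-transform each correlation; the variance of z is 1/(n - 3).
zs = [0.5 * math.log((1 + r) / (1 - r)) for r, n in studies]
vs = [1.0 / (n - 3) for r, n in studies]

# Fixed-effect weights and the Q statistic (heterogeneity).
w = [1.0 / v for v in vs]
z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
Q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))

# DerSimonian-Laird estimate of between-study variance tau^2.
df = len(studies) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (Q - df) / c)

# Random-effects weights and pooled estimate on the z scale.
w_re = [1.0 / (v + tau2) for v in vs]
z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
se = math.sqrt(1.0 / sum(w_re))

# Back-transform the pooled z and its 95% CI to the r scale.
pooled_r = math.tanh(z_re)
ci = (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se))
print(f"pooled r = {pooled_r:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the confidence interval is computed on the z scale and then back-transformed, it is asymmetric around the pooled r, matching how intervals such as 0.582 (95% CI 0.487-0.663) are typically reported.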
Results:
We screened 2,047 articles and included 14 models from 10 studies in the meta-analysis. The overall pooled effect size was r = 0.582 (95% CI 0.487-0.663), indicating a strong association. Models using embedding-based features and deep model architectures showed higher predictive performance than those using traditional features and shallow models (r = 0.715 and 0.710; P <.001). Models using clinical diagnoses performed slightly better than those using self-report scales (r = 0.660 vs 0.500; P = .062). Reporting quality, assessed by TRIPOD, was positively associated with model performance (β = 0.077; P <.001), while sample size and positive rate were not significant.
Conclusions:
Text-based depression estimation models trained with standard labels perform well. Embedding-based features and deep model architectures yield better results. Clinical diagnosis labels and transcribed speech tended to yield higher performance, though these differences were not statistically significant. Transparent reporting is essential for model reproducibility and comparison. Clinical Trial: PROSPERO (CRD20251056902)
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.