
Accepted for/Published in: JMIR Formative Research

Date Submitted: Jan 13, 2025
Open Peer Review Period: Jan 13, 2025 - Jan 22, 2025
Date Accepted: Sep 19, 2025

The final, peer-reviewed published version of this preprint can be found here:

Clinical Summaries of Social Media Timelines for Mental Health Monitoring: Human Versus Large Language Model Comparative Evaluation Study


Clinical Summaries of Social Media Timelines for Mental Health Monitoring: Human vs. Large Language Model Comparative Evaluation

  • Ayal Klein
  • Jiayu Song
  • Jenny Chim
  • Liran Keren
  • Andreas Triantafyllopoulos
  • Björn Schuller
  • Maria Liakata
  • Dana Atzil-Slonim

ABSTRACT

Background:

Social media timelines contain rich signals of users' mental states but are too voluminous for direct clinical review. Although large language models (LLMs) demonstrate robust linguistic and summarization capabilities in general-purpose tasks, distilling clinically relevant insights demands deeper psychological analysis and sensitivity to each individual's unique personality and context. Accurately capturing subtle, personalized affective and behavioral patterns remains a significant challenge for current models. A thorough, systematic evaluation of LLM-generated clinical summaries is therefore essential to understand their readiness for real-world mental health monitoring.

Objective:

This study evaluates the ability of an LLM-based pipeline to generate clinically meaningful summaries of social media timelines, comparing them with summaries written by human clinicians. The summaries are structured along three key clinical aspects: an overall mental health assessment, intrapersonal and interpersonal patterns, and mental state changes over time.

Methods:

We utilize a recent state-of-the-art approach that combines a hierarchical variational autoencoder (VAE) with an LLM (LLaMA2 13B). This method first summarizes the patient's history using the VAE and then transforms this summary into a clinical narrative using the LLM. We also test both single-step and multi-step LLM-prompting techniques and devise comprehensive clinical prompts. For 30 social media timelines, model outputs were evaluated against human-written summaries through human ratings and expert qualitative analysis. Linguistic diversity was automatically measured as a proxy for personalization.

Results:

Human summaries scored highest for factual consistency (3.75) and general usefulness (3.63). The TH-VAE model outperformed LLaMA for factual consistency (3.35 vs. 3.08) and general usefulness (3.28 vs. 3.38). Both two-step models were comparable to humans in describing interpersonal and intrapersonal patterns (3.45–3.48 vs. 3.33) and changes over time (3.42 vs. 3.35–3.30). The naive LLaMA baseline scored lower on all criteria except factual consistency. Furthermore, a qualitative analysis observed that human summaries provided more accurate, deep and personalized insights, while LLMs offered more exhaustive but generic descriptions. Quantitatively, linguistic diversity was higher in human summaries both at the semantic level (mean Cohen's d = 1.19) and at the surface level (mean Cohen's d = 1.31).

Conclusions:

Current medium-sized LLMs can generate largely accurate and informative clinical summaries of social media timelines, and advanced prompting yields modest performance gains. However, they currently underperform human clinicians in capturing subtle psychological nuances and individual idiosyncrasies. Future work should integrate domain-specific fine-tuning and enhanced context modeling to improve LLM clinical fidelity.


Citation

Please cite as:

Klein A, Song J, Chim J, Keren L, Triantafyllopoulos A, Schuller B, Liakata M, Atzil-Slonim D

Clinical Summaries of Social Media Timelines for Mental Health Monitoring: Human Versus Large Language Model Comparative Evaluation Study

JMIR Form Res 2026;10:e71230

DOI: 10.2196/71230

PMID: 41894679


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.