Accepted for/Published in: JMIR Formative Research
Date Submitted: Jan 13, 2025
Open Peer Review Period: Jan 13, 2025 - Jan 22, 2025
Date Accepted: Sep 19, 2025
Clinical Summaries of Social Media Timelines for Mental Health Monitoring: Human vs. Large Language Model Comparative Evaluation
ABSTRACT
Background:
Social media timelines contain rich signals of users’ mental states but are too voluminous for direct clinical review. Although large language models (LLMs) demonstrate robust linguistic and summarization capabilities in general‑purpose tasks, distilling clinically relevant insights demands deeper psychological analysis and sensitivity to each individual’s unique personality and context. Accurately capturing subtle, personalized affective and behavioral patterns remains a significant challenge for current models. A thorough, systematic evaluation of LLM‑generated clinical summaries is therefore essential to understand their readiness for real‑world mental health monitoring.
Objective:
This study evaluates the ability of an LLM-based pipeline to generate clinically meaningful summaries of social media timelines, compared to summaries written by human clinicians. The summaries are structured along three key clinical aspects: an overall mental health assessment, intrapersonal and interpersonal patterns, and mental state changes over time.
Methods:
We utilize a recent state-of-the-art approach that combines a hierarchical variational autoencoder (TH-VAE) with an LLM (LLaMA2 13B). This method first summarizes the user's timeline with the VAE and then transforms that summary into a clinical narrative with the LLM. We also test both single-step and multi-step LLM-prompting techniques and devise comprehensive clinical prompts. For 30 social media timelines, model outputs were evaluated against human-written summaries through human ratings and expert qualitative analysis. Linguistic diversity was automatically measured as a proxy for personalization.
Results:
Human summaries scored highest for factual consistency (3.75) and general usefulness (3.63). The TH-VAE model outperformed LLaMA on factual consistency (3.35 vs. 3.08), while the two were comparable on general usefulness (3.28 vs. 3.38). Both two-step models were comparable to humans in describing interpersonal and intrapersonal patterns (3.45–3.48 vs. 3.33) and changes over time (3.42 vs. 3.30–3.35). The naive LLaMA baseline scored lower on all criteria except factual consistency. Qualitative analysis further showed that human summaries provided more accurate, deeper, and more personalized insights, whereas LLMs offered more exhaustive but generic descriptions. Quantitatively, linguistic diversity was higher in human summaries at both the semantic level (mean Cohen's d = 1.19) and the surface level (mean Cohen's d = 1.31).
Conclusions:
Current medium-size LLMs can generate largely accurate and informative clinical summaries of social media timelines, and advanced prompting boosts performance modestly. However, they currently underperform human clinicians in capturing subtle psychological nuances and individual idiosyncrasies. Future work should integrate domain‑specific fine‑tuning and enhanced context modeling to improve LLM clinical fidelity.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.