JMIR Preprints #99185: Fine-Grained Topic Modeling of Patient Speech With Large Language Models: A Multilingual Four-Cohort Study

Fine-Grained Topic Modeling of Patient Speech With Large Language Models: A Multilingual Four-Cohort Study

Gustave Cortal;
Sélim Guessoum;
Xuan-Nga Cao;
Santiago de Leon-Martinez;
Enrique Baca-Garcia

ABSTRACT

Background:

Depression is underdiagnosed worldwide, and clinicians rely on interpreting patients' subjective speech. Qualitative analysis of patient language does not scale to routine care, and existing computational approaches describe topics with keyword lists that miss clinical nuance.

Objective:

We evaluated whether large language models (LLMs) can identify clinically relevant topics across multilingual cohorts by clustering spontaneous speech transcripts with LLM embeddings and generating fine-grained natural-language cluster descriptions for interpretability. We further examined which interview questions best elicit clinically discriminant content and how sociodemographic factors modulate cluster membership.

Methods:

We analyzed spontaneous speech transcripts from four independent cohorts totaling 2,067 participants: a French general population sample (n=1,809) and three clinical samples in Italian (n=116), Chinese (n=52), and Spanish (n=90). Responses to open-ended questions were transcribed, embedded with a multilingual language model, reduced in dimensionality, and grouped by density-based clustering; a language model then summarized each cluster in a natural-language description. Cluster membership was tested for association with validated clinical scales (PHQ-9, GAD-7, AIS, MFI, MADRS, C-SSRS) and sociodemographic factors (age, education, sex).
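
The clustering step of this pipeline can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so the example below assumes plain DBSCAN as a stand-in for "density-based clustering", uses hypothetical 2D points in place of dimensionally reduced embeddings, and uses placeholder strings in place of transcripts. The `eps` and `min_samples` values are illustrative only. The final grouping shows what would be handed to a language model for summarization; no model call is made here.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels

def group_by_cluster(texts, labels):
    """Collect the texts of each cluster, skipping noise points."""
    groups = {}
    for text, label in zip(texts, labels):
        if label != -1:
            groups.setdefault(label, []).append(text)
    return groups

# Hypothetical reduced embeddings: two dense groups and one outlier.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (9.0, 0.0),
]
texts = [f"response {i}" for i in range(len(points))]  # placeholder transcripts

labels = dbscan(points, eps=0.5, min_samples=3)
grouped = group_by_cluster(texts, labels)
```

Each list in `grouped` corresponds to one cluster whose member responses would be summarized into a natural-language description; the outlier is left unassigned.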

Results:

Unsupervised clustering yielded semantically coherent clusters significantly associated with clinical scores across all four cohorts. In the French general population, clusters discriminated depression (PHQ-9 η²=0.17, P<.001), anxiety (GAD-7), insomnia (AIS), and fatigue (MFI) scores (η² 0.13–0.14, all P<.01). In the clinical cohorts, clusters discriminated depression status in the Italian sample (MADRS; Cramér's V=0.73, P<.01), MDD diagnosis in the Chinese sample (V=0.40–0.55, P<.05), and suicide risk in the Spanish sample (C-SSRS; V=0.50, P<.01). The question "Describe how you're feeling and how your nights have been" was most effective at eliciting clinically discriminant content, with clusters distinguishing qualitatively different experiences (e.g., anxiety-driven insomnia vs. age-related nocturia). In contrast, questions about past or future events yielded small effect sizes (η²≤0.03). Age and sex were independently associated with cluster membership (η²=0.25 and 0.21, respectively, for "Describe your last 24 hours"; both P<.001).
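
For readers less familiar with the two effect sizes reported above, the following sketch shows how they are commonly computed. It assumes that η² comes from a one-way comparison of a continuous scale score across clusters (between-cluster sum of squares divided by total sum of squares) and that Cramér's V comes from an uncorrected chi-square statistic on a cluster-by-category contingency table; the abstract does not state the exact procedure or any bias correction. The function names and toy numbers are hypothetical and are not study data.

```python
import math
from collections import defaultdict

def eta_squared(scores, clusters):
    """Eta squared: share of score variance explained by cluster membership."""
    groups = defaultdict(list)
    for score, cluster in zip(scores, clusters):
        groups[cluster].append(score)
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total

def cramers_v(table):
    """Cramér's V from a contingency table (rows: clusters, columns: categories).

    Uses the plain chi-square statistic, with no continuity or bias correction.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Toy scale scores for two clusters (illustrative values only).
toy_scores = [2, 4, 6, 10, 12, 14]
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_eta2 = eta_squared(toy_scores, toy_clusters)

# Toy 2x2 table: cluster (rows) by binary status (columns).
toy_table = [[8, 2], [2, 8]]
toy_v = cramers_v(toy_table)
```

Both measures range from 0 to 1; η² applies when the outcome is a continuous score (e.g., PHQ-9 totals), and Cramér's V applies when the outcome is categorical (e.g., a diagnostic status).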

Conclusions:

Fine-grained LLM-based topic modeling can scale qualitative analysis of patient speech across languages while preserving clinical interpretability. Certain interview questions elicit clinically discriminant content far more than others, and sociodemographic factors shape topic content independently of clinical status, an often-overlooked confound in computational psychiatry. This approach may support screening, personalized treatment planning, and culturally sensitive assessment.

Citation

Please cite as:

Cortal G, Guessoum S, Cao XN, de Leon-Martinez S, Baca-Garcia E

Fine-Grained Topic Modeling of Patient Speech With Large Language Models: A Multilingual Four-Cohort Study

JMIR Preprints. 23/04/2026:99185

DOI: 10.2196/preprints.99185

URL: https://preprints.jmir.org/preprint/99185

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Currently submitted to: JMIR Mental Health

Date Submitted: Apr 23, 2026

Open Peer Review Period: Apr 29, 2026 - Jun 24, 2026

(currently open for review)
