JMIR Preprints #83352: Prediction of 12-Week Remission in Depressive Disorder Using Reasoning-Based Large Language Models: Clinical Evaluation of Accuracy and Interpretability

Prediction of 12-Week Remission in Depressive Disorder Using Reasoning-Based Large Language Models: Clinical Evaluation of Accuracy and Interpretability

Jin-Hyun Park;
Hee-Ju Kang;
Ji Hyeon Jeon;
Sung-Gil Kang;
Ju-Wan Kim;
Jae-Min Kim;
Hwamin Lee

ABSTRACT

Background:

Depressive disorder affects over 300 million people globally, and only 30%-40% of patients achieve remission with initial antidepressant monotherapy. This low remission rate highlights the critical need for digital mental health tools that can identify treatment response early in the clinical pathway.

Objective:

This study aimed to evaluate whether reasoning-based large language models (LLMs) could accurately predict 12-week remission in patients with depressive disorder undergoing antidepressant monotherapy and to assess the clinical validity and interpretability of model-generated rationales for integration into digital mental health workflows.

Methods:

After excluding patients with uncommon medications (n=9) or missing biomarker data (n=32), we analyzed data from 390 patients in the MAKE Biomarker Discovery study who were undergoing first-step antidepressant monotherapy with 1 of 12 medications: escitalopram, paroxetine, sertraline, duloxetine, venlafaxine, desvenlafaxine, milnacipran, mirtazapine, bupropion, vortioxetine, tianeptine, or trazodone. Three LLMs (ChatGPT o1, o3-mini, and Claude 3.7 Sonnet) were tested using advanced prompting strategies, including zero-shot chain-of-thought, atom-of-thoughts, and our novel referencing of deep research (RoD) prompt. Model performance was evaluated using balanced accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Three psychiatrists independently assessed model outputs for clinical validity using 5-point Likert scales across multiple dimensions.
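The five performance metrics named above all derive from a 2x2 confusion matrix. The following is a minimal Python sketch of their standard definitions, assuming remission is the positive class. The function name, the class name, and the counts are hypothetical and chosen for illustration only; they are not the study's code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    balanced_accuracy: float


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> BinaryMetrics:
    """Standard metrics from confusion-matrix counts.

    Assumes remission is the positive class and that every
    denominator is nonzero (no guard against empty classes).
    """
    sensitivity = tp / (tp + fn)  # remitters correctly predicted
    specificity = tn / (tn + fp)  # non-remitters correctly predicted
    ppv = tp / (tp + fp)          # predicted remitters who remitted
    npv = tn / (tn + fn)          # predicted non-remitters who did not remit
    return BinaryMetrics(
        sensitivity=sensitivity,
        specificity=specificity,
        ppv=ppv,
        npv=npv,
        balanced_accuracy=(sensitivity + specificity) / 2,
    )


# Hypothetical counts for illustration only; NOT taken from the study.
example = binary_metrics(tp=50, fp=30, tn=70, fn=20)
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated by class imbalance in the way raw accuracy can be when most patients fall into one outcome group.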

Results:

Claude 3.7 Sonnet with 32,000 reasoning tokens using the RoD prompt achieved the highest performance (balanced accuracy=0.6697, sensitivity=0.7183, specificity=0.6210). Medication-specific analysis revealed negative predictive values exceeding 0.75 across major antidepressants, indicating particular utility in identifying likely non-responders. Clinical evaluation by psychiatrists on 5-point scales showed favorable ratings for correctness (mean 4.3, SD 0.7), consistency (mean 4.2, SD 0.8), specificity (mean 4.2, SD 0.7), helpfulness (mean 4.2, SD 1.0), and human-likeness (mean 3.6, SD 1.7).
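As a quick consistency check, balanced accuracy is conventionally the mean of sensitivity and specificity. The short Python sketch below applies that definition to the rounded values reported above; the variable names are illustrative, and the inputs are the abstract's 4-decimal figures rather than the underlying counts.

```python
# Values as reported in the abstract (already rounded to 4 decimals).
sensitivity = 0.7183
specificity = 0.6210
reported_balanced_accuracy = 0.6697

# Conventional definition: mean of sensitivity and specificity.
balanced_accuracy = (sensitivity + specificity) / 2

# Difference between the recomputed and the reported value.
difference = abs(balanced_accuracy - reported_balanced_accuracy)
```

The recomputed value is 0.66965, which agrees with the reported 0.6697 to within the rounding of the inputs.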

Conclusions:

These findings demonstrate that reasoning-based LLMs, particularly when enhanced with research-informed prompting, show promise for predicting antidepressant response and could serve as interpretable adjunctive tools in depressive disorder treatment planning, though prospective validation in real-world clinical settings remains essential.

Citation

Please cite as:

Park JH, Kang HJ, Jeon JH, Kang SG, Kim JW, Kim JM, Lee H

Prediction of 12-Week Remission in Patients With Depressive Disorder Using Reasoning-Based Large Language Models: Model Development and Validation Study

JMIR Ment Health 2026;13:e83352

DOI: 10.2196/83352

PMID: 41576265

PMCID: 12829737

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review or community review (or an accepted or rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. Although the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Mental Health

Date Submitted: Sep 1, 2025

Date Accepted: Dec 11, 2025
