JMIR Preprints #83927: Simulated Reasoning and Self-Verification for Psychiatric Diagnosis in Generalist Large Language Models: Comparative Evaluation

Simulated Reasoning and Self-Verification for Psychiatric Diagnosis in Generalist Large Language Models: Comparative Evaluation

Karthik V Sarma;
Kaitlin E Hanss;
Andrew JM Halls;
Daniel Becker;
Anne L Glowinski;
Andrew Krystal

ABSTRACT

Background:

Large language models (LLMs) and, more recently, large reasoning models (LRMs) have rapidly garnered significant interest for application in psychiatry and behavioral health. However, recent studies have identified significant shortcomings and potential risks in the performance of LLM-based systems, complicating their application to psychiatric diagnosis. Two promising approaches to addressing these challenges and improving the efficacy of these models are simulated reasoning (SR) and self-verification (SV), in which additional “reasoning tokens” are used to guide model output, either during or after inference.

Objective:

We aimed to explore how the use of SR (via LRMs) and SV (via supplemental prompting) affects the psychiatric diagnostic performance of LLMs.

Methods:

A total of 106 case vignettes and their associated diagnoses were extracted from the DSM-5-TR Clinical Cases book, with permission. For each of the two vendors studied (OpenAI and Google), one LLM and one LRM were selected from the latest available model generation. Two inference approaches were developed: a Basic approach that directly prompted models to provide diagnoses, and an SV approach that augmented the Basic approach with additional prompts. All case vignettes were processed with each combination of the four models (two LLMs and two LRMs) and the two inference approaches, and diagnostic performance was evaluated using sensitivity and positive predictive value (PPV). Linear mixed-effects models were used to test for significant differences by model vendor (OpenAI, Google), model type (LLM, LRM), and the addition of an SV prompt.
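
As an illustration of the two metrics named in the Methods, the following minimal sketch computes sensitivity and PPV from sets of predicted and reference diagnoses. It rests on assumptions that the abstract does not confirm: diagnoses are matched by exact string after simple normalization, and counts are pooled across vignettes before the ratios are taken. The function names and the example data are hypothetical and are not taken from the study.

```python
from typing import Iterable, List, Set, Tuple


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(label.lower().split())


def count_matches(predicted: Iterable[str], reference: Iterable[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one vignette."""
    pred: Set[str] = {normalize(p) for p in predicted}
    ref: Set[str] = {normalize(r) for r in reference}
    return len(pred & ref), len(pred - ref), len(ref - pred)


def sensitivity_and_ppv(cases: List[Tuple[List[str], List[str]]]) -> Tuple[float, float]:
    """Pool counts over all vignettes (an assumption), then compute both metrics."""
    tp = fp = fn = 0
    for predicted, reference in cases:
        case_tp, case_fp, case_fn = count_matches(predicted, reference)
        tp += case_tp
        fp += case_fp
        fn += case_fn
    # Sensitivity: share of reference diagnoses that the model produced.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # PPV: share of model-produced diagnoses that appear in the reference.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Made-up example data for illustration only (predicted, reference).
example_cases = [
    (["Major depressive disorder", "Generalized anxiety disorder"],
     ["major depressive disorder"]),
    (["Schizophrenia"],
     ["Schizophrenia", "Cannabis use disorder"]),
]
sens, ppv = sensitivity_and_ppv(example_cases)
print(f"sensitivity={sens:.3f}, PPV={ppv:.3f}")
```

Under this reading, an extra incorrect diagnosis lowers PPV without changing sensitivity, whereas a missed reference diagnosis lowers sensitivity without changing PPV.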

Results:

All vignettes were successfully processed by each model and inference approach. Sensitivity ranged from 0.732 to 0.817, and PPV ranged from 0.534 to 0.779. The best overall performance was found in the o3-pro LRM using SV, with a sensitivity of 0.782 and a PPV of 0.779. No statistically significant fixed effects were found for sensitivity. For PPV, statistically significant effects were found for prompt type (SV – coefficient 0.09, p=0.002), model type (LRM – coefficient 0.09, p=0.003), and the interaction between model type and vendor (LRM:OpenAI – coefficient 0.10, p=0.021).

Conclusions:

We found that both SR and SV yielded statistically significant improvements in the PPV, without significant differences in the sensitivity. The addition of the manually specified SV prompt improved the PPV even when simulated reasoning was used. This suggests that future efforts to apply language models in behavioral health may benefit from a combination of manually crafted reasoning prompts and automated SR.

Citation

Please cite as:

Sarma KV, Hanss KE, Halls AJ, Becker D, Glowinski AL, Krystal A

Simulated Reasoning and Self-Verification for Psychiatric Diagnosis in Generalist Large Language Models: Comparative Evaluation

JMIR AI 2026;5:e83927

DOI: 10.2196/83927

PMID: 42258613

PMCID: 13245640

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR AI

Date Submitted: Sep 15, 2025

Date Accepted: May 3, 2026
