
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: May 25, 2025
Date Accepted: Dec 11, 2025
Date Submitted to PubMed: Dec 18, 2025

The final, peer-reviewed published version of this preprint can be found here:

Large Language Models for Psychiatric Diagnosis Based on Multicenter Real-World Clinical Records: Comparative Study

Huang G, Sun M, Yu J, Long Z, Yang Y, Xiao T, Liang J, Feng J, Deng H

Large Language Models for Psychiatric Diagnosis Based on Multicenter Real-World Clinical Records: Comparative Study

JMIR Med Inform 2026;14:e77699

DOI: 10.2196/77699

PMID: 41408781

PMCID: 12848494

Large Language Models for Psychiatric Diagnosis Based on Multicenter Real-World Clinical Records: A Comparative Study

  • Guoping Huang; 
  • Maoqian Sun; 
  • Jia Yu; 
  • Zhuhong Long; 
  • Yun Yang; 
  • Tao Xiao; 
  • Jiaquan Liang; 
  • Jun Feng; 
  • Huaili Deng

ABSTRACT

Background:

Psychiatric disorders are common but diagnostically complex, depending heavily on the clinical experience of physicians. Where medical resources are scarce, ensuring accurate and timely diagnosis is particularly challenging. Recently, large language models (LLMs) have shown promising potential on simulated datasets, offering a new technological path for diagnosing psychiatric disorders. However, their performance in real-world clinical settings remains insufficiently evaluated.

Objective:

This study aims to systematically evaluate the diagnostic performance of LLMs in real-world psychiatric clinical settings. Using electronic health records from six provincial and municipal psychiatric centers in China, we compared the diagnostic capabilities of GPT-4.0, GPT-3.5, and the Chinese model GLM-4-Plus across age groups and psychiatric disorder categories. The objective is to examine how LLMs adapt to diverse patient populations and psychiatric conditions in practical clinical applications, providing evidence for their utility in intelligent diagnostic support for mental disorders.

Methods:

We conducted a retrospective study of the electronic health records (EHRs) of 9,923 psychiatric inpatients from six provincial and municipal psychiatric centers in China, collected between July 2017 and July 2024. The dataset covered all psychiatric disorder categories in ICD-10 (F0–F9). We evaluated the diagnostic capabilities of three representative LLMs: GPT-4.0, GPT-3.5, and GLM-4-Plus. Reference diagnoses were confirmed by experienced psychiatrists. Precision, recall, and F1-score were used to assess model outputs, with stratified comparisons by disorder type and age group.
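To make the evaluation metrics concrete, the sketch below computes per-class precision, recall, and F1, plus overall accuracy and the support-weighted average F1 reported in the Results. The label lists and ICD-10 block labels are hypothetical toy data, not the study's records; the metric definitions are standard.

```python
from collections import Counter

def per_class_prf(y_true, y_pred):
    """Per-class precision, recall, and F1 from paired label lists."""
    support = Counter(y_true)  # number of true instances per class
    scores = {}
    for c in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)
        precision = tp / pred_c if pred_c else 0.0
        recall = tp / support[c]
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores, support

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores weighted by class support."""
    scores, support = per_class_prf(y_true, y_pred)
    n = len(y_true)
    return sum(scores[c][2] * support[c] / n for c in scores)

# Toy example with hypothetical ICD-10 block labels
y_true = ["F2", "F2", "F3", "F3", "F3", "F0"]
y_pred = ["F2", "F3", "F3", "F3", "F2", "F0"]
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {acc:.3f}, weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

Weighting F1 by class support, as here, keeps high-prevalence blocks such as F2 and F3 from being drowned out by noisy scores on rare categories.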

Results:

Diagnostic accuracy of the LLMs was highest in the elderly patient group (74.3%–79.5%) and lower in the adolescent and young adult groups (53.7%–61.0% and 66.4%–70.0%, respectively). Between-group comparisons revealed statistically significant differences between the adolescent group and both the middle-aged and elderly groups, and between the young adult group and both the middle-aged and elderly groups (P < 0.001 for all comparisons). Among the models, GPT-4.0 exhibited the best diagnostic performance (overall accuracy 71.7%, weighted average F1 score 0.881), outperforming GPT-3.5 (68.8%, F1 = 0.849) and GLM-4-Plus (69.3%, F1 = 0.873). GPT-4.0 performed best in high-prevalence categories such as mood disorders (F3) and schizophrenia spectrum disorders (F2), whereas all three models showed varying performance fluctuations in lower-prevalence categories.

Conclusions:

Patient age significantly affects the performance of LLMs in diagnostic assistance, with accuracy higher for elderly patients than for younger patients. GPT-4.0 exhibits outstanding accuracy and stability in identifying psychiatric disorders, effectively supporting physicians' decision-making in clinical settings, decreasing cognitive load, and improving the overall efficiency and standardization of psychiatric diagnostic processes.

Keywords: Psychiatric disorders; Large language models; Artificial intelligence; Multicenter study; Real-world data




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.