
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 11, 2024
Date Accepted: Apr 28, 2025

The final, peer-reviewed published version of this preprint can be found here:

Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study

Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, Gorrell S, Reilly E

J Med Internet Res 2025;27:e69910

DOI: 10.2196/69910

PMID: 40392576

PMCID: 12134693

Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions

  • Kaitlin Hanss; 
  • Karthik V Sarma; 
  • Anne L Glowinski; 
  • Andrew Krystal; 
  • Ramotse Saunders; 
  • Andrew Halls; 
  • Sasha Gorrell; 
  • Erin Reilly

ABSTRACT

Background:

Large language models (LLMs) such as OpenAI’s GPT-3.5, GPT-4, and GPT-4o have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chatbot therapy. Understanding the accuracy and reliability of the psychiatric parametric knowledge encoded in these models, and developing measures of confidence in their responses (i.e., the likelihood that an LLM response is accurate), are crucial for the safe and effective integration of these tools into mental health settings.

Objective:

To assess the accuracy, reliability, and predictors of accuracy of GPT-3.5, GPT-4, and GPT-4o on standardized psychiatry multiple-choice questions (MCQs).

Methods:

A cross-sectional study was conducted in which three commonly available commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to answer 150 single-answer MCQs extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ ten times. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (i) response consistency to MCQs across the ten trials (reliability), (ii) the correlation between MCQ answer accuracy and response consistency, and (iii) the correlation between MCQ answer accuracy and the models’ self-reported confidence.
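The repeated-prompting protocol above can be sketched as follows. This is a minimal illustration, not the study's actual code: `ask_model` is a hypothetical stand-in for an LLM API call, and the consistency measure shown (fraction of trials agreeing with the modal answer) is one plausible operationalization of reliability.

```python
from collections import Counter

def evaluate_mcqs(ask_model, questions, n_trials=10):
    """Score a model on single-answer MCQs with repeated prompting.

    ask_model(question) -> a single answer choice, e.g. "B".
    questions: list of (question_text, correct_choice) pairs.
    Returns (first_attempt_accuracy, per_question_consistency).
    """
    first_correct = 0
    consistency = []  # modal-answer frequency per question, in [1/n_trials, 1.0]
    for text, correct in questions:
        answers = [ask_model(text) for _ in range(n_trials)]
        # Accuracy: scored on the first attempt only
        if answers[0] == correct:
            first_correct += 1
        # Reliability: fraction of trials agreeing with the modal answer
        modal_count = Counter(answers).most_common(1)[0][1]
        consistency.append(modal_count / n_trials)
    return first_correct / len(questions), consistency
```

With this framing, the secondary analyses reduce to correlating each question's consistency score (and, separately, the model's self-reported confidence) against a binary correctness indicator.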

Results:

GPT-3.5 answered 58% of MCQs correctly on the first attempt, while GPT-4 and GPT-4o achieved 84% and 87% accuracy, respectively; GPT-4 and GPT-4o displayed no difference in performance but significantly outperformed GPT-3.5 (p < .010). GPT-3.5 exhibited less response consistency on average compared to the other models (p < .010). MCQ response consistency was positively correlated with MCQ accuracy across all models (p < .001), whereas model self-reported confidence showed no correlation with accuracy except for GPT-3.5, where self-reported confidence was weakly inversely correlated with correctness (p < .010).

Conclusions:

To our knowledge, this is the first comprehensive evaluation of the psychiatric knowledge encoded in commercially available LLMs and the first to assess their reliability and identify predictors of response accuracy within medical domains. Findings suggest GPT-4 and GPT-4o encode accurate and reliable psychiatric knowledge, and that methods such as repeated prompting may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.
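As one illustration of how repeated prompting could yield a response-confidence measure, a majority vote over the trials gives both a consensus answer and a confidence score. This is a hypothetical sketch of the general technique, not the authors' exact method.

```python
from collections import Counter

def confidence_by_voting(answers):
    """Majority vote over repeated LLM responses to one MCQ.

    answers: list of answer choices from repeated identical prompts.
    Returns (modal_answer, confidence), where confidence is the
    fraction of trials agreeing with the modal answer.
    """
    modal, count = Counter(answers).most_common(1)[0]
    return modal, count / len(answers)
```

A downstream system might accept the modal answer only when confidence exceeds a threshold, deferring low-consistency questions to a clinician.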






© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.