
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 11, 2024
Date Accepted: Apr 28, 2025

The final, peer-reviewed published version of this preprint can be found here:

Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study

Hanss K, Sarma KV, Glowinski AL, Krystal A, Saunders R, Halls A, Gorrell S, Reilly E

J Med Internet Res 2025;27:e69910

DOI: 10.2196/69910

PMID: 40392576

PMCID: 12134693

Assessing the accuracy and reliability of large language models in psychiatry using standardized multiple-choice questions

  • Kaitlin Hanss; 
  • Karthik V Sarma; 
  • Anne L Glowinski; 
  • Andrew Krystal; 
  • Ramotse Saunders; 
  • Andrew Halls; 
  • Sasha Gorrell; 
  • Erin Reilly

ABSTRACT

Background:

Large language models (LLMs) such as OpenAI’s GPT-3.5, GPT-4, and GPT-4o have garnered early and significant enthusiasm for their potential applications within mental health, ranging from documentation support to chatbot therapy. Understanding the accuracy and reliability of the psychiatric parametric knowledge encoded in these models, and developing measures of confidence in their responses (i.e., the likelihood that an LLM response is accurate), are crucial for the safe and effective integration of these tools into mental health settings.

Objective:

To assess the accuracy, reliability, and predictors of accuracy of GPT-3.5, GPT-4, and GPT-4o on standardized psychiatry multiple-choice questions (MCQs).

Methods:

A cross-sectional study was conducted in which three commonly available commercial LLMs (GPT-3.5, GPT-4, and GPT-4o) were tested for their ability to answer 150 single-answer MCQs extracted from the Psychiatry Test Preparation and Review Manual. Each model generated answers to every MCQ ten times. Our primary outcome was the proportion of questions answered correctly by each LLM (accuracy). Secondary measures were (i) response consistency to MCQs across the ten trials (reliability), (ii) the correlation between MCQ answer accuracy and response consistency, and (iii) the correlation between MCQ answer accuracy and the models’ self-reported confidence.
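The repeated-prompting protocol above can be sketched as follows. This is a minimal illustration, not the study's actual code: `ask_model` is a hypothetical stand-in for an LLM API call, and the consistency measure shown (fraction of trials agreeing with the modal answer) is one plausible operationalization of reliability.

```python
from collections import Counter

def evaluate_mcqs(ask_model, questions, n_trials=10):
    """Score a model on single-answer MCQs with repeated prompting.

    ask_model(question) -> a single answer choice, e.g. "B".
    questions: list of (question_text, correct_choice) pairs.
    Returns (first_attempt_accuracy, per_question_consistency).
    """
    first_correct = 0
    consistency = []  # modal-answer frequency per question, in [1/n_trials, 1.0]
    for text, correct in questions:
        answers = [ask_model(text) for _ in range(n_trials)]
        # Accuracy: scored on the first attempt only
        if answers[0] == correct:
            first_correct += 1
        # Reliability: fraction of trials agreeing with the modal answer
        modal_count = Counter(answers).most_common(1)[0][1]
        consistency.append(modal_count / n_trials)
    return first_correct / len(questions), consistency
```

With this framing, the secondary analyses reduce to correlating each question's consistency score (and, separately, the model's self-reported confidence) against a binary correctness indicator.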

Results:

GPT-3.5 answered 58% of MCQs correctly on the first attempt, while GPT-4 and GPT-4o achieved 84% and 87% accuracy, respectively; GPT-4 and GPT-4o displayed no difference in performance but significantly outperformed GPT-3.5 (p < .010). GPT-3.5 exhibited less response consistency on average compared to the other models (p < .010). MCQ response consistency was positively correlated with MCQ accuracy across all models (p < .001), whereas model self-reported confidence showed no correlation with accuracy except for GPT-3.5, where self-reported confidence was weakly inversely correlated with correctness (p < .010).

Conclusions:

To our knowledge, this is the first comprehensive evaluation of the psychiatric knowledge encoded in commercially available LLMs and the first to assess their reliability and identify predictors of response accuracy within medical domains. Findings suggest GPT-4 and GPT-4o encode accurate and reliable psychiatric knowledge, and that methods such as repeated prompting may provide a measure of LLM response confidence. This work supports the potential of LLMs in mental health settings and motivates further research to assess their performance in more open-ended clinical contexts.
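As one illustration of how repeated prompting could yield a response-confidence measure, a majority vote over the trials gives both a consensus answer and a confidence score. This is a hypothetical sketch of the general technique, not the authors' exact method.

```python
from collections import Counter

def confidence_by_voting(answers):
    """Majority vote over repeated LLM responses to one MCQ.

    answers: list of answer choices from repeated identical prompts.
    Returns (modal_answer, confidence), where confidence is the
    fraction of trials agreeing with the modal answer.
    """
    modal, count = Counter(answers).most_common(1)[0]
    return modal, count / len(answers)
```

A downstream system might accept the modal answer only when confidence exceeds a threshold, deferring low-consistency questions to a clinician.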






© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.