Accepted for/Published in: JMIR Mental Health

Date Submitted: Mar 6, 2024
Date Accepted: Jun 14, 2024
Date Submitted to PubMed: Jun 14, 2024

The final, peer-reviewed published version of this preprint can be found here:

Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study

Lee C, Mohebbi M, O’Callaghan E, Winsberg M

Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study

JMIR Ment Health 2024;11:e58129

DOI: 10.2196/58129

PMID: 38876484

PMCID: 11329850

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Crisis prediction among tele-mental health patients: A large language model and expert clinician comparison

  • Christine Lee; 
  • Matthew Mohebbi; 
  • Erin O’Callaghan; 
  • Mirène Winsberg

ABSTRACT

Background:

Due to recent advances in artificial intelligence (AI), large language models (LLMs) have emerged as a powerful tool for a variety of language-related tasks, including sentiment analysis and summarization of patient-provided text. However, there is limited research on these models in the area of crisis prediction.

Objective:

This study aimed to determine the performance of OpenAI’s GPT-4 in predicting the likelihood of a mental health crisis episode based on patient-provided information at intake among users of a national telemental health platform.

Methods:

De-identified patient-provided data were pulled from specific intake questions of the Brightside telehealth platform for 260 patients who later indicated that they were experiencing suicidal ideation with a plan. An additional 200 patients treated during the same time period who did not endorse suicidal ideation at any point during treatment were randomly selected. Six Brightside clinicians (three psychologists and three psychiatrists) were shown each patient’s self-reported chief complaint and self-reported suicide attempt history but were blinded to the future course of treatment and to other reported symptoms, including suicidal ideation. They were asked a simple yes/no question regarding their prediction of suicidal ideation with a plan, along with a confidence level for that prediction. GPT-4 was prompted with similar information and asked to answer the same questions, enabling a direct comparison of its performance with that of the clinicians.
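To illustrate the setup described above, the sketch below shows one way GPT-4 could be queried per patient via the OpenAI Python client. The prompt wording, model identifier, and temperature are illustrative assumptions, not the authors’ actual configuration.

```python
# Minimal sketch (assumed prompt wording and settings, not the authors' exact setup)
# of querying GPT-4 for a yes/no crisis prediction plus a confidence level.
from typing import Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_si_with_plan(chief_complaint: str, attempt_history: Optional[str] = None) -> str:
    """Ask GPT-4 the same yes/no question posed to the clinicians."""
    details = f"Chief complaint: {chief_complaint}"
    if attempt_history is not None:
        details += f"\nSelf-reported history of suicide attempts: {attempt_history}"
    response = client.chat.completions.create(
        model="gpt-4",   # assumed model identifier
        temperature=0,   # assumed setting for reproducible answers
        messages=[
            {"role": "system",
             "content": "You are assisting with mental health intake triage."},
            {"role": "user",
             "content": (
                 details
                 + "\n\nWill this patient experience suicidal ideation with a plan "
                   "during treatment? Answer yes or no, and state your confidence "
                   "(low, medium, or high)."
             )},
        ],
    )
    return response.choices[0].message.content
```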

Results:

Overall accuracy (correctly assigning suicidal ideation [SI] with plan vs no SI across the 460 examples) for the six raters using the chief complaint alone ranged from 55.2% to 67%, with an average of 62.1%; the GPT-4-based model achieved 61.5% accuracy. Adding information regarding previous suicide attempts raised the clinicians’ average accuracy to 67.1% and GPT-4’s accuracy to 67.0%. While the overall performance of the GPT-4-based model approached that of the clinicians, its specificity was significantly lower than the clinicians’ average in both scenarios (with and without the history of previous suicide attempts). Average specificity across clinicians was 83.9% on the chief complaint alone compared with 70.5% for the GPT-4-based model, and 85.7% with the addition of the history of previous suicide attempts compared with 50.1% for the GPT-4-based model.
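For clarity, the accuracy and specificity figures above follow the standard confusion-matrix definitions; the short sketch below (illustrative only, not the authors’ analysis code) shows how they would be computed from paired predictions and observed outcomes.

```python
# Illustrative metric computation (not the authors' analysis code).
def accuracy_and_specificity(predictions, outcomes):
    """predictions, outcomes: lists of booleans, True = suicidal ideation with plan.
    Accuracy    = correct calls / all 460 cases.
    Specificity = correctly labeled no-SI cases / all 200 no-SI cases."""
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    true_negatives = sum((not p) and (not o) for p, o in zip(predictions, outcomes))
    negatives = sum(not o for o in outcomes)
    return correct / len(outcomes), true_negatives / negatives
```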

Conclusions:

GPT-4 with a simple prompt design produced results on some metrics that approached those of trained clinicians. Additional work is needed before such a model could be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the data they are trained on. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients.


Citation

Please cite as:

Lee C, Mohebbi M, O’Callaghan E, Winsberg M

Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study

JMIR Ment Health 2024;11:e58129

DOI: 10.2196/58129

PMID: 38876484

PMCID: 11329850

Per the author's request, the PDF is not available.