Accepted for/Published in: JMIR Mental Health
Date Submitted: Mar 6, 2024
Date Accepted: Jun 14, 2024
Date Submitted to PubMed: Jun 14, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Crisis prediction among tele-mental health patients: A large language model and expert clinician comparison
ABSTRACT
Background:
Due to recent advances in artificial intelligence (AI), large language models (LLMs) have emerged as powerful tools for a variety of language-related tasks, including sentiment analysis and summarization of patient-provided text. However, there is limited research on these models in the area of crisis prediction.
Objective:
This study aimed to determine the performance of OpenAI’s GPT-4 in predicting the likelihood of a mental health crisis episode based on patient-provided information at intake among users of a national telemental health platform.
Methods:
De-identified, patient-provided data were pulled from specific intake questions of the Brightside telehealth platform for 260 patients who later indicated that they were experiencing suicidal ideation with a plan. An additional 200 patients treated during the same period who did not endorse suicidal ideation at any point during treatment were randomly selected. Six Brightside clinicians (three psychologists and three psychiatrists) were shown each patient’s self-reported chief complaint and self-reported suicide attempt history but were blinded to the subsequent course of treatment and to other reported symptoms, including suicidal ideation. They were asked to answer a simple yes/no question predicting suicidal ideation with a plan and to state their confidence level. GPT-4 was prompted with similar information and asked to answer the same questions, enabling a direct comparison of performance.
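Such a prompt can be issued through OpenAI’s chat completions API. The sketch below is a minimal illustration of this kind of setup; the prompt wording, model parameters, function name, and response parsing are assumptions made for illustration, not the study’s actual configuration.

```python
# Minimal sketch of prompting GPT-4 for a yes/no crisis prediction.
# The prompt wording, temperature, and parsing are illustrative assumptions,
# not the exact configuration used in this study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_si_with_plan(chief_complaint: str, attempt_history: str = "") -> dict:
    """Ask GPT-4 whether intake text suggests future suicidal ideation with a plan."""
    details = f"Chief complaint: {chief_complaint}"
    if attempt_history:
        details += f"\nSelf-reported suicide attempt history: {attempt_history}"

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are assisting with mental health intake triage. "
                    "Answer with 'yes' or 'no' and a confidence level "
                    "(low, medium, or high) on one line, e.g. 'yes, high'."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"{details}\n\nWill this patient endorse suicidal ideation "
                    "with a plan during the course of treatment?"
                ),
            },
        ],
    )
    answer, _, confidence = response.choices[0].message.content.lower().partition(",")
    return {"prediction": "yes" in answer, "confidence": confidence.strip() or "unspecified"}
```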
Results:
Overall accuracy (correctly classifying suicidal ideation [SI] with a plan vs no SI across the 460 examples) for the six raters using the chief complaint alone ranged from 55.2% to 67.0%, with an average of 62.1%. The GPT-4-based model had 61.5% accuracy. The addition of information regarding previous suicide attempts raised the average accuracy of the clinicians to 67.1% and that of GPT-4 to 67.0%. While the overall performance of the GPT-4-based model approached that of the clinicians, its specificity was significantly lower than the clinicians’ average in both scenarios (with and without the history of previous suicide attempts). Average specificity across clinicians was 83.9% on the chief complaint alone, compared with 70.5% for the GPT-4-based model. With the addition of the history of previous suicide attempts, average specificity across the clinicians was 85.7%, compared with 50.1% for the GPT-4-based model.
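For reference, the accuracy and specificity reported above reduce to simple counts over the labeled cases. The snippet below is a generic illustration of how such metrics can be computed; the helper function is hypothetical and is not the study’s analysis code.

```python
# Generic illustration of the reported metrics: accuracy over all cases and
# specificity over the no-SI cases. Hypothetical helper, not the study's code.
def accuracy_and_specificity(predictions, labels):
    """predictions/labels are booleans: True = SI with plan, False = no SI."""
    paired = list(zip(predictions, labels))
    accuracy = sum(p == y for p, y in paired) / len(paired)
    negatives = [(p, y) for p, y in paired if not y]
    specificity = sum(not p for p, _ in negatives) / len(negatives)
    return accuracy, specificity
```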
Conclusions:
GPT-4 with a simple prompt design produced results on some metrics that approached those of trained clinicians. Additional work must be done before such a model could be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the underlying data on which they are trained. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients.