Aiding LLMs with Clinical Scoresheets: No Improvement in Neurobehavioral Diagnostic Classification from Text Using Basic Prompting
ABSTRACT
Background:
Large language models (LLMs) have demonstrated the ability to perform complex tasks traditionally requiring human intelligence. However, their use in automated diagnostics for psychiatry and behavioral sciences remains understudied.
Objective:
To evaluate whether simple prompting of LLM-based chatbots can support diagnostic classification for neuropsychiatric conditions, and whether providing clinical assessment scales improves performance.
Methods:
We tested two approaches using ChatGPT and Claude: (1) direct diagnostic querying (Direct Diagnosis) and (2) execution of chatbot-generated classification code (Code Generation). Three diagnostic datasets were used: ASDBank (autism), AphasiaBank (aphasia), and DAIC-WOZ (depression and related conditions). Each approach was evaluated with and without the aid of clinical assessment scales. Performance was compared to existing machine learning benchmarks on these datasets.
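For illustration, a minimal sketch of how the Direct Diagnosis approach could be issued programmatically is shown below, assuming the OpenAI Python client. The prompt wording, model identifier, clinical-scale excerpt, and helper function are assumptions for exposition only, not the study's actual implementation.

# Minimal sketch of the "Direct Diagnosis" approach (illustrative only).
# The exact prompts, transcript preprocessing, and scale text used in the
# study are not specified in this abstract; everything below is an assumption.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CLINICAL_SCALE = (
    "Hypothetical excerpt from a clinical assessment scale used as "
    "diagnostic guidance, e.g. item wording from a depression screener."
)

def classify_transcript(transcript: str, use_scale: bool = False) -> str:
    """Ask the chatbot for a binary diagnostic label for one participant transcript."""
    system = "You are a clinical assistant. Answer only 'positive' or 'negative'."
    prompt = (
        f"Transcript:\n{transcript}\n\n"
        "Does this participant meet criteria for the target condition?"
    )
    if use_scale:
        # Prepend the assessment scale so the model conditions on it.
        prompt = CLINICAL_SCALE + "\n\n" + prompt
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

The Code Generation approach differs only in that the chatbot is asked to write classification code over transcript-derived features, which is then executed locally; clinical scales can likewise be included in or omitted from that prompt.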
Results:
Across all three datasets, incorporating clinical assessment scales led to modest improvements in performance, but results remained inconsistent and below those reported in prior studies. On the AphasiaBank dataset, the Direct Diagnosis approach with ChatGPT 4 yielded a relatively low F1 score (0.655) and very low specificity (0.33). Using the Code Generation method, performance improved substantially, with ChatGPT 4o achieving an F1 score of 0.814, specificity of 0.786, and sensitivity of 0.843. On the ASDBank dataset, Direct Diagnosis approaches yielded lower F1 scores (0.598 for ChatGPT 4 and 0.514 for ChatGPT 4o), while the Code Generation approach with Claude 3.5 improved performance to an F1 score of 0.6, specificity of 0.67, and sensitivity of 0.69. On the DAIC-WOZ dataset, the Direct Diagnosis method produced high sensitivity (0.939) but very low specificity (0.08) and an F1 score of 0.452 with ChatGPT 4. Code Generation improved specificity (up to 0.886 with ChatGPT 4o), but F1 scores remained low overall, ranging from 0.203 to 0.33. These findings indicate that while clinical scales can help structure outputs in both approaches, they do not consistently enable LLMs to reach clinically useful diagnostic performance when using simple prompts.
Conclusions:
Current LLM-based chatbots, when prompted naïvely, underperform on psychiatric and behavioral diagnostic tasks compared to specialized machine learning models. Clinical assessment scales might modestly aid chatbot performance, but more sophisticated prompt engineering and domain integration are likely required to reach clinically actionable standards. Clinical Trial: Not applicable.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.