Accepted for/Published in: JMIR AI

Date Submitted: Mar 28, 2025
Date Accepted: Aug 8, 2025

The final, peer-reviewed published version of this preprint can be found here:

Aiding Large Language Models Using Clinical Scoresheets for Neurobehavioral Diagnostic Classification From Text: Algorithm Development and Validation

Lin K, Rasool A, Surabhi S, Mutlu C, Zhang H, Paul DW, Washington P

Aiding Large Language Models Using Clinical Scoresheets for Neurobehavioral Diagnostic Classification From Text: Algorithm Development and Validation

JMIR AI 2025;4:e75030

DOI: 10.2196/75030

PMID: 41118647

PMCID: 12587012

Aiding LLMs with Clinical Scoresheets: No Improvement in Neurobehavioral Diagnostic Classification from Text Using Basic Prompting

  • Kaiying Lin; 
  • Abdur Rasool; 
  • Saimourya Surabhi; 
  • Cezmi Mutlu; 
  • Haopeng Zhang; 
  • Dennis Wall Paul; 
  • Peter Washington

ABSTRACT

Background:

Large language models (LLMs) have demonstrated the ability to perform complex tasks traditionally requiring human intelligence. However, their use in automated diagnostics for psychiatry and behavioral sciences remains understudied.

Objective:

To evaluate whether simple prompting of LLM-based chatbots can support diagnostic classification for neuropsychiatric conditions, and whether providing clinical assessment scales improves performance.

Methods:

We tested two approaches using ChatGPT and Claude: (1) direct diagnostic querying and (2) execution of chatbot-generated code for classification. Three diagnostic datasets were used: ASDBank (autism), AphasiaBank (aphasia), and DAIC-WOZ (depression and related conditions). Each approach was evaluated with and without the aid of clinical assessment scales. Performance was compared to existing machine learning benchmarks on these datasets.

Results:

Across all three datasets, incorporating clinical assessment scales led to modest performance improvements, but results remained inconsistent and below those reported in prior studies. On the AphasiaBank dataset, the Direct Diagnosis approach with ChatGPT 4 yielded a relatively low F1 score (0.655) and very low specificity (0.33). With the Code Generation method, performance improved substantially: ChatGPT 4o achieved an F1 score of 0.814, specificity of 0.786, and sensitivity of 0.843. On the ASDBank dataset, Direct Diagnosis approaches yielded lower F1 scores (0.598 for ChatGPT 4 and 0.514 for ChatGPT 4o), while the Code Generation approach with Claude 3.5 improved performance to an F1 score of 0.6, specificity of 0.67, and sensitivity of 0.69. On the DAIC-WOZ dataset, the Direct Diagnosis method with ChatGPT 4 produced high sensitivity (0.939) but very low specificity (0.08) and an F1 score of 0.452. Code Generation improved specificity (up to 0.886 with ChatGPT 4o), but F1 scores remained low overall, ranging from 0.203 to 0.33. These findings indicate that while clinical scales can help structure outputs in both approaches, they do not consistently enable LLMs to reach clinically useful diagnostic performance when prompted simply.
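The F1, specificity, and sensitivity figures above follow the standard binary-classification definitions. As a sketch, they can be computed directly from confusion-matrix counts (the counts below are illustrative, not from the study):

```python
# Standard diagnostic metrics from binary confusion-matrix counts.

def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity (recall), specificity, and F1 score."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}

# Illustrative counts only (not data from the paper).
m = diagnostic_metrics(tp=8, fp=3, tn=7, fn=2)
```

A classifier with high sensitivity but very low specificity, as seen with Direct Diagnosis on DAIC-WOZ, labels nearly everything positive, which inflates recall while depressing F1.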

Conclusions:

Current LLM-based chatbots, when prompted naïvely, underperform on psychiatric and behavioral diagnostic tasks compared to specialized machine learning models. Clinical assessment scales might modestly aid chatbot performance, but more sophisticated prompt engineering and domain integration are likely required to reach clinically actionable standards. Clinical Trial: Not applicable.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.