Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Dec 25, 2023
Date Accepted: May 25, 2024

The final, peer-reviewed published version of this preprint can be found here:

JMIR Med Inform 2024;12:e55799

DOI: 10.2196/55799

Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study

  • Qingxia Wu; 
  • Qingxia Wu; 
  • Huali Li; 
  • Yan Wang; 
  • Yan Bai; 
  • Yaping Wu; 
  • Xuan Yu; 
  • Xiaodong Li; 
  • Pei Dong; 
  • Jon Xue; 
  • Dinggang Shen; 
  • Meiyun Wang

ABSTRACT

Background:

Large language models (LLMs) show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.

Objective:

To evaluate three LLM chatbots, Claude-2, GPT-3.5, and GPT-4, on assigning RADS categories to radiology reports, and to assess the impact of different prompting strategies.

Methods:

This cross-sectional study compared the three chatbots on 30 radiology reports (10 per RADS system) using a three-level prompting strategy: zero-shot prompts, few-shot exemplar prompts (Prompt-1), and few-shot prompts supplemented with guideline PDFs (Prompt-2). The cases, prepared by board-certified radiologists, were grounded in LI-RADS® CT/MRI v2018, Lung-RADS® v2022, and O-RADS™ MRI. Each report underwent six assessments. Two blinded reviewers assessed the chatbots' responses on patient-level RADS categorization and overall rating. Agreement across repetitions was assessed using the Fleiss κ.
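
To make the three prompt levels concrete, here is a minimal sketch in Python. The prompt wording, the example report, and the helper build_prompt are illustrative assumptions, not the study's actual materials.

```python
# Minimal sketch of the three prompt levels described above. The wording,
# the example report, and this helper are illustrative assumptions, not
# the study's actual prompts.

ZERO_SHOT_TEMPLATE = (
    "You are a radiologist. Assign the appropriate {system} category to the "
    "following report and briefly justify your choice.\n\nReport:\n{report}"
)

# A single worked exemplar, standing in for the structured exemplar of Prompt-1.
FEW_SHOT_EXEMPLAR = (
    "Example report: 25 mm observation with nonrim arterial phase "
    "hyperenhancement and washout on MRI.\n"
    "Example answer: LR-5.\n\n"
)

def build_prompt(report: str, system: str, level: int, guideline_text: str = "") -> str:
    """Build a prompt at level 0 (zero-shot), 1 (few-shot exemplar, Prompt-1),
    or 2 (exemplar plus guideline PDF text, Prompt-2)."""
    prompt = ZERO_SHOT_TEMPLATE.format(system=system, report=report)
    if level >= 1:
        prompt = FEW_SHOT_EXEMPLAR + prompt
    if level >= 2:
        prompt = "Guideline excerpt:\n" + guideline_text + "\n\n" + prompt
    return prompt

# Usage: a Prompt-2 query for a made-up LI-RADS report.
print(build_prompt(
    report="1.8 cm hepatic observation with arterial hyperenhancement ...",
    system="LI-RADS v2018",
    level=2,
    guideline_text="LR-5: >=20 mm with APHE and nonperipheral washout ...",
))
```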

Results:

Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (Prompt-2), attaining 57% (17/30) average accuracy over six runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. Introducing a structured exemplar prompt (Prompt-1) increased the accuracy of overall ratings for all chatbots. Providing Prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. Inter-run agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization) and fair for GPT-4 (κ=0.39 for both) and GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS v2018 than with Lung-RADS v2022 and O-RADS (P<.05); with Prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) on LI-RADS v2018.
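
As a rough illustration of the evaluation mechanics above, the sketch below computes inter-run agreement with the Fleiss κ (via statsmodels) and applies a simple majority vote across repeated runs, which is one plausible reading of k-pass voting; the run data are invented, and the study's actual voting procedure may differ.

```python
# Sketch: Fleiss kappa across repeated runs, plus a simple majority vote.
# The labels below are invented; "k-pass voting" is read here as majority
# voting over k repeated runs, which is an assumption about the method.
from collections import Counter

from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows: reports; columns: 6 independent runs of the same chatbot.
runs = [
    ["LR-5", "LR-5", "LR-4", "LR-5", "LR-5", "LR-5"],
    ["LR-3", "LR-4", "LR-3", "LR-3", "LR-3", "LR-3"],
    ["LR-M", "LR-M", "LR-M", "LR-4", "LR-M", "LR-M"],
    # ... one row per report (30 in the study)
]

# aggregate_raters turns label assignments into a reports-by-categories
# count table, which is the input format fleiss_kappa expects.
table, _categories = aggregate_raters(runs)
print(f"Fleiss kappa across runs: {fleiss_kappa(table):.2f}")

def majority_vote(answers):
    """Return the most frequent answer across repeated runs."""
    label, _count = Counter(answers).most_common(1)[0]
    return label

voted = [majority_vote(r) for r in runs]
print("Voted categories:", voted)
```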

Conclusions:

When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS v2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.


Citation

Please cite as:

Wu Q, Wu Q, Li H, Wang Y, Bai Y, Wu Y, Yu X, Li X, Dong P, Xue J, Shen D, Wang M

Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study

JMIR Med Inform 2024;12:e55799

DOI: 10.2196/55799

PMID: 39018102

PMCID: 11292156
