Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Dec 25, 2023
Date Accepted: May 25, 2024
Evaluating Large Language Models for Automated Reporting and Data Systems Categorization
ABSTRACT
Background:
Large language models (LLMs) show promise for improving radiology workflows, but their performance on structured radiologic tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.
Objective:
To evaluate three LLM chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to simulated radiology cases and to assess the impact of different prompting strategies.
Methods:
This cross-sectional study compared three chatbots on 30 simulated radiology reports (10 per RADS system) using a three-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in LI-RADS® CT/MRI v2018, Lung-RADS® v2022, and O-RADS™ MRI and were prepared by board-certified radiologists. Each report underwent six assessments. Two blinded reviewers assessed each chatbot response for patient-level RADS categorization and overall rating. Agreement across repetitions was assessed with the Fleiss kappa.
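For readers reproducing the agreement analysis, the sketch below shows how inter-run agreement could be computed with the Fleiss kappa via statsmodels; the labels, array shapes, and variable names are illustrative assumptions, not drawn from the study's materials.

```python
# Minimal sketch, assuming each of the 6 runs assigned one RADS category
# per simulated case; the labels below are placeholders, not study data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = simulated cases, columns = repeated runs (6 assessments per report)
runs = np.array([
    ["LR-3", "LR-3", "LR-4", "LR-3", "LR-3", "LR-3"],
    ["LR-5", "LR-5", "LR-5", "LR-5", "LR-5", "LR-5"],
    ["LR-4", "LR-3", "LR-4", "LR-4", "LR-4", "LR-4"],
    # ... one row per simulated case (30 in the study)
])

counts, categories = aggregate_raters(runs)   # case-by-category count table
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss kappa across repeated runs: {kappa:.2f}")
```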
Results:
Claude-2 achieved the highest overall-rating accuracy when given few-shot prompts with guideline PDFs (Prompt-2), attaining 57% (17/30) average accuracy over six runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. Introducing a structured exemplar prompt (Prompt-1) increased overall-rating accuracy for all chatbots, and Prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. Inter-run agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization) and fair for GPT-4 (κ=0.39 for both) and GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots were significantly more accurate with LI-RADS v2018 than with Lung-RADS v2022 and O-RADS (p<0.05).
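As a point of reference for the k-pass voting mentioned above, a minimal sketch of majority voting across repeated runs follows; the function name and example values are illustrative assumptions, not the study's code.

```python
# Minimal sketch: k-pass (majority) voting over repeated chatbot runs.
# The most frequent category across k runs becomes the final answer.
# Ties fall back to the first-seen category, a simplification.
from collections import Counter

def k_pass_vote(run_outputs: list[str]) -> str:
    """Return the majority category across k repeated runs."""
    return Counter(run_outputs).most_common(1)[0][0]

# Example: six runs on one simulated case
print(k_pass_vote(["LR-4", "LR-4", "LR-3", "LR-4", "LR-4", "LR-5"]))  # LR-4
```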
Conclusions:
When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential for assigning RADS categories to radiology cases according to established criteria such as LI-RADS v2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.