JMIR Preprints #68182: Evaluation of Large Language Models for Radiologists’ Support in Multidisciplinary Breast Cancer Teams: A Comparative Study of AI Performance and Human Expertise

Evaluation of Large Language Models for Radiologists’ Support in Multidisciplinary Breast Cancer Teams: A Comparative Study of AI Performance and Human Expertise

Hong Jiang;
Chun Yang;
Wenbin Zhou;
Cheng-liang Yin;
Shan Zhou;
Rui He;
Guanghui Ran;
Wujie Wang;
Meixian Wu;
Juan Yu

ABSTRACT

Background:

Artificial intelligence (AI), particularly large language models (LLMs), has shown potential across various domains, but the performance of LLMs in multidisciplinary teams (MDTs) remains largely unknown.

Objective:

This study aimed to evaluate the performance of LLMs in supporting radiologists within multidisciplinary breast cancer teams, focusing on their roles in clinical decision-making and in improving patient care.

Methods:

A set of 50 questions on breast cancer assessment was developed from radiological and breast cancer guidelines. These questions were posed to nine LLMs (ChatGPT-4.0, ChatGPT-4o, ChatGPT-4o mini, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, Tongyi Qianwen 2.5, ChatGLM, and Ernie Bot 3.5) and to clinical physicians. Responses were evaluated for accuracy, confidence, and consistency against the 2023 NCCN Breast Cancer Guidelines and the 2013 ACR BI-RADS recommendations.

Results:

Claude 3 Opus and ChatGPT-4 achieved the highest confidence scores (2.78 and 2.74, respectively). For consistency, Claude 3 Opus and Claude 3.5 Sonnet led with scores of 3.0, followed by ChatGPT-4o and Gemini 1.5 Pro. ChatGPT-4o mini excelled in clinical diagnostics with a score of 3.0, outperforming the physician groups. ChatGPT-4 also scored higher than physicians in several areas, while ChatGLM and Ernie Bot 3.5 underperformed. Attending physicians and residents scored similarly, and fellows scored slightly lower in radiotherapy.

Conclusions:

LLMs such as ChatGPT-4o and Claude 3 Opus show promise for breast cancer diagnostics and MDT support, but they cannot fully replace clinical experience in complex cases. Further refinement of these models is needed before they are clinically applicable.

Citation

Please cite as:

Jiang H, Yang C, Zhou W, Yin Cl, Zhou S, He R, Ran G, Wang W, Wu M, Yu J

Evaluation of Large Language Models for Radiologists’ Support in Multidisciplinary Breast Cancer Teams: Comparative Study

JMIR Med Inform 2026;14:e68182

DOI: 10.2196/68182

PMID: 41628437

PMCID: 12910264

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 16, 2024

Date Accepted: Dec 2, 2025
