Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Nov 16, 2024
Date Accepted: Dec 2, 2025
Evaluation of Large Language Models for Radiologists’ Support in Multidisciplinary Breast Cancer Teams: A Comparative Study of AI Performance and Human Expertise
ABSTRACT
Background:
Artificial intelligence (AI), particularly large language models (LLMs), has shown potential across various domains, but the performance of LLMs in multidisciplinary teams (MDTs) remains largely unknown.
Objective:
This study aimed to evaluate LLMs' performance in supporting radiologists within multidisciplinary breast cancer teams, focusing on their roles in clinical decision-making and improving patient care.
Methods:
A set of 50 questions on breast cancer, related to radiological and breast cancer guidelines, was developed. These questions were posed to nine LLMs (ChatGPT-4.0, ChatGPT-4o, ChatGPT-4o mini, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, Tongyi Qianwen 2.5, ChatGLM, Ernie Bot 3.5) and clinical physicians. Responses were evaluated for accuracy, confidence, and consistency, based on the 2023 NCCN Breast Cancer Guidelines and 2013 ACR BI-RADS recommendations.
Results:
Claude 3 Opus and ChatGPT-4 achieved the highest confidence scores (2.78 and 2.74, respectively). For consistency, Claude 3 Opus and Claude 3.5 Sonnet led with scores of 3.0, followed by ChatGPT-4o and Gemini 1.5 Pro. ChatGPT-4o mini excelled in clinical diagnostics with a score of 3.0, outperforming physician groups. ChatGPT-4 also scored higher than physicians across several areas, while ChatGLM and Ernie Bot 3.5 underperformed. Attending physicians and residents scored similarly, with fellows showing slightly lower scores in radiotherapy.
Conclusions:
LLMs like ChatGPT-4o and Claude 3 Opus show promise for breast cancer diagnostics and MDT support, but they cannot fully replace clinical experience in complex cases. Further AI refinement is needed for clinical applicability.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.