JMIR Preprints #93890: The Performance of ChatGPT-4o and DeepSeek-R1 in Interpreting Thyroid Nodule Ultrasound Text Report: A Multicenter Study

The Performance of ChatGPT-4o and DeepSeek-R1 in Interpreting Thyroid Nodule Ultrasound Text Report: A Multicenter Study

Yujie Xie;
Jiarui Liu;
Bing Zhan;
Kangfan Zhang;
Yuchen Li;
Chunping Ning

ABSTRACT

Background:

Clinicians vary considerably in how they diagnose and manage thyroid nodules. Although large language models (LLMs) show promise in processing medical data, their effectiveness and reliability in standardizing the interpretation of thyroid nodule ultrasound text reports have not yet been thoroughly validated.

Objective:

To assess two LLMs, DeepSeek-R1 and ChatGPT-4o, in interpreting thyroid nodule ultrasound text reports, focusing on accuracy in benign-malignant differentiation, agreement in Chinese Thyroid Imaging Reporting and Data System (C-TIRADS) classification and management recommendation, and the stability of results for each task.

Methods:

We analyzed 1,063 ultrasound text reports from three medical centers, with 306 nodules confirmed by histopathology. Each nodule's report was processed through two LLMs using standardized prompts, repeated five times, with the final result determined by mode voting.
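The mode-voting step described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `mode_vote`, the example labels, and the tie-breaking rule (the earliest label among those tied for the highest count wins) are assumptions, since the abstract does not say how ties were resolved.

```python
from collections import Counter


def mode_vote(outputs):
    """Return the most frequent label among repeated LLM runs.

    Assumption: ties are broken in favor of the label that appears
    first in the run order; the source does not specify a tie rule.
    """
    if not outputs:
        raise ValueError("at least one run is required")
    counts = Counter(outputs)
    top = max(counts.values())
    # Scan in run order so the earliest label with the top count wins.
    for label in outputs:
        if counts[label] == top:
            return label


# Hypothetical C-TIRADS categories returned by five repeated runs.
runs = ["4A", "4B", "4A", "4A", "3"]
final_category = mode_vote(runs)
```

With five repeats and a binary task such as benign versus malignant, a tie cannot occur, so the tie rule only matters for multi-category outputs such as C-TIRADS classes.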

Results:

DeepSeek-R1 outperformed ChatGPT-4o in differentiating benign from malignant nodules, with higher sensitivity (0.879 vs. 0.692), accuracy (0.729 vs. 0.644), and area under the curve (AUC) (0.694 vs. 0.632). However, senior radiologists achieved notably higher accuracy (0.804) and AUC (0.865) than either LLM. In C-TIRADS classification, DeepSeek-R1 also outperformed ChatGPT-4o (κ=0.770 vs. κ=0.688, Δκ=0.083 [95% CI: 0.048, 0.122]). Both models showed substantial agreement with clinicians on management recommendation (κ=0.606 vs. κ=0.608, Δκ=-0.002 [95% CI: -0.044, 0.041]). In terms of stability, both LLMs exhibited almost perfect agreement in C-TIRADS classification (α=0.864 vs. α=0.866, Δα=-0.003 [95% CI: -0.023, 0.017]) and management recommendation (κ=0.853 vs. κ=0.849, Δκ=0.004 [95% CI: -0.026, 0.033]). In benign-malignant discrimination, however, DeepSeek-R1 was significantly more stable than ChatGPT-4o (κ=0.849 vs. κ=0.550, Δκ=0.260 [95% CI: 0.191, 0.321]).
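As a rough illustration of how an agreement statistic and a confidence interval for a difference such as Δκ can be obtained, the sketch below computes unweighted Cohen's κ and a paired percentile bootstrap interval for the difference between two models rated against the same reference. The abstract does not state which κ variant or interval method was used, so the unweighted κ, the percentile bootstrap, and the names `cohen_kappa` and `bootstrap_delta_kappa` are all assumptions made for illustration.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label lists."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the product of the marginal frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category; kappa is undefined,
        # and returning 1.0 here is a convention assumed for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def bootstrap_delta_kappa(ref, model_a, model_b, n_boot=2000, seed=0):
    """Percentile 95% CI for kappa(ref, model_a) - kappa(ref, model_b).

    Cases are resampled with replacement, and the same resampled cases
    are used for both models so that the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(ref)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = [ref[i] for i in idx]
        ka = cohen_kappa(r, [model_a[i] for i in idx])
        kb = cohen_kappa(r, [model_b[i] for i in idx])
        deltas.append(ka - kb)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]
```

An interval that excludes zero, like the one reported for benign-malignant stability, indicates a difference between the two models, whereas an interval that spans zero, like the one for management recommendation, does not.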

Conclusions:

Our study highlights the potential of LLMs for interpreting thyroid nodule ultrasound text reports. DeepSeek-R1 outperformed ChatGPT-4o in benign-malignant differentiation accuracy and classification consistency, whereas the two models performed similarly in management recommendation.

Citation

Please cite as:

Xie Y, Liu J, Zhan B, Zhang K, Li Y, Ning C

The Performance of ChatGPT-4o and DeepSeek-R1 in Interpreting Thyroid Nodule Ultrasound Text Reports: Multicenter Study

J Med Internet Res 2026;28:e93890

DOI: 10.2196/93890

PMID: 42520216

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Feb 21, 2026

Open Peer Review Period: Feb 23, 2026 - Apr 20, 2026

Date Accepted: Jun 30, 2026
