Accepted for/Published in: JMIR Cancer

Date Submitted: Oct 27, 2025
Date Accepted: Jan 30, 2026

The final, peer-reviewed published version of this preprint can be found here:

Maruyama H, Toyama Y, Araki Y, Takanami K, Ito M, Nakajima Y, Takase K, Kamei T

Evaluation of GPT-5 for Esophageal Cancer Staging Using Fluorodeoxyglucose Positron Emission Tomography Maximum-Intensity Projection Images: Comparative Pilot Study

JMIR Cancer 2026;12:e86630

DOI: 10.2196/86630

PMID: 41729569

PMCID: 12972682

Evaluation of GPT-5 for Esophageal Cancer Staging Using Fluorodeoxyglucose Positron Emission Tomography Maximum-Intensity Projection Images: A Comparative Pilot Study

  • Hiroki Maruyama
  • Yoshitaka Toyama
  • Yuya Araki
  • Kentaro Takanami
  • Masato Ito
  • Yumi Nakajima
  • Kei Takase
  • Takashi Kamei

ABSTRACT

Background:

Accurate esophageal cancer staging relies on fluorodeoxyglucose positron emission tomography (FDG-PET), but its interpretation is complex and time-intensive. This diagnostic burden is exacerbated by significant workforce shortages in both radiology and surgery, creating a need for automated support systems. The emergence of advanced large language models (LLMs) has raised expectations for their potential to fulfill this role in complex medical tasks.

Objective:

We evaluated the diagnostic accuracy of LLMs for staging esophageal cancer from FDG-PET images, focusing on their ability to assess lymph node (LN) involvement (cN) and distant metastasis (cM) in support of automated radiology reporting.

Methods:

This retrospective study included 120 consecutive adult patients who were diagnosed with esophageal squamous cell carcinoma (SCC) and underwent FDG-PET/computed tomography at Tohoku University Hospital between January 2019 and December 2021. Patients with prior treatment, non-SCC histology, or blood glucose levels ≥200 mg/dL were excluded. Frontal maximum-intensity projection PET images were extracted, standardized, and analyzed together with information on tumor location. Six LLMs (ChatGPT-5, ChatGPT-4.5, ChatGPT-4.1, OpenAI o3, o1, and ChatGPT-4 Turbo) and four blinded human evaluators (a nuclear medicine specialist, a gastrointestinal surgeon, and two radiology residents) assessed the presence of thoracic and abdominal LN metastases and determined cN and cM staging. The model analyses were performed through the application programming interface in a zero-shot setting. Diagnostic agreement and accuracy were evaluated using Cohen’s kappa, Cochran’s Q test, and post-hoc McNemar tests with Holm–Bonferroni correction; statistical significance was set at P < .05.
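As an illustration of the statistics named above, the following is a minimal Python sketch of Cohen's kappa, an exact (binomial) McNemar test, and the Holm-Bonferroni adjustment, using only the standard library. It is not the authors' analysis code: the function names, the choice of the exact rather than the chi-square form of McNemar's test, and the toy data are all assumptions made for illustration, and Cochran's Q is omitted for brevity.

```python
from math import comb


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correct/incorrect flags."""
    # Only discordant pairs (one reader right, the other wrong) carry information.
    b = sum(x and not y for x, y in zip(correct_a, correct_b))
    c = sum(y and not x for x, y in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k) and enforce monotonicity.
        running = max(running, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running)
    return adjusted


if __name__ == "__main__":
    # Hypothetical toy data, NOT taken from the study.
    reference = ["N0", "N1", "N1", "N0", "N2", "N1"]
    model = ["N0", "N1", "N0", "N0", "N2", "N2"]
    reader = ["N0", "N1", "N1", "N0", "N2", "N1"]

    print("kappa (model vs reference):", cohen_kappa(model, reference))

    model_ok = [m == r for m, r in zip(model, reference)]
    reader_ok = [p == r for p, r in zip(reader, reference)]
    raw = [mcnemar_exact(model_ok, reader_ok)]
    print("Holm-adjusted p-values:", holm(raw))
```

In practice, each LLM-versus-physician comparison would contribute one McNemar p-value, and the full list would be passed to `holm` before comparing against the significance threshold.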

Results:

Average accuracy ranged from 34% to 78% for the LLMs and from 60% to 85% for the physicians, with physicians significantly more accurate in thoracic LN assessment, abdominal LN assessment, and cN staging. Among the LLMs, GPT-5 demonstrated the highest overall accuracy, and newer LLMs approached physician-level performance in identifying abdominal LN metastases and in cM staging, although they were less consistent in cN staging.

Conclusions:

Although current LLMs have not yet reached physician-level accuracy in comprehensive staging, recent models show promise in assisting with specific diagnostic tasks.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.