Evaluation of GPT-5 for Esophageal Cancer Staging Using Fluorodeoxyglucose Positron Emission Tomography Maximum-Intensity Projection Images: A Comparative Pilot Study
ABSTRACT
Background:
Accurate esophageal cancer staging relies on fluorodeoxyglucose positron emission tomography (FDG-PET), but its interpretation is complex and time-intensive. This diagnostic burden is exacerbated by significant workforce shortages in both radiology and surgery, creating a need for automated support systems. The emergence of advanced large language models (LLMs) has raised expectations for their potential to fulfill this role in complex medical tasks.
Objective:
We evaluated the diagnostic accuracy of LLMs for staging esophageal cancer on FDG-PET images, focusing on their ability to assess lymph node (LN) involvement (cN) and distant metastases (cM) for automated radiology reporting.
Methods:
This retrospective study included 120 consecutive adult patients who were diagnosed with esophageal squamous cell carcinoma (SCC) and underwent FDG-PET/computed tomography at Tohoku University Hospital between January 2019 and December 2021. Patients with prior treatment, non-SCC histology, or blood glucose levels ≥ 200 mg/dL were excluded. Frontal maximum-intensity projection PET images were extracted, standardized, and analyzed together with information on tumor location. Six LLMs (ChatGPT-5, ChatGPT-4.5, ChatGPT-4.1, OpenAI o3, o1, and ChatGPT-4 Turbo) and four blinded human evaluators (a nuclear medicine specialist, a gastrointestinal surgeon, and two radiology residents) assessed the presence of thoracic and abdominal LN metastases and determined cN and cM staging. The model analyses were performed through the application programming interface in a zero-shot setting. Diagnostic agreement and accuracy were evaluated using Cohen’s kappa, Cochran’s Q test, and post-hoc McNemar tests with Holm–Bonferroni correction; significance was set at P < 0.05.
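The statistical steps named above can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names and inputs are hypothetical, Cochran's Q is omitted for brevity, and the McNemar test is shown in its exact (binomial) form, which is an assumption because the abstract does not state which variant was used.

```python
from math import comb

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_exp == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_obs - p_exp) / (1 - p_exp)

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar P value from paired correct/incorrect flags."""
    # Only discordant pairs (one reader right, the other wrong) carry information.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure: one reject flag per hypothesis, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest P value is tested at alpha/m, the next at alpha/(m-1), ...
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger P values are retained
    return reject
```

In a setting like this one, each `correct_*` list would hold one flag per patient for a given reader or model, and the pairwise McNemar P values would be passed together to `holm_bonferroni`.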
Results:
The average accuracy was 34–78% for LLMs and 60–85% for physicians, with physicians significantly more accurate in thoracic LN assessment, abdominal LN assessment, and cN staging. Among the LLMs, GPT-5 demonstrated the highest overall accuracy. Newer LLMs approached physician-level performance in identifying abdominal LN metastases and in cM staging but were less consistent in cN staging.
Conclusions:
Although current LLMs have not yet reached physician-level accuracy in comprehensive staging, recent models show promise in assisting with specific diagnostic tasks.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.