JMIR Preprints #91973: Evaluating the Performance of Large Language Models in Vascular Surgery: A Case Series

Evaluating the Performance of Large Language Models in Vascular Surgery: A Case Series

Asanka Wijetunga;
Yunyi Wang;
Chantell Yaghi;
Mauro Vicaretti

ABSTRACT

Background:

Large language models (LLMs) such as ChatGPT, Gemini and Claude are increasingly used by clinicians, yet their accuracy, safety and consistency in clinical cases remain poorly defined. Most studies assess LLMs using multiple-choice questions rather than free-response reasoning.

Objective:

This study aims to evaluate the performance and safety of three widely used LLMs in realistic vascular surgery scenarios, to assess their fitness for use in clinical practice, and to identify the barriers to their widespread integration.

Methods:

Forty-two fictitious vascular cases spanning five major pathologies, namely acute limb ischaemia (ALI), aortic disease (Ao), chronic limb ischaemia (CLI), diabetic foot infection (DFI) and extracranial cerebrovascular disease (ECD), were independently entered into ChatGPT-5, Gemini 2.5 and Claude Sonnet 4.5 using standardised prompts. Each model answered structured questions covering diagnosis, investigation and management (both operative and non-operative). A panel scored each response out of 20 using a predefined rubric. The overall safety of each plan was assessed separately. Comparative analyses used t-tests, ANOVA and multivariable logistic regression.

Results:

Mean composite scores by model were 86.5% (ChatGPT), 83.5% (Gemini), and 88.0% (Claude), whilst scores by disease were 82.7% (ALI), 83.8% (Ao), 85.9% (CLI), 86.1% (DFI) and 80.7% (ECD) (p=non-significant). Unsafe plans occurred in 11.9% (ChatGPT), 23.8% (Gemini) and 7.1% (Claude). On multivariable analysis, independent predictors of unsafe outputs were lower composite score (OR 0.47, p=0.001), higher word count (OR 1.003, p=0.001) and ALI (OR 20.9, p<0.001).
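
To put the reported percentages and odds ratios on a more concrete footing, the following is a minimal arithmetic sketch and not the authors' analysis. It assumes that each model answered all 42 cases, and that the word-count odds ratio of 1.003 is expressed per additional word; neither is stated explicitly in the abstract. The function names are hypothetical.

```python
# Illustrative arithmetic only; not the authors' code.
# Assumption: each model answered all 42 cases.
N_CASES = 42

# Unsafe-plan percentages as reported in the Results.
unsafe_pct = {"ChatGPT": 11.9, "Gemini": 23.8, "Claude": 7.1}


def implied_count(pct, n=N_CASES):
    """Nearest whole number of plans consistent with a reported percentage."""
    return round(pct / 100 * n)


unsafe_counts = {model: implied_count(p) for model, p in unsafe_pct.items()}


def scale_odds_ratio(or_per_unit, units):
    """Odds ratio for a change of `units`, assuming a log-linear effect."""
    return or_per_unit ** units


# Assumption: OR 1.003 is per additional word of model output.
or_per_100_words = scale_odds_ratio(1.003, 100)

print(unsafe_counts)
print(round(or_per_100_words, 2))
```

Under these assumptions, the percentages correspond to 5, 10 and 3 unsafe plans out of 42, and an odds ratio that looks negligible per word compounds to roughly 1.35 per additional 100 words.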

Conclusions:

Our findings demonstrate LLMs’ promise in managing routine vascular surgery cases. However, their inconsistent safety profiles and ethical limitations preclude unsupervised clinical use. Rigorous specialty-specific validation is essential before they may be integrated into routine practice.

Citation

Please cite as:

Wijetunga A, Wang Y, Yaghi C, Vicaretti M

Evaluating the Performance of Large Language Models in Vascular Surgery: A Case Series

JMIR Perioperative Medicine. 08/03/2026:91973 (forthcoming/in press)

DOI: 10.2196/91973

URL: https://preprints.jmir.org/preprint/91973

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently accepted at: JMIR Perioperative Medicine

Date Submitted: Jan 22, 2026

Open Peer Review Period: Jan 26, 2026 - Feb 20, 2026

Date Accepted: Mar 8, 2026
