JMIR Preprints #104656: Feasibility of Large Language Models for Neurosurgical Approach Planning: A Multimodal, Multi-Model Evaluation

Feasibility of Large Language Models for Neurosurgical Approach Planning: A Multimodal, Multi-Model Evaluation

Hao Tseng;
Huan-Chih Wang

ABSTRACT

Background:

Neurosurgical approach planning is a cognitively demanding task with substantial variability between experts. Large language models (LLMs) have shown promise on knowledge-based neurosurgical tasks such as board-style questions, but their ability to generate clinically acceptable surgical approach recommendations—particularly using multimodal (text and imaging) input—has not been systematically evaluated.

Objective:

To evaluate the feasibility and clinical acceptability of large language models (LLMs) as decision-support tools for neurosurgical approach planning, and to examine whether LLM-generated recommendations influence the surgical decision-making of neurosurgeons at different training levels.

Methods:

Sixty intracranial tumor cases spanning five categories (extra-axial deep, extra-axial superficial, intra-axial deep, intra-axial superficial, and pituitary) were curated from a publicly available, de-identified MRI dataset; one case was excluded for a technical error, leaving 59 for analysis. Each case included a synthetic structured clinical summary. Three commercial multimodal LLMs—GPT-5 (LLM-1), Gemini 3 Pro (LLM-2), and Claude Sonnet 4.5 (LLM-3)—were queried under two input conditions presented sequentially within a single conversation: a text-only prompt followed by an MRI-inclusive prompt. Twenty-two neurosurgeons (11 trainees, 11 specialists) independently rated each blinded recommendation on a 5-point Likert scale across four domains (appropriateness, safety, feasibility, clarity) and indicated whether any recommendation influenced their own approach selection. Concordance (appropriateness ≥4), inter-model and modality differences, tumor-category effects, decision influence, and inter-rater reliability (Krippendorff's α) were analyzed.
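The concordance metric defined above (appropriateness rating of 4 or higher) can be illustrated with a minimal sketch. The function name, the constants, and the example ratings are hypothetical and are not taken from the study's data or analysis code. The 70% target is the predefined threshold mentioned in the Results.

```python
from typing import Iterable

ACCEPTABLE_MIN = 4   # appropriateness >= 4 counts as concordant (per the Methods)
TARGET_RATE = 0.70   # predefined threshold cited in the Results


def concordance_rate(appropriateness: Iterable[int]) -> float:
    """Return the fraction of 5-point Likert ratings that are >= ACCEPTABLE_MIN."""
    ratings = list(appropriateness)
    if not ratings:
        raise ValueError("no ratings supplied")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 Likert scale")
    return sum(r >= ACCEPTABLE_MIN for r in ratings) / len(ratings)


# Hypothetical ratings for illustration only (not study data).
example_ratings = [5, 4, 3, 4, 2, 5, 4, 4, 3, 5]
rate = concordance_rate(example_ratings)
print(f"concordance = {rate:.1%}, meets target: {rate >= TARGET_RATE}")
```

Dichotomizing at 4 makes the rate easy to interpret, at the cost of discarding the distinction between ratings of 4 and 5.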

Results:

A total of 1,296 Likert-scale evaluations were collected. Overall concordance was 76.7% for text-based and 72.4% for MRI-based prompts, both exceeding the predefined 70% threshold; text-based prompts were significantly superior (McNemar exact p=0.003), driven primarily by LLM-3. LLM-2 achieved the highest overall performance (mean 3.97, 78.2% acceptability), with significant inter-model differences (Friedman: appropriateness p=0.008, feasibility p=0.003, clarity p<0.001). Performance varied markedly by tumor category (Kruskal–Wallis p<0.0001): pituitary tumors were highest (87.4% acceptability) and intra-axial superficial tumors lowest (56.5%). Specialists rated recommendations significantly higher than trainees across all domains (all p<0.001), yet trainees were nearly twice as likely to be influenced by them (26.6% vs. 15.0%; Fisher exact p=0.044). Inter-rater agreement was modest overall (α=0.270) but substantially higher among specialists (α=0.439) than trainees (α=0.173).
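For readers unfamiliar with the exact McNemar test used for the text versus MRI comparison, the following is a minimal standard-library sketch. It assumes each paired evaluation has been reduced to a concordant/non-concordant outcome under both input conditions. The function name and the discordant-pair counts in the example are hypothetical; the abstract does not report the underlying counts, and this is not the authors' analysis code.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs concordant under text input only (hypothetical label)
    c: pairs concordant under MRI input only (hypothetical label)
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the tail for a two-sided test and cap the result at 1.
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for illustration only (not study data).
p_value = mcnemar_exact(10, 2)
print(f"exact McNemar p = {p_value:.4f}")
```

Only discordant pairs enter the test, so a large number of evaluations that agree under both conditions does not by itself affect the p-value.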

Conclusions:

Current commercial LLMs can generate clinically acceptable neurosurgical approach recommendations for approximately three-quarters of cases, performing best for standardized pathologies such as pituitary tumors and worst for functionally complex lesions. Performance varies significantly by model, input modality, and tumor complexity, and text-based prompts outperform MRI-inclusive ones. Although not yet suitable as autonomous decision-support tools, LLMs disproportionately influence trainees and show promise as educational adjuncts that complement—rather than replace—expert surgical judgment.

Citation

Please cite as:

Tseng H, Wang HC

Feasibility of Large Language Models for Neurosurgical Approach Planning: A Multimodal, Multi-Model Evaluation

JMIR Preprints. 14/06/2026:104656

DOI: 10.2196/preprints.104656

URL: https://preprints.jmir.org/preprint/104656

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Currently submitted to: JMIR AI

Date Submitted: Jun 14, 2026

Open Peer Review Period: Jun 30, 2026 - Aug 25, 2026

(currently open for review)