Currently submitted to: JMIR AI
Date Submitted: Jun 14, 2026
Open Peer Review Period: Jun 30, 2026 - Aug 25, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Feasibility of Large Language Models for Neurosurgical Approach Planning: A Multimodal, Multi-Model Evaluation
ABSTRACT
Background:
Neurosurgical approach planning is a cognitively demanding task with substantial variability between experts. Large language models (LLMs) have shown promise on knowledge-based neurosurgical tasks such as board-style questions, but their ability to generate clinically acceptable surgical approach recommendations—particularly using multimodal (text and imaging) input—has not been systematically evaluated.
Objective:
To evaluate the feasibility and clinical acceptability of LLMs as decision-support tools for neurosurgical approach planning, and to examine whether LLM-generated recommendations influence the surgical decision-making of neurosurgeons at different training levels.
Methods:
Sixty intracranial tumor cases spanning five categories (extra-axial deep, extra-axial superficial, intra-axial deep, intra-axial superficial, and pituitary) were curated from a publicly available, de-identified MRI dataset; one case was excluded for a technical error, leaving 59 for analysis. Each case included a synthetic structured clinical summary. Three commercial multimodal LLMs—GPT-5 (LLM-1), Gemini 3 Pro (LLM-2), and Claude Sonnet 4.5 (LLM-3)—were queried under two input conditions presented sequentially within a single conversation: a text-only prompt followed by an MRI-inclusive prompt. Twenty-two neurosurgeons (11 trainees, 11 specialists) independently rated each blinded recommendation on a 5-point Likert scale across four domains (appropriateness, safety, feasibility, clarity) and indicated whether any recommendation influenced their own approach selection. Concordance (appropriateness ≥4), inter-model and modality differences, tumor-category effects, decision influence, and inter-rater reliability (Krippendorff's α) were analyzed.
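As an illustration of the concordance definition and the paired modality comparison described above, the following is a minimal Python sketch, not the authors' analysis code. It assumes concordance means an appropriateness rating of 4 or higher, as stated, and that the text-only and MRI-inclusive ratings are paired per evaluation. The function names and the example scores are hypothetical.

```python
from math import comb


def concordance_rate(scores, threshold=4):
    """Fraction of appropriateness ratings at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= threshold for s in scores) / len(scores)


def discordant_counts(text_scores, mri_scores, threshold=4):
    """Count paired ratings where only one input condition was concordant.

    Returns (b, c): b = text concordant and MRI not, c = MRI concordant and text not.
    """
    if len(text_scores) != len(mri_scores):
        raise ValueError("paired score lists must have equal length")
    pairs = list(zip(text_scores, mri_scores))
    b = sum(t >= threshold and m < threshold for t, m in pairs)
    c = sum(t < threshold and m >= threshold for t, m in pairs)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair falls in either cell
    with probability 0.5, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired ratings (illustrative only, not study data).
text_scores = [5, 4, 4, 3, 5]
mri_scores = [3, 4, 2, 4, 2]
b, c = discordant_counts(text_scores, mri_scores)
p_value = mcnemar_exact(b, c)
```

Only the discordant pairs enter the test, which is why two conditions with similar overall concordance rates can still differ significantly when the disagreements run mostly in one direction.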
Results:
A total of 1,296 Likert-scale evaluations were collected. Overall concordance was 76.7% for text-based and 72.4% for MRI-based prompts, both exceeding the predefined 70% threshold; text-based prompts were significantly superior (McNemar exact p=0.003), driven primarily by LLM-3. LLM-2 achieved the highest overall performance (mean 3.97, 78.2% acceptability), with significant inter-model differences (Friedman: appropriateness p=0.008, feasibility p=0.003, clarity p<0.001). Performance varied markedly by tumor category (Kruskal–Wallis p<0.0001): pituitary tumors were highest (87.4% acceptability) and intra-axial superficial tumors lowest (56.5%). Specialists rated recommendations significantly higher than trainees across all domains (all p<0.001), yet trainees were nearly twice as likely to be influenced by them (26.6% vs. 15.0%; Fisher exact p=0.044). Inter-rater agreement was modest overall (α=0.270) but substantially higher among specialists (α=0.439) than trainees (α=0.173).
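For readers unfamiliar with the reliability statistic reported above, Krippendorff's α is defined as 1 - D_o/D_e, where D_o is the disagreement observed among ratings of the same item and D_e is the disagreement expected if all ratings were pooled. The sketch below is a minimal Python illustration under an assumption: it uses the interval (squared-difference) distance, whereas the abstract does not state which distance metric the authors applied to the Likert ratings. The function name and example data are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    units: list of lists, one inner list per rated item, holding the
    ratings that item received. None marks a missing rating.
    ASSUMPTION: interval distance; an ordinal metric may suit Likert data better.
    """
    cleaned = [[v for v in unit if v is not None] for unit in units]
    # Items with fewer than two ratings carry no agreement information.
    pairable = [u for u in cleaned if len(u) >= 2]
    pooled = [v for u in pairable for v in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: ordered rating pairs within each item,
    # weighted by 1 / (ratings in the item - 1).
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: ordered pairs over all pooled ratings.
    expected = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


# Hypothetical ratings (illustrative only, not study data).
perfect = krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]])
systematic_split = krippendorff_alpha_interval([[1, 2], [1, 2]])
```

An α of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.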
Conclusions:
Current commercial LLMs can generate clinically acceptable neurosurgical approach recommendations for approximately three-quarters of cases, performing best for standardized pathologies such as pituitary tumors and worst for functionally complex lesions. Performance varies significantly by model, input modality, and tumor complexity, and text-based prompts outperform MRI-inclusive ones. Although not yet suitable as autonomous decision-support tools, LLMs disproportionately influence trainees and show promise as educational adjuncts that complement—rather than replace—expert surgical judgment.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.