Currently submitted to: JMIR AI

Date Submitted: Jun 14, 2026
Open Peer Review Period: Jun 30, 2026 - Aug 25, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Feasibility of Large Language Models for Neurosurgical Approach Planning: A Multimodal, Multi-Model Evaluation

  • Hao Tseng
  • Huan-Chih Wang

ABSTRACT

Background:

Neurosurgical approach planning is a cognitively demanding task with substantial variability between experts. Large language models (LLMs) have shown promise on knowledge-based neurosurgical tasks such as board-style questions, but their ability to generate clinically acceptable surgical approach recommendations—particularly using multimodal (text and imaging) input—has not been systematically evaluated.

Objective:

To evaluate the feasibility and clinical acceptability of large language models (LLMs) as decision-support tools for neurosurgical approach planning, and to examine whether LLM-generated recommendations influence the surgical decision-making of neurosurgeons at different training levels.

Methods:

Sixty intracranial tumor cases spanning five categories (extra-axial deep, extra-axial superficial, intra-axial deep, intra-axial superficial, and pituitary) were curated from a publicly available, de-identified MRI dataset; one case was excluded for a technical error, leaving 59 for analysis. Each case included a synthetic structured clinical summary. Three commercial multimodal LLMs—GPT-5 (LLM-1), Gemini 3 Pro (LLM-2), and Claude Sonnet 4.5 (LLM-3)—were queried under two input conditions presented sequentially within a single conversation: a text-only prompt followed by an MRI-inclusive prompt. Twenty-two neurosurgeons (11 trainees, 11 specialists) independently rated each blinded recommendation on a 5-point Likert scale across four domains (appropriateness, safety, feasibility, clarity) and indicated whether any recommendation influenced their own approach selection. Concordance (appropriateness ≥4), inter-model and modality differences, tumor-category effects, decision influence, and inter-rater reliability (Krippendorff's α) were analyzed.
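The concordance measure and the paired text-versus-MRI comparison described above can be illustrated with a minimal sketch. It assumes each recommendation is dichotomized at an appropriateness rating of 4 or higher and that the two input conditions are paired rating by rating; the abstract does not state the exact pairing unit, so this is an assumption. The function names and the ratings below are hypothetical and are not study data.

```python
from math import comb


def concordance_rate(ratings, threshold=4):
    """Share of Likert ratings at or above the acceptability threshold."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(r >= threshold for r in ratings) / len(ratings)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: pairs acceptable under text-only but not under MRI-inclusive input
    c: pairs acceptable under MRI-inclusive but not under text-only input
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired appropriateness ratings (illustration only)
text_ratings = [5, 4, 4, 3, 5, 4, 2, 4]
mri_ratings = [4, 3, 4, 3, 5, 2, 2, 3]

# Count discordant pairs after dichotomizing at >= 4
b = sum(t >= 4 and m < 4 for t, m in zip(text_ratings, mri_ratings))
c = sum(t < 4 and m >= 4 for t, m in zip(text_ratings, mri_ratings))

print(concordance_rate(text_ratings))  # 0.75
print(concordance_rate(mri_ratings))   # 0.375
print(mcnemar_exact(b, c))             # 0.25
```

Only the discordant pairs enter the test, which is why two conditions with similar overall concordance can still differ significantly when most disagreements point the same way.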

Results:

A total of 1,296 Likert-scale evaluations were collected. Overall concordance was 76.7% for text-based and 72.4% for MRI-based prompts, both exceeding the predefined 70% threshold; text-based prompts were significantly superior (McNemar exact p=0.003), driven primarily by LLM-3. LLM-2 achieved the highest overall performance (mean 3.97, 78.2% acceptability), with significant inter-model differences (Friedman: appropriateness p=0.008, feasibility p=0.003, clarity p<0.001). Performance varied markedly by tumor category (Kruskal–Wallis p<0.0001): pituitary tumors were highest (87.4% acceptability) and intra-axial superficial tumors lowest (56.5%). Specialists rated recommendations significantly higher than trainees across all domains (all p<0.001), yet trainees were nearly twice as likely to be influenced by them (26.6% vs. 15.0%; Fisher exact p=0.044). Inter-rater agreement was modest overall (α=0.270) but substantially higher among specialists (α=0.439) than trainees (α=0.173).

Conclusions:

Current commercial LLMs can generate clinically acceptable neurosurgical approach recommendations for approximately three-quarters of cases, performing best for standardized pathologies such as pituitary tumors and worst for functionally complex lesions. Performance varies significantly by model, input modality, and tumor complexity, and text-based prompts outperform MRI-inclusive ones. Although not yet suitable as autonomous decision-support tools, LLMs disproportionately influence trainees and show promise as educational adjuncts that complement—rather than replace—expert surgical judgment.


 Citation

Please cite as:

Tseng H, Wang HC

Feasibility of Large Language Models for Neurosurgical Approach Planning: A Multimodal, Multi-Model Evaluation

JMIR Preprints. 14/06/2026:104656

DOI: 10.2196/preprints.104656

URL: https://preprints.jmir.org/preprint/104656


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.