Currently accepted at: JMIR Formative Research
Date Submitted: Nov 5, 2025
Date Accepted: Apr 3, 2026
This paper has been accepted and is currently in production.
It will appear shortly under DOI 10.2196/87163.
Symptom-Only Localization of Brainstem Ischemia: Large Language Models vs. Neurologists in 109 Diffusion-Weighted Imaging–Positive Cases: A Retrospective Study
ABSTRACT
Background:
Localizing brainstem ischemic lesions based solely on neurological symptoms is challenging due to the region's complex anatomy and variable symptom presentation. Large language models (LLMs) are taking on an emerging role in medical diagnostics by identifying patterns within clinical narratives.
Objective:
This study evaluates the diagnostic accuracy of LLMs compared with that of experienced neurologists in localizing acute brainstem ischemia from clinical symptoms alone.
Methods:
We retrospectively analyzed 109 patients with diffusion-weighted imaging (DWI)-confirmed acute brainstem ischemia. Three neurologists and six LLMs (GPT-5, GPT-4, GPT-4.1, GPT-4o, o3, o3 pro) predicted lesion localization (midbrain, pons, medulla) and laterality (left/right) based on clinical symptoms alone. Accuracy, Cohen's κ, regional performance, and correlations with symptom count were assessed; pairwise chi-square (χ²) tests with false discovery rate (FDR) correction were performed to compare model performances.
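The pairwise comparison procedure described above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the per-rater correct/total counts are hypothetical placeholders, and the `fdr_bh` helper is a hand-rolled Benjamini-Hochberg step-up adjustment (one common FDR correction; the paper does not specify which variant was used).

```python
from itertools import combinations
from scipy.stats import chi2_contingency

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking a cumulative minimum
    # of p * m / rank to enforce monotonicity of the adjusted values.
    for offset, i in enumerate(reversed(order)):
        rank = m - offset
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min
    return adj

# Hypothetical correct/total counts per rater (illustrative only,
# NOT the study's actual per-rater data)
results = {
    "GPT-4": (61, 109),
    "GPT-4o": (61, 109),
    "Neurologist A": (38, 109),
}

pairs = list(combinations(results, 2))
pvals = []
for a, b in pairs:
    correct_a, n_a = results[a]
    correct_b, n_b = results[b]
    # 2x2 contingency table: correct vs. incorrect answers per rater
    table = [[correct_a, n_a - correct_a], [correct_b, n_b - correct_b]]
    _, p, _, _ = chi2_contingency(table)
    pvals.append(p)

p_adj = fdr_bh(pvals)
for (a, b), p_raw, p_corr in zip(pairs, pvals, p_adj):
    print(f"{a} vs {b}: p = {p_raw:.4f}, FDR-adjusted p = {p_corr:.4f}")
```

With these placeholder counts, the two identically scoring raters show no significant difference after correction, while each differs significantly from the lower-scoring rater.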
Results:
GPT-4 and GPT-4o achieved the highest overall accuracy (56.0%, 95% CI 46.1–65.5), significantly outperforming all neurologists (χ² = 7.4–20.1, p < 0.01) and reasoning-based models. No significant differences were observed among GPT-4, GPT-4o, GPT-4.1, and GPT-5 (p > 0.05). In regional analysis, significant effects were restricted to pontine infarcts, where GPT-4 (74%) and GPT-4o (69%) exceeded all neurologists (χ² = 6.4–18.3, p < 0.01). For mesencephalic and medullary lesions, accuracies did not differ significantly (p > 0.05). o3 pro performed worst overall (10%, p < 0.001). Cohen's κ reached 0.29 for GPT-4o, and accuracy correlated with symptom count (r = 0.28, p < 0.01).
Conclusions:
GPT-4 and GPT-4o outperformed experienced neurologists in this constrained diagnostic task. Accuracy remained modest, particularly for non-pontine lesions, and reasoning-augmented models provided no additional benefit. These findings highlight both the potential and the current limitations of LLMs in clinical reasoning, reinforcing the need for multimodal input and prospective validation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.