JMIR Preprints #94458: Neurosymbolic Multi-LLM Reasoning Versus Board-Certified Specialists and Uni-LLM Architectures in Septic Arthritis Assessment: A Prospective Multi-Specialty Benchmarking Study with Item-Level and Question-Subtype Analysis

Neurosymbolic Multi-LLM Reasoning Versus Board-Certified Specialists and Uni-LLM Architectures in Septic Arthritis Assessment: A Prospective Multi-Specialty Benchmarking Study with Item-Level and Question-Subtype Analysis

Mete Ucdal;
Evren Ekingen;
Ali Niyazi Kurtcebe

ABSTRACT

Background:

Septic arthritis is a rheumatologic emergency that requires prompt, accurate diagnosis across several medical specialties. Whether neurosymbolic multi-LLM architectures, which integrate neural language models with formal knowledge-graph reasoning, can match board-certified specialists and outperform single-model (uni-LLM) approaches on clinical vignettes of septic arthritis remains unclear.

Objective:

This study aimed to evaluate the diagnostic reasoning performance of SepticJoint-Reason, a multi-layer neurosymbolic pipeline, compared with board-certified specialists and constituent uni-LLMs on American Board–style septic arthritis questions, with emphasis on hallucination elimination and the neurosymbolic advantage over both human experts and standalone AI models.

Methods:

We developed SepticJoint-Reason, a five-stage neurosymbolic pipeline integrating Claude Opus 4.5 (Anthropic), GPT-4.1 (OpenAI), and Gemini 2.5 Pro (Google DeepMind) with a Neo4j musculoskeletal infection ontology (52,418 nodes; 203,672 edges), Lean 4–style proof-trace generation, adaptive compute allocation, and hallucination blocking. We benchmarked SepticJoint-Reason against 30 board-certified specialists (10 rheumatologists, 10 orthopedic surgeons, 10 infectious disease physicians) and against each constituent uni-LLM on 30 American Board–style septic arthritis questions across six clinical subtypes: etiology and risk factors, clinical presentation and diagnosis, synovial fluid analysis, microbiology and laboratory, management and antibiotic therapy, and complications and prognosis. Analyses included non-inferiority testing (δ = 5%), Fleiss’ κ (30 raters), Cohen’s κ (435 pairs), inter-specialty ANOVA, item-level concordance, question-subtype analysis, neurosymbolic versus uni-LLM comparisons, error typology with Fisher’s exact tests, ablation with McNemar’s tests, and counterfactual robustness analysis.
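Two of the agreement statistics named above can be illustrated with a minimal sketch. It assumes Fleiss' κ was computed over the answer options chosen by the 30 raters for each question, which the abstract does not state. The function name `fleiss_kappa` and the toy count matrix are hypothetical and are not the study data. The pair count for Cohen's κ follows directly from 30 raters.

```python
from math import comb


def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][j] is the number of
    raters who assigned item i to category j (same rater total per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Overall proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    # Undefined (division by zero) if all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)


# Number of distinct rater pairs among 30 raters, as used for Cohen's kappa.
print(comb(30, 2))  # 435

# Hypothetical toy data: 4 items, 30 raters, 4 answer options (A-D).
toy_counts = [
    [30, 0, 0, 0],
    [20, 5, 3, 2],
    [10, 10, 5, 5],
    [0, 2, 28, 0],
]
print(round(fleiss_kappa(toy_counts), 3))
```

With the real 30 × 30 response matrix, the same function would take one row per question and one column per answer option.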

Results:

SepticJoint-Reason correctly answered 27 of 30 questions (90.0%; 95% CI, 73.5–97.9). The pooled specialist panel achieved a mean accuracy of 76.8% (95% CI, 74.0–79.4), with individual scores ranging from 63.3% to 90.0%. The pipeline met the non-inferiority threshold and demonstrated statistical superiority (difference, +13.2 percentage points; 95% CI, 7.1–19.3; P<0.001). Uni-LLM accuracies were: Claude Opus 4.5, 73.3%; GPT-4.1, 70.0%; Gemini 2.5 Pro, 66.7%; all three were significantly inferior to the neurosymbolic pipeline (all McNemar P<0.05). Fleiss’ κ was 0.38 (fair-to-moderate). Question-subtype analysis revealed the pipeline’s greatest advantage on management and antibiotic therapy items (100% vs. 72.0%; P = 0.009) and microbiology questions (100% vs. 70.7%; P = 0.014). Inter-specialty ANOVA showed significant group differences (F = 8.94; P<0.001), with infectious disease specialists achieving the highest accuracy (81.3%). Ablation confirmed knowledge-graph verification as the dominant accuracy driver (+10.0%; McNemar P = 0.004).
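The interval reported for 27 of 30 correct answers is consistent with an exact (Clopper-Pearson) binomial interval, although the abstract does not name the method. The sketch below, under that assumption, recomputes it with the standard library only. It also shows one common way to read a non-inferiority margin from the lower confidence bound of a difference. The function names and the decision rule are illustrative and are not taken from the manuscript.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    # Lower bound: p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound: p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2
    )
    return lower, upper


def noninferiority_verdict(diff_ci_lower, margin):
    """Classify a difference (pipeline minus comparator, percentage points)
    from the lower bound of its confidence interval. Illustrative rule only."""
    if diff_ci_lower > 0:
        return "superior"
    if diff_ci_lower > -margin:
        return "non-inferior"
    return "inconclusive"


low, high = clopper_pearson(27, 30)
print(f"{100 * low:.1f}-{100 * high:.1f}")   # close to the reported 73.5-97.9
print(noninferiority_verdict(7.1, 5.0))      # reported lower bound, delta = 5
```

Under this reading, a lower bound of 7.1 percentage points clears both the −5 point margin and zero, which matches the abstract's statement that the pipeline met non-inferiority and showed superiority.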

Conclusions:

A neurosymbolic multi-LLM reasoning pipeline significantly outperformed both board-certified specialists and constituent uni-LLMs on American Board–style septic arthritis questions. Knowledge-graph–grounded verification and multi-model consensus were the primary drivers of the neurosymbolic advantage, particularly on complex management and microbiological reasoning items.

Citation

Please cite as:

Ucdal M, Ekingen E, Kurtcebe AN

Neurosymbolic Multi-LLM Reasoning Versus Board-Certified Specialists and Uni-LLM Architectures in Septic Arthritis Assessment: A Prospective Multi-Specialty Benchmarking Study with Item-Level and Question-Subtype Analysis

JMIR Preprints. 02/03/2026:94458

DOI: 10.2196/preprints.94458

URL: https://preprints.jmir.org/preprint/94458

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Currently submitted to: JMIR AI

Date Submitted: Mar 2, 2026

Open Peer Review Period: Mar 10, 2026 - May 5, 2026

(currently open for review)
