Currently submitted to: JMIR AI
Date Submitted: Mar 2, 2026
Open Peer Review Period: Mar 10, 2026 - May 5, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Neurosymbolic Multi-LLM Reasoning Versus Board-Certified Specialists and Uni-LLM Architectures in Septic Arthritis Assessment: A Prospective Multi-Specialty Benchmarking Study with Item-Level and Question-Subtype Analysis
ABSTRACT
Background:
Septic arthritis is a rheumatologic emergency that requires prompt and precise diagnosis across multiple medical specialties. Whether neurosymbolic multi-LLM architectures, which integrate neural language models with formal knowledge-graph reasoning, can match board-certified specialists and outperform single-model (uni-LLM) approaches on clinical vignettes of septic arthritis remains unclear.
Objective:
This study aimed to evaluate the diagnostic reasoning performance of SepticJoint-Reason, a multi-layer neurosymbolic pipeline, compared with board-certified specialists and constituent uni-LLMs on American Board–style septic arthritis questions, with emphasis on hallucination elimination and the neurosymbolic advantage over both human experts and standalone AI models.
Methods:
We developed SepticJoint-Reason, a five-stage neurosymbolic pipeline integrating Claude Opus 4.5 (Anthropic), GPT-4.1 (OpenAI), and Gemini 2.5 Pro (Google DeepMind) with a Neo4j musculoskeletal infection ontology (52,418 nodes; 203,672 edges), Lean 4–style proof-trace generation, adaptive compute allocation, and hallucination blocking. We benchmarked SepticJoint-Reason against 30 board-certified specialists (10 rheumatologists, 10 orthopedic surgeons, 10 infectious disease physicians) and against each constituent uni-LLM on 30 American Board–style septic arthritis questions across six clinical subtypes: etiology and risk factors, clinical presentation and diagnosis, synovial fluid analysis, microbiology and laboratory, management and antibiotic therapy, and complications and prognosis. Analyses included non-inferiority testing (δ = 5%), Fleiss’ κ (30 raters), Cohen’s κ (435 pairs), inter-specialty ANOVA, item-level concordance, question-subtype analysis, neurosymbolic versus uni-LLM comparisons, error typology with Fisher’s exact tests, ablation with McNemar’s tests, and counterfactual robustness analysis.
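As an illustration of the agreement statistics listed above, the following is a minimal sketch rather than the authors' code. The 435 Cohen's κ pairs follow from taking 30 raters two at a time. The `fleiss_kappa` function is the textbook formulation; it assumes each rater's chosen answer option is treated as the category, which the abstract does not specify, and the toy ratings table is hypothetical.

```python
from math import comb


def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (every row must sum to the same n)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    total = n_items * n_raters
    n_categories = len(counts[0])
    # Share of all assignments that fell into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-item proportion of agreeing rater pairs.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_obs - p_exp) / (1 - p_exp)


N_RATERS = 30
n_pairs = comb(N_RATERS, 2)  # unordered rater pairs for pairwise Cohen's kappa

# Hypothetical toy table: 4 items, 30 raters, 2 answer categories,
# with every rater agreeing on every item.
toy_counts = [[30, 0], [0, 30], [30, 0], [0, 30]]
kappa_perfect = fleiss_kappa(toy_counts)
```

With unanimous ratings the function returns 1.0; values near 0 indicate agreement no better than chance, which is the scale on which the reported κ of 0.38 is read.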
Results:
SepticJoint-Reason correctly answered 27 of 30 questions (90.0%; 95% CI, 73.5–97.9). The pooled specialist panel achieved a mean accuracy of 76.8% (95% CI, 74.0–79.4), with individual scores ranging from 63.3% to 90.0%. The pipeline met the non-inferiority threshold and demonstrated statistical superiority (difference, +13.2 percentage points; 95% CI, 7.1–19.3; P<0.001). Uni-LLM accuracies were 73.3% for Claude Opus 4.5, 70.0% for GPT-4.1, and 66.7% for Gemini 2.5 Pro, all significantly inferior to the neurosymbolic pipeline (all McNemar P<0.05). Fleiss’ κ across the 30 specialist raters was 0.38 (fair-to-moderate). Question-subtype analysis revealed the pipeline’s greatest advantage on management and antibiotic therapy items (100% vs. 72.0%; P = 0.009) and microbiology questions (100% vs. 70.7%; P = 0.014). Inter-specialty ANOVA showed significant group differences (F = 8.94; P<0.001), with infectious disease specialists achieving the highest accuracy (81.3%). Ablation confirmed knowledge-graph verification as the dominant accuracy driver (+10.0%; McNemar P = 0.004).
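The interval reported for 27 of 30 correct answers is consistent with an exact (Clopper-Pearson) binomial interval, and the non-inferiority claim can be read directly off the lower confidence bound of the difference. The sketch below uses only the standard library; the function names are hypothetical, the abstract does not state which interval method the authors used, and the decision rule assumes the δ = 5% margin is expressed in percentage points.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_decreasing(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    half = alpha / 2
    # Lower bound: the p at which P(X >= k) equals alpha/2.
    lower = 0.0 if k == 0 else _bisect_decreasing(
        lambda p: half - (1 - binom_cdf(k - 1, n, p))
    )
    # Upper bound: the p at which P(X <= k) equals alpha/2.
    upper = 1.0 if k == n else _bisect_decreasing(
        lambda p: binom_cdf(k, n, p) - half
    )
    return lower, upper


def noninferiority_verdict(ci_lower_pp, margin_pp=5.0):
    """Classify a difference (pipeline minus comparator, in percentage points)
    from the lower bound of its confidence interval."""
    if ci_lower_pp > 0:
        return "superior"
    if ci_lower_pp > -margin_pp:
        return "non-inferior"
    return "inconclusive"


lower, upper = clopper_pearson(27, 30)   # roughly (0.735, 0.979)
verdict = noninferiority_verdict(7.1)    # lower bound of the reported difference
```

Because the reported lower bound of 7.1 percentage points lies above both the non-inferiority margin and zero, the same interval supports the non-inferiority and superiority statements.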
Conclusions:
A neurosymbolic multi-LLM reasoning pipeline significantly outperformed both board-certified specialists and constituent uni-LLMs on American Board–style septic arthritis questions. Knowledge-graph–grounded verification and multi-model consensus were the primary drivers of the neurosymbolic advantage, particularly on complex management and microbiological reasoning items.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.