Currently submitted to: JMIR AI

Date Submitted: Mar 2, 2026
Open Peer Review Period: Mar 10, 2026 - May 5, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Neurosymbolic Multi-LLM Reasoning Versus Board-Certified Specialists and Uni-LLM Architectures in Septic Arthritis Assessment: A Prospective Multi-Specialty Benchmarking Study with Item-Level and Question-Subtype Analysis

  • Mete Ucdal
  • Evren Ekingen
  • Ali Niyazi Kurtcebe

ABSTRACT

Background:

Septic arthritis is a rheumatologic emergency that requires prompt and precise diagnosis across multiple medical specialties. Whether neurosymbolic multi-LLM architectures, which integrate neural language models with formal knowledge-graph reasoning, can match board-certified specialists and outperform single-model (uni-LLM) approaches on clinical vignettes of septic arthritis has not been established.

Objective:

This study aimed to evaluate the diagnostic reasoning performance of SepticJoint-Reason, a multi-layer neurosymbolic pipeline, compared with board-certified specialists and constituent uni-LLMs on American Board–style septic arthritis questions, with emphasis on hallucination elimination and the neurosymbolic advantage over both human experts and standalone AI models.

Methods:

We developed SepticJoint-Reason, a five-stage neurosymbolic pipeline integrating Claude Opus 4.5 (Anthropic), GPT-4.1 (OpenAI), and Gemini 2.5 Pro (Google DeepMind) with a Neo4j musculoskeletal infection ontology (52,418 nodes; 203,672 edges), Lean 4–style proof-trace generation, adaptive compute allocation, and hallucination blocking. We benchmarked SepticJoint-Reason against 30 board-certified specialists (10 rheumatologists, 10 orthopedic surgeons, 10 infectious disease physicians) and against each constituent uni-LLM on 30 American Board–style septic arthritis questions across six clinical subtypes: etiology and risk factors, clinical presentation and diagnosis, synovial fluid analysis, microbiology and laboratory, management and antibiotic therapy, and complications and prognosis. Analyses included non-inferiority testing (δ = 5%), Fleiss’ κ (30 raters), Cohen’s κ (435 pairs), inter-specialty ANOVA, item-level concordance, question-subtype analysis, neurosymbolic versus uni-LLM comparisons, error typology with Fisher’s exact tests, ablation with McNemar’s tests, and counterfactual robustness analysis.
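Two of the quantities named above can be illustrated with a short sketch: the number of rater pairs behind the pairwise Cohen's κ analysis, and an exact two-sided McNemar test on paired correct/incorrect outcomes. The abstract does not say which McNemar variant the authors used, so the exact binomial form here is an assumption, and the function names and discordant counts are hypothetical.

```python
from math import comb


def n_rater_pairs(n_raters: int) -> int:
    """Number of unordered rater pairs available for pairwise Cohen's kappa."""
    return comb(n_raters, 2)


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items system A answered correctly and system B incorrectly,
    c = items system B answered correctly and system A incorrectly.
    Under the null hypothesis the discordant items split 50/50,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant items: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(n_rater_pairs(30))      # pairs among 30 specialists
    print(mcnemar_exact_p(8, 0))  # hypothetical discordant counts, not study data
```

With 30 raters the pair count is 435, which matches the figure given for Cohen's κ. The McNemar call uses made-up counts purely to show the calculation.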

Results:

SepticJoint-Reason correctly answered 27 of 30 questions (90.0%; 95% CI, 73.5–97.9). The pooled specialist panel achieved a mean accuracy of 76.8% (95% CI, 74.0–79.4), with individual scores ranging from 63.3% to 90.0%. The pipeline met the non-inferiority threshold and demonstrated statistical superiority (difference, +13.2 percentage points; 95% CI, 7.1–19.3; P<0.001). Uni-LLM accuracies were 73.3% for Claude Opus 4.5, 70.0% for GPT-4.1, and 66.7% for Gemini 2.5 Pro, all significantly inferior to the neurosymbolic pipeline (all McNemar P<0.05). Fleiss’ κ across the 30 specialist raters was 0.38 (fair-to-moderate). Question-subtype analysis revealed the pipeline’s greatest advantage on management and antibiotic therapy items (100% vs. 72.0%; P = 0.009) and microbiology questions (100% vs. 70.7%; P = 0.014). Inter-specialty ANOVA showed significant group differences (F = 8.94; P<0.001), with infectious disease specialists achieving the highest accuracy (81.3%). Ablation confirmed knowledge-graph verification as the dominant accuracy driver (+10.0%; McNemar P = 0.004).
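The headline interval and the non-inferiority decision can be sketched as follows. The abstract does not name its interval method; an exact (Clopper-Pearson) binomial interval is assumed here because it agrees with the reported bounds for 27 of 30 to within rounding. The helper names are hypothetical, and the decision rule takes the reported confidence bound as input rather than recomputing it.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target: float) -> float:
    """Bisection on [0, 1] for g(p) = target, where g is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided binomial confidence interval for x successes in n trials."""
    # Lower bound: P(X >= x | p) = alpha / 2
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper bound: P(X <= x | p) = alpha / 2, i.e. P(X > x | p) = 1 - alpha / 2
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


def is_noninferior(ci_lower_pp: float, margin_pp: float) -> bool:
    """Non-inferior if the lower bound of (pipeline - specialists) exceeds -margin."""
    return ci_lower_pp > -margin_pp


if __name__ == "__main__":
    lo, hi = clopper_pearson(27, 30)
    print(f"{100 * lo:.1f} to {100 * hi:.1f}")
    print(is_noninferior(7.1, 5.0))  # reported lower bound vs. delta = 5 points
```

Because the reported lower bound of the difference (7.1 percentage points) lies above both the margin of -5 and zero, the same interval supports the non-inferiority and the superiority claims.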

Conclusions:

A neurosymbolic multi-LLM reasoning pipeline significantly outperformed both board-certified specialists and constituent uni-LLMs on American Board–style septic arthritis questions. Knowledge-graph–grounded verification and multi-model consensus were the primary drivers of the neurosymbolic advantage, particularly on complex management and microbiological reasoning items.


Citation

Please cite as:

Ucdal M, Ekingen E, Kurtcebe AN

Neurosymbolic Multi-LLM Reasoning Versus Board-Certified Specialists and Uni-LLM Architectures in Septic Arthritis Assessment: A Prospective Multi-Specialty Benchmarking Study with Item-Level and Question-Subtype Analysis

JMIR Preprints. 02/03/2026:94458

DOI: 10.2196/preprints.94458

URL: https://preprints.jmir.org/preprint/94458


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.