
Currently submitted to: Journal of Medical Internet Research

Date Submitted: Feb 5, 2026
Open Peer Review Period: Feb 6, 2026 - Apr 3, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Occam’s Razor in AI-assisted complex diagnosis: a comparative effectiveness study of single large language models versus multi-agent systems in resource-constrained primary care settings

  • Tengfei Cai; 
  • Naiguang Zhang; 
  • Yansheng LI; 
  • Xiaoyan Li

ABSTRACT

Background:

Primary care physicians in resource-constrained settings, particularly within low-income and middle-income countries (LMICs), frequently encounter a "diagnostic gap" when managing complex, rare, or multisystemic pathologies. While Large Language Models (LLMs) demonstrate significant potential to augment clinical reasoning, current state-of-the-art solutions rely predominantly on high-bandwidth cloud infrastructure, limiting their deployment in regions with unstable internet connectivity and strict data sovereignty regulations.

Objective:

The prevailing technological consensus in computer science holds that "Agentic Workflows" or Multi-Agent Systems (MAS), which orchestrate multiple models to simulate collective reasoning, inherently offer superior accuracy and safety compared with single models. However, the comparative efficacy, safety, and cost-effectiveness of complex MAS versus single localised models in offline, hardware-limited environments remain untested.

Methods:

We conducted a prospective comparative benchmarking study using the DiagnosisArena dataset, comprising 915 complex clinical cases across 28 medical specialties. To simulate a secure, offline primary care environment, we evaluated five locally deployed single open-source LLMs (GPT-oss-20b, Llama3.1-70B, Qwen3-32B, DeepSeek-R1-32B, Gemma3-27B) against two Multi-Agent architectures: a Standard voting ensemble and a novel hierarchical Adaptive Weighted System. All models were hosted on a local server (4×NVIDIA A100) using the Dify platform. Performance was adjudicated against a Reference Standard established by the consensus of three board-certified physicians, using a dual-metric system: a 10-point Diagnostic Recall Scale and a comprehensive Hallucination/Safety Index. Inference latency and computational resource utilisation were recorded to assess cost-effectiveness.
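The abstract does not specify how the Adaptive Weighted System aggregates model outputs, but a weighted voting ensemble of the kind described can be sketched as follows. The model names, weights, and diagnoses below are illustrative assumptions, not the study's actual configuration or routing logic.

```python
from collections import defaultdict

def weighted_vote(proposals, weights):
    """Aggregate per-model diagnoses by summing each model's weight
    onto its proposed diagnosis, then return the top-scoring one.

    proposals: {model_name: diagnosis}
    weights:   {model_name: reliability weight} (1.0 if unlisted)
    """
    scores = defaultdict(float)
    for model, diagnosis in proposals.items():
        scores[diagnosis] += weights.get(model, 1.0)
    return max(scores, key=scores.get)

# Hypothetical example: three local models vote on a complex case.
proposals = {
    "model_a": "sarcoidosis",
    "model_b": "tuberculosis",
    "model_c": "sarcoidosis",
}
weights = {"model_a": 0.9, "model_b": 0.5, "model_c": 0.6}
print(weighted_vote(proposals, weights))  # sarcoidosis (0.9 + 0.6 > 0.5)
```

In a hierarchical variant, the weights themselves could be updated per specialty or per case difficulty; this sketch only shows the final aggregation step.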

Results:

Contrary to the hypothesis that architectural complexity yields greater diagnostic precision, single high-performance models significantly outperformed complex ensembles. The single GPT-oss-20b model achieved the highest Diagnostic Recall Score (mean 4.68 [SD 3.82]), significantly surpassing the Adaptive Weighted Multi-Agent System (4.13 [SD 3.43]; p<0.001) and smaller models such as Gemma3-27B (2.89 [SD 3.89]; p<0.001). The Adaptive System, despite utilising dynamic routing, failed to outperform the median score of human physicians (4.22 [SD 3.62]; p=0.432). Furthermore, the inclusion of mid-tier models in the adaptive workflow introduced an "ensemble degradation" effect, significantly lowering the Safety Score relative to the single GPT-oss-20b model (4.99 vs 5.50; p<0.001) and reducing the rate of Top-1 correct diagnoses from 51.58% to 46.89%. Crucially, the single GPT-oss-20b model was also far more efficient, averaging 30 seconds of inference per case versus 200 seconds for the Standard Multi-Agent System, an 85% reduction in latency.
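The 85% latency figure follows directly from the two per-case inference times reported above:

```python
# Latency figures from the abstract: single model ~30 s/case,
# Standard Multi-Agent System ~200 s/case.
single_s, mas_s = 30, 200
reduction = (mas_s - single_s) / mas_s
print(f"{reduction:.0%}")  # 85%
```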

Conclusions:

In the context of clinical diagnosis, architectural complexity does not equate to clinical utility. We identified a phenomenon of "ensemble degradation," where integrating mid-tier models into ensembles dilutes the reasoning capabilities of strong base models through the introduction of diagnostic noise. For global health equity, implementation strategies should prioritise "Lean AI"—localising a single, robust open-source model—rather than orchestrating computationally expensive agent swarms. This approach provides a safer, more accurate, and scientifically validated path for bridging the diagnostic gap in resource-constrained primary care.
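The "ensemble degradation" effect described above can be illustrated with a toy probability model (not the study's data): under majority voting, pairing one strong model with two mid-tier models whose answers are near chance can pull ensemble accuracy below the strong model alone. The per-model accuracies here are assumed values chosen only to make the dilution visible.

```python
from itertools import product

def majority_accuracy(probs):
    """Exact probability that a strict majority of independent voters
    is correct, given each voter's individual accuracy."""
    total = 0.0
    for outcomes in product([True, False], repeat=len(probs)):
        p = 1.0
        for correct, prob in zip(outcomes, probs):
            p *= prob if correct else (1 - prob)
        if sum(outcomes) > len(probs) / 2:
            total += p
    return total

strong_alone = 0.85                          # assumed strong-model accuracy
ensemble = majority_accuracy([0.85, 0.5, 0.5])  # strong + two mid-tier voters
print(strong_alone, round(ensemble, 4))  # 0.85 0.675
```

Because the two weaker voters can jointly outvote the strong model, the ensemble's expected accuracy (0.675) falls well below the strong model's own (0.85), which is the diagnostic-noise dilution the conclusion describes.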


Citation

Please cite as:

Cai T, Zhang N, LI Y, Li X

Occam’s Razor in AI-assisted complex diagnosis: a comparative effectiveness study of single large language models versus multi-agent systems in resource-constrained primary care settings

JMIR Preprints. 05/02/2026:92925

DOI: 10.2196/preprints.92925

URL: https://preprints.jmir.org/preprint/92925


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.