
Currently submitted to: Journal of Medical Internet Research

Date Submitted: Feb 5, 2026
Open Peer Review Period: Feb 6, 2026 - Apr 3, 2026

NOTE: This is an unreviewed Preprint

Warning: This is an unreviewed preprint (What is a preprint?). Readers are cautioned that the document has not been peer-reviewed by expert or patient reviewers or an academic editor, may contain misleading claims, and is likely to undergo changes before final publication, if accepted; it may also have been rejected or withdrawn (in which case a note "no longer under consideration" will appear above).

Peer review me: Readers with relevant interest and expertise are encouraged to sign up as a peer reviewer if the paper is within an open peer-review period (in that case, a "Peer Review Me" button to sign up as a reviewer is displayed above). All preprints currently open for review are listed here. Outside the formal open peer-review period, we encourage you to tweet about the preprint.

Citation: Please cite this preprint only for review purposes or for grant applications and CVs (if you are the author).

Final version: If our system detects a final peer-reviewed "version of record" (VoR) published in any journal, a link to that VoR will appear below. Readers are then encouraged to cite the VoR instead of this preprint.

Settings: If you are the author, you can log in and change the preprint display settings, but the preprint URL/DOI is intended to be stable and citable, so it should not be removed once posted.

Submit: To post your own preprint, simply submit to any JMIR journal, and choose the appropriate settings to expose your submitted version as preprint.

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Occam’s Razor in AI-assisted complex diagnosis: a comparative effectiveness study of single large language models versus multi-agent systems in resource-constrained primary care settings

  • Tengfei Cai; 
  • Naiguang Zhang; 
  • Yansheng LI; 
  • Xiaoyan Li

ABSTRACT

Background:

Primary care physicians in resource-constrained settings, particularly within low-income and middle-income countries (LMICs), frequently encounter a "diagnostic gap" when managing complex, rare, or multisystemic pathologies. While Large Language Models (LLMs) demonstrate significant potential to augment clinical reasoning, current state-of-the-art solutions rely predominantly on high-bandwidth cloud infrastructure, limiting their deployment in regions with unstable internet connectivity and strict data sovereignty regulations.

Objective:

The prevailing technological consensus in computer science suggests that "Agentic Workflows" or Multi-Agent Systems (MAS)—which orchestrate multiple models to simulate collective reasoning—inherently offer superior accuracy and safety compared to single models. However, the comparative efficacy, safety, and cost-effectiveness of complex MAS versus single localised models in offline, hardware-limited environments remain unproven.

Methods:

We conducted a prospective comparative benchmarking study using the DiagnosisArena dataset, comprising 915 complex clinical cases across 28 medical specialties. To simulate a secure, offline primary care environment, we evaluated five locally deployed single open-source LLMs (GPT-oss-20b, Llama3.1-70B, Qwen3-32B, DeepSeek-R1-32B, Gemma3-27B) against two Multi-Agent architectures: a Standard voting ensemble and a novel hierarchical Adaptive Weighted System. All models were hosted on a local server (4×NVIDIA A100) using the Dify platform. Performance was adjudicated against a Reference Standard established by the consensus of three board-certified physicians using a dual-metric system: a 10-point Diagnostic Recall Scale and a comprehensive Hallucination/Safety Index. Inference latency and computational resource utilisation were recorded to assess cost-effectiveness.
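The two ensemble architectures described above can be contrasted with a minimal sketch. This is an illustration only, not the study's implementation: the candidate diagnoses and per-model weights below are invented for the example, and the real Adaptive Weighted System presumably derives its weights dynamically (e.g. per specialty) rather than from a fixed list.

```python
from collections import Counter

def majority_vote(diagnoses):
    """Standard voting ensemble: the most frequent Top-1 diagnosis wins
    (ties broken by first occurrence)."""
    return Counter(diagnoses).most_common(1)[0][0]

def adaptive_weighted_vote(diagnoses, weights):
    """Hypothetical weighted ensemble: each model's vote counts with its
    assigned weight; the diagnosis with the highest weighted sum wins."""
    scores = {}
    for dx, w in zip(diagnoses, weights):
        scores[dx] = scores.get(dx, 0.0) + w
    return max(scores, key=scores.get)

# Invented example: Top-1 diagnoses from five models and assumed weights.
preds = ["sarcoidosis", "tuberculosis", "sarcoidosis", "lymphoma", "sarcoidosis"]
weights = [0.9, 0.4, 0.8, 0.3, 0.7]

print(majority_vote(preds))                   # sarcoidosis
print(adaptive_weighted_vote(preds, weights)) # sarcoidosis (2.4 vs 0.4 vs 0.3)
```

Note that both aggregation rules still run every constituent model per case, which is why ensemble latency scales with the number (and size) of agents involved.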

Results:

Contrary to the hypothesis that architectural complexity yields diagnostic precision, single high-performance models significantly outperformed complex ensembles. The single GPT-oss-20b model achieved the highest Diagnostic Recall Score (mean 4.68 [SD 3.82]), statistically surpassing the Adaptive Weighted Multi-Agent System (4.13 [SD 3.43]; p<0.001) and smaller models such as Gemma3-27B (2.89 [SD 3.89]; p<0.001). The Adaptive System, despite utilising dynamic routing, failed to outperform the median score of human physicians (4.22 [SD 3.62]; p=0.432). Furthermore, the inclusion of mid-tier models in the adaptive workflow introduced an "ensemble degradation" effect, significantly lowering the Safety Score compared to the single GPT-oss-20b model (4.99 vs 5.50; p<0.001) and reducing the rate of Top-1 correct diagnoses from 51.58% to 46.89%. Crucially, the single GPT-oss-20b model demonstrated superior efficiency with an average inference time of 30 seconds per case, compared to 200 seconds for the Standard Multi-Agent System—representing an 85% reduction in latency.
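The 85% latency figure follows directly from the reported per-case inference times (30 s for the single model vs 200 s for the Standard Multi-Agent System); the small helper below just makes the arithmetic explicit.

```python
def latency_reduction(single_s, ensemble_s):
    """Relative latency reduction of the single model vs the ensemble."""
    return (ensemble_s - single_s) / ensemble_s

# Figures reported in the abstract: 30 s per case vs 200 s per case.
print(f"{latency_reduction(30, 200):.0%}")  # 85%
```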

Conclusions:

In the context of clinical diagnosis, architectural complexity does not equate to clinical utility. We identified a phenomenon of "ensemble degradation," where integrating mid-tier models into ensembles dilutes the reasoning capabilities of strong base models through the introduction of diagnostic noise. For global health equity, implementation strategies should prioritise "Lean AI"—localising a single, robust open-source model—rather than orchestrating computationally expensive agent swarms. This approach provides a safer, more accurate, and scientifically validated path for bridging the diagnostic gap in resource-constrained primary care.


 Citation

Please cite as:

Cai T, Zhang N, LI Y, Li X

Occam’s Razor in AI-assisted complex diagnosis: a comparative effectiveness study of single large language models versus multi-agent systems in resource-constrained primary care settings

JMIR Preprints. 05/02/2026:92925

DOI: 10.2196/preprints.92925

URL: https://preprints.jmir.org/preprint/92925


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.