
Currently submitted to: Journal of Medical Internet Research

Date Submitted: Apr 12, 2026
Open Peer Review Period: Apr 13, 2026 - Jun 8, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Diagnostic Accuracy of Critical Care-Specialized Versus General-Purpose Large Language Models in Emergency Intensive Care Unit Diseases: A Paired Comparative Study

  • Lihong Zheng

ABSTRACT

Background:

The Emergency Intensive Care Unit (EICU) is the core setting for treating critically ill patients, yet its diagnostic error rate is more than twice that of general inpatient wards, with serious consequences for patient prognosis. Large Language Models (LLMs) have shown potential in clinical diagnosis, but evidence comparing the diagnostic performance of critical care-specialized and general-purpose LLMs in the complex diagnostic scenarios of the EICU remains very limited.

Objective:

This study aimed to evaluate and compare the diagnostic accuracy of a critical care-specialized LLM (Qiyuan 3.0.1) and three mainstream general-purpose LLMs (GPT5.1, DeepSeek V3.1, and Qwen3-32B) in EICU diseases, and to provide an evidence-based basis for selecting intelligent diagnostic support tools in the EICU.

Methods:

This single-center retrospective paired diagnostic accuracy study consecutively enrolled 184 critically ill patients admitted to the EICU of Peking University Shenzhen Hospital from April 2025 to March 2026. Standardized datasets were constructed from the patients' clinical data: an initial diagnosis dataset (clinical data within 24 hours of admission) and a final diagnosis dataset (complete course data from admission to discharge). Using a unified zero-shot prompt strategy, the four LLMs independently generated diagnoses under blinded conditions. The gold standard was the consensus diagnosis of three senior intensive care physicians, each with more than 10 years of EICU experience, who were blinded to the model outputs. The primary endpoint was Top-1 accuracy in the final diagnosis stage, defined as the proportion of cases in which the model's first primary diagnosis completely matched the gold standard. Secondary endpoints were Top-1 accuracy in the initial diagnosis stage and the number of correct diagnoses among the Top-3 outputs in the final diagnosis stage. Cochran's Q test was used for the overall comparison of accuracy across the four models, with post hoc pairwise comparisons by the paired McNemar test under Bonferroni correction for type I error. The Friedman nonparametric rank sum test was used to compare the number of correct Top-3 diagnoses across models.
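The paired-comparison pipeline described above can be sketched in a few lines. The matrix below is toy data, not the study's per-case results, and the helper names (`cochrans_q`, `mcnemar_exact`) are illustrative, not from the paper:

```python
from math import comb

def cochrans_q(x):
    """Cochran's Q statistic for a cases-by-models binary matrix (1 = correct).
    Compared against a chi-square distribution with k-1 degrees of freedom."""
    k = len(x[0])
    T = [sum(row[j] for row in x) for j in range(k)]  # per-model correct counts
    R = [sum(row) for row in x]                       # per-case correct counts
    num = (k - 1) * (k * sum(t * t for t in T) - sum(T) ** 2)
    den = k * sum(R) - sum(r * r for r in R)
    return num / den

def mcnemar_exact(x, a, b):
    """Exact two-sided McNemar p-value for the paired models in columns a, b."""
    d1 = sum(1 for row in x if row[a] == 1 and row[b] == 0)  # a right, b wrong
    d2 = sum(1 for row in x if row[a] == 0 and row[b] == 1)  # a wrong, b right
    n = d1 + d2
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(d1, d2) + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)

# Toy correctness matrix: 6 cases x 4 models (1 = Top-1 diagnosis correct)
X = [[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0],
     [1, 1, 1, 1], [0, 1, 1, 0], [1, 1, 1, 0]]
q = cochrans_q(X)                      # overall 4-model comparison
p_raw = mcnemar_exact(X, 0, 3)         # post hoc pair: model 1 vs model 4
p_bonf = min(1.0, p_raw * 6)           # Bonferroni over C(4,2) = 6 pairs
```

Note that the Bonferroni-adjusted significance threshold for six pairwise comparisons is 0.05/6 ≈ 0.0083, matching the threshold used in the Results.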

Results:

In the final diagnosis stage, the overall difference in Top-1 accuracy among the four models was statistically significant (Cochran's Q=20.32, df=3, P=4.57×10⁻⁵). Qiyuan 3.0.1 had the highest Top-1 accuracy (64.13%, 95% CI 56.83%-71.00%), followed by GPT5.1 (59.24%, 95% CI 51.83%-66.35%) and DeepSeek V3.1 (57.07%, 95% CI 49.64%-64.28%), with Qwen3-32B lowest (51.63%, 95% CI 44.26%-58.98%). Post hoc pairwise comparisons showed that Qiyuan 3.0.1, GPT5.1, and DeepSeek V3.1 were each significantly more accurate than Qwen3-32B (all adjusted P<0.0083), with no significant differences in the remaining pairwise comparisons (all adjusted P>0.0083). A similar trend was observed in the initial diagnosis stage, where only Qiyuan 3.0.1 was significantly superior to Qwen3-32B (adjusted P=0.008). The median number of correct Top-3 diagnoses was 2.0 (IQR 1.0-2.0) for all four models, with no significant intergroup difference (Friedman χ²=3.34, df=3, P=0.339).

Conclusions:

The critical care-specialized LLM Qiyuan 3.0.1 achieved higher Top-1 diagnostic accuracy in EICU diseases than some general-purpose LLMs, but the absolute accuracy of all included models leaves considerable room for improvement. LLMs have potential value as diagnostic support tools in the EICU, but their clinical application will require further optimization and validation in multicenter prospective clinical trials.


Citation

Please cite as:

Diagnostic Accuracy of Critical Care-Specialized Versus General-Purpose Large Language Models in Emergency Intensive Care Unit Diseases: A Paired Comparative Study

JMIR Preprints. 12/04/2026:98026

DOI: 10.2196/preprints.98026

URL: https://preprints.jmir.org/preprint/98026


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.