Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 31, 2024
Date Accepted: Mar 25, 2025

The final, peer-reviewed published version of this preprint can be found here:

Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis

Shan G, Chen X, Wang C, Liu L, Gu Y, Jiang H, Shi T

Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis

JMIR Med Inform 2025;13:e64963

DOI: 10.2196/64963

PMID: 40279517

PMCID: 12047852

Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis

  • Guxue Shan; 
  • Xiaonan Chen; 
  • Chen Wang; 
  • Li Liu; 
  • Yuanjing Gu; 
  • Huiping Jiang; 
  • Tingqi Shi

ABSTRACT

Background:

In the era of health care big data, integrating artificial intelligence (AI) with clinical decision support systems has become a significant trend. Although many researchers have investigated the application of specialized AI and software tools in clinical diagnosis, the performance of large language models (LLMs) in this area remains underexplored.

Objective:

This study systematically reviewed the diagnostic accuracy of large language models in clinical settings to provide a reference for their further clinical application.

Methods:

We conducted searches in CNKI, VIP Database, SinoMed, PubMed, Web of Science, Embase, and CINAHL from January 1, 2017, to the present. Two reviewers independently screened the literature and extracted relevant information. The risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST), which evaluates both the risk of bias and the applicability of included studies.

Results:

Twenty studies involving seven large language models and a total of 2787 cases were included. Quality assessment indicated that the included studies generally had a low risk of bias. For the optimal model, the accuracy of the primary diagnosis ranged from 25% to 97.8%, while the triage accuracy ranged from 66.7% to 98%.

Conclusions:

Large language models have demonstrated certain diagnostic capabilities and significant potential for application in various clinical cases. Further research involving larger sample sizes, multicenter collaborations, and high-quality studies is necessary to fully explore the diagnostic performance of these models.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.