Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 31, 2024
Date Accepted: Mar 25, 2025

The final, peer-reviewed published version of this preprint can be found here:

Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis

Shan G, Chen X, Wang C, Liu L, Gu Y, Jiang H, Shi T

Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis

JMIR Med Inform 2025;13:e64963

DOI: 10.2196/64963

PMID: 40279517

PMCID: 12047852

Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis

  • Guxue Shan; 
  • Xiaonan Chen; 
  • Chen Wang; 
  • Li Liu; 
  • Yuanjing Gu; 
  • Huiping Jiang; 
  • Tingqi Shi

ABSTRACT

Background:

In the era of health care big data, integrating artificial intelligence (AI) with clinical decision support systems has become a significant trend. Although many researchers have investigated the application of specialized AI and software tools in clinical diagnosis, the performance of large language models (LLMs) in this area remains underexplored.

Objective:

This study systematically reviewed the diagnostic accuracy of large language models in clinical settings to provide a reference for their further clinical application.

Methods:

We conducted searches in CNKI, VIP Database, SinoMed, PubMed, Web of Science, Embase, and CINAHL from January 1, 2017, to the present. Two reviewers independently screened the literature and extracted relevant information. The risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST), which evaluates both the risk of bias and the applicability of included studies.

Results:

Twenty studies involving seven large language models and a total of 2787 cases were included. Quality assessment indicated that the included studies generally had a low risk of bias. For the optimal model, the accuracy of the primary diagnosis ranged from 25% to 97.8%, while the triage accuracy ranged from 66.7% to 98%.

Conclusions:

Large language models have demonstrated certain diagnostic capabilities and significant potential for application in various clinical cases. Further research involving larger sample sizes, multicenter collaborations, and high-quality studies is necessary to fully explore the diagnostic performance of these models.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.