
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jul 18, 2024
Date Accepted: Apr 3, 2025

The final, peer-reviewed published version of this preprint can be found here:

Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis


The accuracy of LLMs in answering clinical research questions: a systematic review and network meta-analysis

  • Ling Wang; 
  • Jinglin Li; 
  • Boyang Zhuang; 
  • Shasha Huang; 
  • Meilin Fang; 
  • Cunze Wang; 
  • Wen Li; 
  • Mohan Zhang; 
  • Shurong Gong

ABSTRACT

Background:

Large language models (LLMs) have flourished in recent years and have become an important direction for research and application in medicine. However, because medicine is highly specialized, complex, and domain specific, with correspondingly stringent accuracy requirements, whether LLMs can be used in the medical field remains controversial. A growing number of studies have evaluated the performance of various LLMs in medicine, but their conclusions are inconsistent.

Objective:

This study used network meta-analysis to assess the accuracy of LLMs in answering clinical research questions and to provide high-level evidence for their future development and application in the medical field.

Methods:

In this systematic review and network meta-analysis, we searched PubMed, Embase, Web of Science, and Scopus from inception to October 14, 2024. Studies on the accuracy of LLMs in answering clinical research questions were screened against the published reports and included. A systematic review and network meta-analysis were conducted to compare the accuracy of different LLMs in answering clinical research questions, covering objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The network meta-analysis (NMA) was performed using Bayesian frequency-theory methods, and indirect comparisons between models were made using a ranking scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM.
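For context, SUCRA is conventionally computed from the cumulative rank probabilities estimated by the NMA; the definition below is the standard one from the NMA literature and is not a detail stated in this abstract. For model $k$ among $a$ competitors,

$$\mathrm{SUCRA}_k = \frac{1}{a-1}\sum_{j=1}^{a-1} \mathrm{cum}_{kj},$$

where $\mathrm{cum}_{kj}$ is the probability that model $k$ ranks within the top $j$ positions. A SUCRA of 1 means the model always ranks best on accuracy; a SUCRA of 0 means it always ranks worst.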

Results:

The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3,063 clinical cases. Of the 168 studies, 40 (23.81%) were judged to be at low risk of bias and 128 (76.19%) at moderate risk; none were rated as high risk. ChatGPT-4o had the highest accuracy ranking for objective questions (SUCRA 0.9207), followed by Aeyeconsult (SUCRA 0.9187) and ChatGPT-4 (SUCRA 0.8087). ChatGPT-4 (SUCRA 0.8708) excelled at answering open-ended questions. For the accuracy of top 1 and top 3 diagnoses of clinical cases, human experts ranked highest (SUCRA 0.9001 and 0.7126, respectively), while Claude 3 Opus (SUCRA 0.9672) performed well in top 5 diagnosis. Gemini had the highest SUCRA value (0.9649) for accuracy in triage and classification.

Conclusions:

Our study indicates that ChatGPT-4o has an advantage in answering objective questions, whereas ChatGPT-4 may be more credible for open-ended questions. Human experts are more accurate for top 1 and top 3 diagnoses, Claude 3 Opus performs better for top 5 diagnosis, and Gemini is more advantageous for triage and classification. This analysis offers valuable insights for clinicians and medical practitioners, helping them leverage LLMs effectively to improve decision-making in learning, diagnosis, and management across clinical scenarios. Clinical Trial: PROSPERO registration number CRD42024558245.


 Citation

Please cite as:

Wang L, Li J, Zhuang B, Huang S, Fang M, Wang C, Li W, Zhang M, Gong S

Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis

J Med Internet Res 2025;27:e64486

DOI: 10.2196/64486

PMID: 40305085

PMCID: 12079073

Per the authors' request, the PDF is not available.

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.