
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 15, 2025
Open Peer Review Period: Sep 15, 2025 - Nov 10, 2025
Date Accepted: Nov 12, 2025

The final, peer-reviewed published version of this preprint can be found here:

Knowledge-Practice Performance Gap in Clinical Large Language Models: Systematic Review of 39 Benchmarks

Gong EJ, Seok C, Lee JJ, Baik GH

J Med Internet Res 2025;27:e84120

DOI: 10.2196/84120

PMID: 41325597

PMCID: 12706444

Knowledge-Practice Performance Gap in Clinical Large Language Models: A Systematic Review and Quality Assessment of 39 Benchmarks

  • Eun Jeong Gong; 
  • Chang Seok; 
  • Jae Jun Lee; 
  • Gwang Ho Baik

ABSTRACT

Background:

The evaluation of large language models (LLMs) in medicine has shifted from knowledge-based testing to practice-based assessment, a critical evolution in how we measure artificial intelligence (AI) readiness for clinical deployment.

Objective:

This paper provides clinicians with a comprehensive framework for understanding clinical medicine LLM benchmarks, interpreting performance metrics, and evaluating the transition from traditional knowledge assessments to real-world clinical practice evaluation.

Methods:

Core databases, including MEDLINE/PubMed, EMBASE, the Cochrane Library, and arXiv, were searched using keywords related to clinical medicine benchmarks for LLMs (from inception to August 2025). Studies were included if they satisfied the following criteria: (1) studies of any type investigating clinical medicine benchmarks for LLMs; (2) studies published in English; and (3) studies available in full-text format. The exclusion criteria were as follows: (1) studies that did not report clinical medicine benchmarks for LLMs; and (2) studies reporting benchmarks outside the clinical medicine field. Methodological quality was assessed using the Mixed Methods Appraisal Tool.

Results:

Our systematic review identified 39 medical LLM benchmarks, categorized into 21 knowledge-based frameworks (54%), 15 practice-based frameworks (38%), and 3 hybrid frameworks (8%). These benchmarks collectively encompass over 2 million questions and conversations across multiple languages, including English, Chinese, Spanish, Korean, Swedish, and 15 African languages. Traditional knowledge-based benchmarks show saturation, with leading models achieving 84%-90% accuracy on USMLE-style examinations. However, practice-based assessments reveal performance challenges, with specific benchmarks showing varied results: DiagnosisArena (45.82%), MedAgentBench (69.67%), and HealthBench (60% success rates). Performance differences between knowledge and practice benchmarks vary considerably, with the largest gaps in safety-critical scenarios requiring nuanced clinical judgment.

Conclusions:

The evolution from knowledge-based to practice-based benchmarks represents a necessary maturation in clinical medicine AI evaluation. While LLMs achieve high accuracy on knowledge tests, practice-based benchmarks show varied results, revealing gaps in clinical reasoning and safety assessment. This shift enables realistic evaluation of AI readiness for clinical deployment, highlighting the need for benchmarks that prioritize ecological validity and comprehensive safety evaluation.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.