Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 15, 2025
Open Peer Review Period: Sep 15, 2025 - Nov 10, 2025
Date Accepted: Nov 12, 2025
Knowledge-Practice Performance Gap in Clinical Large Language Models: A Systematic Review and Quality Assessment of 39 Benchmarks
ABSTRACT
Background:
The evaluation of Large Language Models (LLMs) in medicine has undergone a shift from knowledge-based testing to practice-based assessment, marking a critical evolution in how we measure artificial intelligence (AI) readiness for clinical deployment.
Objective:
This paper provides clinicians with a comprehensive framework for understanding clinical medicine LLM benchmarks, interpreting performance metrics, and evaluating the transition from traditional knowledge assessments to real-world clinical practice evaluation.
Methods:
Core databases, including MEDLINE/PubMed, EMBASE, the Cochrane Library, and arXiv, were searched from inception to August 2025 using keywords related to clinical medicine benchmarks for LLMs. Studies were included if they (1) investigated clinical medicine benchmarks for LLMs, regardless of study type; (2) were published in English; and (3) were available in full text. Studies were excluded if they (1) did not report on clinical medicine benchmarks for LLMs or (2) reported benchmarks outside the clinical medicine domain. Methodological quality was assessed using the Mixed Methods Appraisal Tool.
Results:
Our systematic review identified 39 medical LLM benchmarks, categorized into 21 knowledge-based (54%), 15 practice-based (38%), and 3 hybrid frameworks (8%). These benchmarks collectively encompass over 2 million questions and conversations across multiple languages, including English, Chinese, Spanish, Korean, Swedish, and 15 African languages. Traditional knowledge-based benchmarks show saturation, with leading models achieving 84%-90% accuracy on USMLE-style examinations. However, practice-based assessments reveal persistent performance challenges, with success rates of 45.82% on DiagnosisArena, 69.67% on MedAgentBench, and 60% on HealthBench. Performance differences between knowledge and practice benchmarks vary considerably, with the largest gaps appearing in safety-critical scenarios that require nuanced clinical judgment.
Conclusions:
The evolution from knowledge-based to practice-based benchmarks represents a necessary maturation in clinical medicine AI evaluation. While LLMs achieve high accuracy on knowledge tests, practice-based benchmarks show varied results, revealing gaps in clinical reasoning and safety assessment. This shift enables realistic evaluation of AI readiness for clinical deployment, highlighting the need for benchmarks that prioritize ecological validity and comprehensive safety evaluation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.