
Currently submitted to: JMIR Formative Research

Date Submitted: May 3, 2026
Open Peer Review Period: May 6, 2026 - Jul 1, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Performance of ChatGPT, Claude and AMBOSS on the European Board of Urology in-service assessment: a comparative analysis and alignment with EAU 2025 guidelines

  • Karl H. Pang; 
  • Shaza Hendy; 
  • Mohamed Eldaneen; 
  • Omar Ramadan; 
  • Panagiotis Nikolinakos

ABSTRACT

Background:

Recent advances in artificial intelligence (AI), particularly large language models, have generated growing interest in their application to medical education and examination preparation. However, the accuracy, reasoning quality, and adherence to clinical guidelines of these tools in postgraduate urology assessments remain unclear.

Objective:

To evaluate the performance of three AI tools (ChatGPT [GPT-4.0], Claude-4.5, and AMBOSS) on European Board of Urology (EBU)-style multiple-choice questions, with particular focus on accuracy, insight, concordance, and adherence to European Association of Urology (EAU) guidelines.

Methods:

A total of 200 single-best-answer questions from the EBU In-Service Assessment workbook (2021–2022) were input into each AI model. Models were prompted to select an answer and provide an explanation. Two post-FRCS urologists independently assessed outputs. Accuracy was defined as correct answer selection. Insight was evaluated across three domains: non-obvious deduction, discriminative reasoning, and clinical validity, graded as low, moderate, or high. Concordance was defined as logical alignment between the answer and its explanation.
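To make the scoring definitions concrete, the minimal sketch below shows one way the reviewers' grades could be tallied into the accuracy and concordance percentages reported in the Results. This is an illustrative assumption, not the authors' code; the record fields and the GradedItem/summarise names are hypothetical.

```python
# Illustrative sketch only (not the study's actual code). Assumes each of the
# 200 questions per model has been graded by the reviewers into a record.
from dataclasses import dataclass

@dataclass
class GradedItem:
    question_id: int
    answer_correct: bool          # accuracy: correct single-best-answer selection
    explanation_concordant: bool  # concordance: explanation logically supports the answer
    non_obvious_deduction: str    # "low" | "moderate" | "high"
    discriminative_reasoning: str # same three-level scale
    clinical_validity: str        # same three-level scale

def summarise(items: list[GradedItem]) -> dict:
    """Return headline percentages of the kind reported in the Results section."""
    n = len(items)
    return {
        "accuracy_%": 100 * sum(i.answer_correct for i in items) / n,
        "concordance_%": 100 * sum(i.explanation_concordant for i in items) / n,
    }

# Hypothetical grades for two questions, for illustration only:
sample = [
    GradedItem(1, True, True, "moderate", "high", "high"),
    GradedItem(2, False, True, "low", "moderate", "high"),
]
print(summarise(sample))  # {'accuracy_%': 50.0, 'concordance_%': 100.0}
```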

Results:

ChatGPT demonstrated the highest accuracy (85.5%), compared to Claude and AMBOSS (both 79.5%). Concordance was also highest for ChatGPT (95%), followed by Claude (88%) and AMBOSS (76%). Non-obvious deduction was predominantly low-to-moderate across all models, reflecting the recall-based nature of many questions. ChatGPT and Claude showed stronger discriminative reasoning, while AMBOSS demonstrated limited exclusion of alternative options. Clinical validity was high overall, with ChatGPT showing the greatest consistency with EAU guidelines.

Conclusions:

AI tools can achieve high accuracy on EBU-style assessments; however, differences in reasoning quality and guideline adherence are evident. ChatGPT demonstrated superior performance across all evaluated domains, supporting its role as a potential adjunct in postgraduate urology education.


Citation

Please cite as:

Pang KH, Hendy S, Eldaneen M, Ramadan O, Nikolinakos P

Performance of ChatGPT, Claude and AMBOSS on the European Board of Urology in-service assessment: a comparative analysis and alignment with EAU 2025 guidelines

JMIR Preprints. 03/05/2026:100148

DOI: 10.2196/preprints.100148

URL: https://preprints.jmir.org/preprint/100148


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.