
Currently submitted to: JMIR Formative Research

Date Submitted: May 3, 2026
Open Peer Review Period: May 6, 2026 - Jul 1, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Performance of ChatGPT, Claude and AMBOSS on the European Board of Urology in-service assessment: a comparative analysis and alignment with EAU 2025 guidelines

  • Karl H. Pang; 
  • Shaza Hendy; 
  • Mohamed Eldaneen; 
  • Omar Ramadan; 
  • Panagiotis Nikolinakos

ABSTRACT

Background:

Recent advances in artificial intelligence (AI), particularly large language models, have generated growing interest in their application to medical education and examination preparation. However, the accuracy, reasoning quality, and adherence to clinical guidelines of these tools in postgraduate urology assessments remain unclear.

Objective:

To evaluate the performance of three AI tools (ChatGPT [GPT-4.0], Claude-4.5, and AMBOSS) on European Board of Urology (EBU)-style multiple-choice questions, with particular focus on accuracy, insight, concordance, and adherence to European Association of Urology (EAU) guidelines.

Methods:

A total of 200 single-best-answer questions from the EBU In-Service Assessment workbook (2021–2022) were input into each AI model. Models were prompted to select an answer and provide an explanation. Two post-FRCS urologists independently assessed outputs. Accuracy was defined as correct answer selection. Insight was evaluated across three domains: non-obvious deduction, discriminative reasoning, and clinical validity, graded as low, moderate, or high. Concordance was defined as logical alignment between the answer and its explanation.
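To make the scoring definitions concrete, the minimal sketch below shows one way the reviewers' grades could be tallied into the accuracy and concordance percentages reported in the Results. This is an illustrative assumption, not the authors' code; the record fields and the GradedItem/summarise names are hypothetical.

```python
# Illustrative sketch only (not the study's actual code). Assumes each of the
# 200 questions per model has been graded by the reviewers into a record.
from dataclasses import dataclass

@dataclass
class GradedItem:
    question_id: int
    answer_correct: bool          # accuracy: correct single-best-answer selection
    explanation_concordant: bool  # concordance: explanation logically supports the answer
    non_obvious_deduction: str    # "low" | "moderate" | "high"
    discriminative_reasoning: str # same three-level scale
    clinical_validity: str        # same three-level scale

def summarise(items: list[GradedItem]) -> dict:
    """Return headline percentages of the kind reported in the Results section."""
    n = len(items)
    return {
        "accuracy_%": 100 * sum(i.answer_correct for i in items) / n,
        "concordance_%": 100 * sum(i.explanation_concordant for i in items) / n,
    }

# Hypothetical grades for two questions, for illustration only:
sample = [
    GradedItem(1, True, True, "moderate", "high", "high"),
    GradedItem(2, False, True, "low", "moderate", "high"),
]
print(summarise(sample))  # {'accuracy_%': 50.0, 'concordance_%': 100.0}
```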

Results:

ChatGPT demonstrated the highest accuracy (85.5%), compared to Claude and AMBOSS (both 79.5%). Concordance was also highest for ChatGPT (95%), followed by Claude (88%) and AMBOSS (76%). Non-obvious deduction was predominantly low-to-moderate across all models, reflecting the recall-based nature of many questions. ChatGPT and Claude showed stronger discriminative reasoning, while AMBOSS demonstrated limited exclusion of alternative options. Clinical validity was high overall, with ChatGPT showing the greatest consistency with EAU guidelines.

Conclusions:

AI tools can achieve high accuracy on EBU-style assessments; however, differences in reasoning quality and guideline adherence are evident. ChatGPT demonstrated superior performance across all evaluated domains, supporting its role as a potential adjunct in postgraduate urology education.


Citation

Please cite as:

Pang KH, Hendy S, Eldaneen M, Ramadan O, Nikolinakos P

Performance of ChatGPT, Claude and AMBOSS on the European Board of Urology in-service assessment: a comparative analysis and alignment with EAU 2025 guidelines

JMIR Preprints. 03/05/2026:100148

DOI: 10.2196/preprints.100148

URL: https://preprints.jmir.org/preprint/100148


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.