
Currently submitted to: JMIR Research Protocols

Date Submitted: Jan 18, 2026
Open Peer Review Period: Jan 19, 2026 - Mar 16, 2026 (currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Models in German Continuing Medical Education Assessment: Fully Crossed Experimental Study Protocol

  • Leyla Özmen; 
  • Timur Sellmann; 
  • Christian Burisch; 
  • Daniel Gödde; 
  • Frank Breuckmann; 
  • Jan Ehlers

ABSTRACT

Background:

Continuing Medical Education (CME) is a legal and ethical obligation for physicians in Germany. The rapid rise of large language models (LLMs) such as ChatGPT, Gemini, Claude, and Grok raises concerns about the integrity of CME assessments, as LLMs can already pass German CME tests.

Objective:

To determine whether document format (searchable PDF, raster PDF, or vector PDF) and the choice of LLM influence whether LLMs can solve CME test questions above the passing threshold specified for each CME module (typically 70%).

Methods:

In a fully crossed, within-subjects, repeated-measures design, 18 expired CME articles from three major German publishers across six specialties will be converted into three PDF formats and processed by four current LLMs (ChatGPT-5, Mistral 3.1 small, Claude Sonnet 4, Grok-4) and two predecessor versions (ChatGPT-4o and Grok-3). Each model will answer every article once per file-format condition, yielding 18 experimental conditions (6 models × 3 formats). The primary outcome is the proportion of correctly answered questions; secondary outcomes are pass/fail rate and efficiency. The study has been approved by the University of Witten/Herdecke Ethics Committee (reference number S-260/2025, dated October 8, 2025) and is preregistered at the Open Science Framework (DOI: 10.17605/OSF.IO/V96R5).
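The factorial structure above can be sketched as follows. This is an illustrative enumeration only (article labels are placeholders, not the study's actual identifiers): crossing 6 models with 3 file formats gives the 18 experimental conditions, and running all 18 articles through every condition gives 324 model responses.

```python
from itertools import product

# Models and formats as named in the protocol; article labels are hypothetical.
models = ["ChatGPT-5", "Mistral 3.1 small", "Claude Sonnet 4",
          "Grok-4", "ChatGPT-4o", "Grok-3"]
formats = ["searchable PDF", "raster PDF", "vector PDF"]
articles = [f"article_{i:02d}" for i in range(1, 19)]  # 18 expired CME articles

# Fully crossed design: every model sees every format (18 conditions) ...
conditions = list(product(models, formats))
# ... and every article is answered once per condition (324 runs total).
runs = list(product(articles, models, formats))

print(len(conditions))  # 18
print(len(runs))        # 324
```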

Results:

Data collection will start in January 2026 and will last approximately 4 weeks. As of December 2025, the study has been preregistered, and no results are available yet. The analyses will quantify performance differences across document formats and model generations; these findings may inform the feasibility of non-searchable document formats as a temporary measure to reduce AI-enabled cheating risks in CME contexts.

Conclusions:

By quantifying how document format constrains LLM performance, this study aims to evaluate simple technical safeguards that may reduce AI-assisted manipulation of CME tests and inform regulators and CME providers on balancing assessment validity, accessibility, and responsible LLM integration into postgraduate medical education. Clinical Trial: Open Science Framework DOI: 10.17605/OSF.IO/V96R5.


Citation

Please cite as:

Özmen L, Sellmann T, Burisch C, Gödde D, Breuckmann F, Ehlers J

Large Language Models in German Continuing Medical Education Assessment: Fully Crossed Experimental Study Protocol

JMIR Preprints. 18/01/2026:91675

DOI: 10.2196/preprints.91675

URL: https://preprints.jmir.org/preprint/91675


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.