JMIR Preprints #82702: A Novel Method for Detecting Ambiguity in Medical Exams Using Large Language Models

A Novel Method for Detecting Ambiguity in Medical Exams Using Large Language Models

Romain Lombardi;
Alexandre Destere;
Jean Dellamonica;
Alexandre O. Gérard;
Mathieu Jozwiak

ABSTRACT

Background:

Large Language Models (LLMs) have emerged as promising tools in medical education due to their ability to understand, generate and reason with natural language. Their ability to simulate expert reasoning extends beyond answering questions, enabling them to support quality control in assessment design. In this study, we evaluated the utility of LLMs in identifying ambiguous or poorly constructed exam items in critical care academic assessments.

Objective:

We aimed to develop automated ambiguity and quality scores that objectively assess individual questions and entire exam components.

Methods:

We analyzed 264 questions from academic exams conducted over three academic years (2023 to 2025) at the Medical School of Université Côte d’Azur. Questions were drawn from four docimological formats: Progressive Clinical Cases (PCC), Mini-PCC (mPCC), Key Feature Problems (KFP) and Isolated Questions Sequence (IQS). Each item was submitted to four LLMs (ChatGPT, Gemini Pro, Le Chat and DeepSeek) without prompt engineering. LLM performance was evaluated against the official correction key. We applied four binary diagnostic tags based on model agreement and self-reported ambiguity: ambiguity, low performance, incoherence and subjective ambiguity. These tags were combined into a composite ambiguity score for each question and contributed to a weighted quality score for each exam component.
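
As an illustration only, the scoring step described above might be sketched as follows. The abstract does not give the exact formulas, so this sketch assumes that the composite ambiguity score is the count of raised tags (0 to 4) and that the component quality score is one minus a weighted, normalized penalty; the names `ItemTags`, `ambiguity_score`, `component_quality` and the default `penalties` are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ItemTags:
    """The four binary diagnostic tags named in the Methods, for one question."""
    ambiguity: bool
    low_performance: bool
    incoherence: bool
    subjective_ambiguity: bool


def ambiguity_score(tags: ItemTags) -> int:
    """Composite ambiguity score, ASSUMED here to be the number of raised tags (0-4)."""
    return sum((
        tags.ambiguity,
        tags.low_performance,
        tags.incoherence,
        tags.subjective_ambiguity,
    ))


def component_quality(
    items: Sequence[ItemTags],
    penalties: Sequence[float] = (0.0, 1.0, 2.0, 3.0, 4.0),
) -> float:
    """Quality of an exam component in [0, 1]; 1.0 means no tag was raised.

    ASSUMPTION: penalties[s] is a hypothetical weight for an item whose
    ambiguity score is s; the weighting used in the study is not stated
    in the abstract.
    """
    if not items:
        raise ValueError("an exam component needs at least one item")
    if len(penalties) != 5 or max(penalties) <= 0:
        raise ValueError("penalties must hold five values with a positive maximum")
    total_penalty = sum(penalties[ambiguity_score(item)] for item in items)
    return 1.0 - total_penalty / (len(items) * max(penalties))
```

Under these assumptions, a question with two raised tags receives an ambiguity score of 2, and a component made of three questions scoring 0, 2 and 4 receives a quality of 0.5 with the default linear penalties.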

Results:

LLMs performed comparably to students overall, with statistically significantly better performance on the mPCC and IQS formats. IQS items had the highest ambiguity scores. Tag patterns revealed frequent issues with ambiguity and inconsistency. Quality scores varied across academic years. IQS items predominantly showed moderate ambiguity (score 2), with occasional strong ambiguity signals. There was no significant difference in quality based on author specialty or seniority.

Conclusions:

LLMs can serve as objective tools to proactively detect ambiguous exam questions and estimate the overall quality of an exam. Integrating these tools into the assessment design process can reduce the need for post-exam corrections and improve fairness and clarity in medical evaluations.

Citation

Please cite as:

Lombardi R, Destere A, Dellamonica J, Gérard AO, Jozwiak M

Ambiguity Detection in Medical Exams via Large Language Models: Retrospective Cross-Sectional Pilot Study

JMIR Med Educ 2026;12:e82702

DOI: 10.2196/82702

PMID: 42190230

PMCID: 13211589

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Education

Date Submitted: Aug 20, 2025

Date Accepted: Apr 20, 2026

Per the author's request the PDF is not available.