Accepted for/Published in: JMIR Medical Education
Date Submitted: Aug 20, 2025
Date Accepted: Apr 20, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Novel Method for Detecting Ambiguity in Medical Exams Using Large Language Models
ABSTRACT
Background:
Large Language Models (LLMs) have emerged as promising tools in medical education due to their ability to understand, generate and reason with natural language. Their ability to simulate expert reasoning extends beyond answering questions, enabling them to support quality control in assessment design. In this study, we evaluated the utility of LLMs in identifying ambiguous or poorly constructed exam items in critical care academic assessments.
Objective:
We developed automated ambiguity and quality scores to objectively assess individual questions and entire exam components.
Methods:
We analyzed 264 questions from academic exams conducted over three academic years (2023 to 2025) at the Medical School of Université Côte d’Azur. Questions were drawn from four docimological formats: Progressive Clinical Cases (PCC), Mini-PCC (mPCC), Key Feature Problems (KFP) and Isolated Questions Sequence (IQS). Each question was submitted to four LLMs (ChatGPT, Gemini Pro, Le Chat and DeepSeek) without prompt engineering. Model performance was evaluated against the official correction key. We assigned four binary diagnostic tags to each question based on model agreement and self-reported ambiguity: ambiguity, low performance, incoherence and subjective ambiguity. These tags were combined into a composite ambiguity score per question and contributed to a weighted quality score for each exam component.
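As an illustration only, the scoring step can be sketched in Python. The abstract does not give the exact formulas, so this sketch assumes that the composite ambiguity score is the count of raised tags (0 to 4) and that the quality score is one minus the weighted share of raised tags across a component. The names `ItemTags`, `ambiguity_score` and `quality_score`, and the equal default weights, are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass, astuple
from typing import Sequence


@dataclass(frozen=True)
class ItemTags:
    """The four binary diagnostic tags for one exam question."""
    ambiguity: bool
    low_performance: bool
    incoherence: bool
    subjective_ambiguity: bool


def ambiguity_score(tags: ItemTags) -> int:
    """Assumed composite score: the number of raised tags (0 to 4)."""
    return sum(astuple(tags))


def quality_score(
    items: Sequence[ItemTags],
    weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0),  # hypothetical equal weights
) -> float:
    """Assumed quality score for an exam component, between 0 and 1.

    1.0 means no tag was raised on any question; 0.0 means every tag
    was raised on every question.
    """
    if not items:
        raise ValueError("an exam component needs at least one question")
    if len(weights) != 4:
        raise ValueError("expected one weight per tag")
    # Weighted sum of raised tags over all questions in the component.
    penalty = sum(
        w * flag for item in items for w, flag in zip(weights, astuple(item))
    )
    max_penalty = sum(weights) * len(items)
    return 1.0 - penalty / max_penalty


# Example: one question with two raised tags, one question with none.
component = [
    ItemTags(True, False, True, False),
    ItemTags(False, False, False, False),
]
scores = [ambiguity_score(item) for item in component]
component_quality = quality_score(component)
```

Under these assumptions, a question with two raised tags receives a composite score of 2, and different weights can be passed to `quality_score` to make some tags count more than others.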
Results:
LLMs performed comparably to students overall and significantly better on the Mini-PCC and IQS formats. IQS items had the highest ambiguity scores. Tag patterns revealed frequent issues with ambiguity and inconsistency. Quality scores varied across academic years. IQS items predominantly showed moderate ambiguity (score 2), with occasional strong signals. Quality did not differ significantly by author specialty or seniority.
Conclusions:
LLMs can serve as objective tools to proactively detect ambiguous exam questions and estimate the overall quality of an exam. Integrating these tools into the assessment design process can reduce the need for post-exam corrections and improve fairness and clarity in medical evaluations.