Accepted for/Published in: JMIR Medical Education

Date Submitted: Aug 20, 2025
Date Accepted: Apr 20, 2026

The final, peer-reviewed published version of this preprint can be found here:

Lombardi R, Destere A, Dellamonica J, Gérard AO, Jozwiak M

Ambiguity Detection in Medical Exams via Large Language Models: Retrospective Cross-Sectional Pilot Study

JMIR Med Educ 2026;12:e82702

DOI: 10.2196/82702

PMID: 42190230

PMCID: 13211589

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A Novel Method for Detecting Ambiguity in Medical Exams Using Large Language Models

  • Romain Lombardi
  • Alexandre Destere
  • Jean Dellamonica
  • Alexandre O. Gérard
  • Mathieu Jozwiak

ABSTRACT

Background:

Large language models (LLMs) have emerged as promising tools in medical education because they can understand, generate, and reason with natural language. Their capacity to simulate expert reasoning extends beyond answering questions and enables them to support quality control in assessment design. In this study, we evaluated the utility of LLMs in identifying ambiguous or poorly constructed exam items in critical care academic assessments.

Objective:

We aimed to develop automated ambiguity and quality scores to objectively assess individual questions and entire exam components.

Methods:

We analyzed 264 questions from academic exams conducted over three academic years (2023 to 2025) at the Medical School of Université Côte d’Azur. Questions were drawn from four docimological formats: Progressive Clinical Cases (PCC), Mini-PCC (mPCC), Key Feature Problems (KFP), and Isolated Questions Sequence (IQS). Each item was submitted to four LLMs (ChatGPT, Gemini Pro, Le Chat, and DeepSeek) without prompt engineering. LLM performance was evaluated against the official correction key. We applied four binary diagnostic tags based on model agreement and self-reported ambiguity: ambiguity, low performance, incoherence, and subjective ambiguity. These tags were combined into a composite ambiguity score for each question and contributed to a weighted quality score for each exam component.
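The scoring pipeline described above can be illustrated with a minimal Python sketch. The abstract does not give the tag definitions, thresholds, or weights, so everything below is an assumption made for illustration: the names `ItemResult`, `diagnostic_tags`, `ambiguity_score`, and `quality_score` are hypothetical, the thresholds (0.75 and 0.5) and the equal tag weights are placeholders, and the composite score is assumed to be a simple count of raised tags (0 to 4).

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical tag weights; the abstract does not report the actual weighting.
TAG_WEIGHTS = {
    "ambiguity": 1.0,
    "low_performance": 1.0,
    "incoherence": 1.0,
    "subjective_ambiguity": 1.0,
}


@dataclass
class ItemResult:
    """Responses of the LLM panel to one exam question (illustrative structure)."""
    llm_answers: list           # one answer per model, e.g. ["A", "B", "A", "A"]
    official_key: str           # answer from the official correction key
    self_reported: list         # one bool per model: did it flag the item as ambiguous?


def diagnostic_tags(item, agreement_threshold=0.75, performance_threshold=0.5):
    """Return four binary tags. All definitions here are assumed, not the authors'."""
    n = len(item.llm_answers)
    top_count = Counter(item.llm_answers).most_common(1)[0][1]
    n_correct = sum(a == item.official_key for a in item.llm_answers)
    return {
        # Assumed: models disagree with each other.
        "ambiguity": top_count / n < agreement_threshold,
        # Assumed: fewer than half of the models match the key.
        "low_performance": n_correct / n < performance_threshold,
        # Assumed: models are unanimous but all contradict the key.
        "incoherence": top_count == n and n_correct == 0,
        # Assumed: at least one model self-reports ambiguity.
        "subjective_ambiguity": any(item.self_reported),
    }


def ambiguity_score(tags):
    """Composite score, assumed here to be the number of raised tags (0 to 4)."""
    return sum(tags.values())


def quality_score(items, weights=TAG_WEIGHTS):
    """Weighted quality of an exam component in [0, 1]; 1 means no tag was raised."""
    penalty = sum(
        weights[name] for item in items
        for name, raised in diagnostic_tags(item).items() if raised
    )
    return 1.0 - penalty / (len(items) * sum(weights.values()))


clear_item = ItemResult(["A", "A", "A", "A"], "A", [False] * 4)
split_item = ItemResult(["A", "B", "C", "A"], "A", [True, False, False, False])
key_conflict = ItemResult(["B", "B", "B", "B"], "A", [False] * 4)
component = [clear_item, split_item, key_conflict]
scores = [ambiguity_score(diagnostic_tags(i)) for i in component]
```

Under these assumed definitions, a question on which the models split and one model flags ambiguity receives a score of 2, as does a question on which all models agree with each other but contradict the key. Different thresholds or weights would change the scores, so the sketch shows only the structure of the computation.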

Results:

LLMs performed comparably to students, with statistically significant superior performance on the mPCC and IQS formats. IQS items had the highest ambiguity scores. Tag patterns revealed frequent issues with ambiguity and inconsistency. Quality scores varied across academic years. IQS items predominantly showed moderate ambiguity (score 2), with occasional strong ambiguity signals. There was no significant difference in quality based on author specialty or seniority.

Conclusions:

LLMs can serve as objective tools to proactively detect ambiguous exam questions and estimate the overall quality of an exam. Integrating these tools into the assessment design process can reduce the need for post-exam corrections and improve fairness and clarity in medical evaluations.



Per the author's request the PDF is not available.