Accepted for/Published in: JMIR Medical Education
Date Submitted: Dec 2, 2024
Date Accepted: Apr 30, 2025
Chatbot’s role in generating Single Best Answer (SBA) questions for undergraduate medical student assessment: A comparative analysis.
ABSTRACT
Background:
Programmatic assessment offers flexible learning modalities that support individual progression but presents a challenge to educators, who are required to develop frequent assessments reflecting different competencies. Multiple-choice questions (MCQs) have been adopted in medical education worldwide for assessing knowledge and clinical reasoning skills in high-stakes undergraduate and postgraduate medical exams. Continuously creating large volumes of assessment items, in a consistent format and within a comparatively restricted time, is laborious. To address this challenge, the application of technological innovations, including artificial intelligence (AI), has been explored. A major concern is the validity of the information produced by AI tools, which, if not properly verified, can result in inaccurate and therefore inappropriate assessments.
Objective:
This study was designed to examine the content validity and consistency of different AI Chatbots in creating single best answer (SBA) questions, a refined MCQ format better suited to assessing higher levels of knowledge, for undergraduate medical students.
Methods:
The study followed three steps: (1) three researchers used a unified prompt script to generate ten SBA questions across four Chatbot platforms; (2) the Chatbot outputs were assessed for consistency by identifying similarities and differences between users and across the different Chatbots; and (3) the questions were internally moderated, using a rating scale developed by the research team, to evaluate scientific accuracy and educational quality.
Results:
In response to the prompts, all Chatbots generated ten questions each, except Bing, which was unable to respond to one prompt. ChatGPT Plus showed the highest degree of variation in the questions it generated across multiple users; however, it fell short of satisfying the “cover test”. Overall, Gemini performed well across most items, except for item balance; it also stood out by creating questions with a lead-in that relied heavily on the vignette for the answer, but it was let down by its preference for one answer option. Bing scored low on most evaluation items but performed well in generating lead-in questions of appropriate length. SBA questions from three Chatbots (ChatGPT, Gemini, and ChatGPT Plus) had very similar Item Content Validity Index and Scale Content Validity Index values. In comparison, Bing performed worse in content clarity, overall validity, and accuracy of item construction.
Conclusions:
AI Chatbots can aid the production of questions aligned with learning objectives, and individual Chatbots have their own strengths and weaknesses. Nevertheless, all require expert evaluation to ensure their suitability for use. Using AI to generate SBAs prompts us to reconsider Bloom’s taxonomy of the cognitive domain, which traditionally positions creation as the highest level of cognition.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.