Accepted for/Published in: JMIR Formative Research
Date Submitted: Sep 26, 2025
Date Accepted: Dec 24, 2025
Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: A Psychometric Comparison with Faculty-Written Items
ABSTRACT
Background:
Multiple-choice questions (MCQs) are widely used in medical education because they enable standardized and objective assessment. Developing high-quality items requires both subject-matter expertise and methodological rigor. Large language models (LLMs) offer new opportunities for automated item generation; however, most evaluations to date rely on general-purpose prompting, and psychometric comparisons with faculty-written items remain scarce.
Objective:
This study aimed to evaluate whether a supervised fine-tuned LLM can generate multiple-choice questions with psychometric properties comparable to those of expert-authored items in a real undergraduate anesthesiology examination. The study further examined whether the two item sets differed in difficulty, selectivity, and discrimination index.
Methods:
The study was embedded in the regular written anesthesiology examination of the eighth-semester medical curriculum, taken by 157 students. The examination comprised 30 single-best-answer MCQs, of which 15 were written by senior faculty and 15 were generated by a fine-tuned GPT-based model. The custom model was adapted using anesthesiology lecture slides, the National Competence-Based Learning Objectives Catalogue (NKLM 2.0), past examination questions, and faculty publications through supervised instruction-tuning with standardized prompt–response pairs. Item analysis followed established psychometric standards.
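As an illustration only, the sketch below shows how the three reported item indices are commonly computed, assuming the standard definitions (difficulty as the proportion of correct responses, selectivity as the corrected item-total point-biserial correlation, and discrimination index as the difference in proportion correct between upper- and lower-scoring groups); the abstract does not state the exact formulas used, and the function name and group threshold are illustrative assumptions.

```python
# Minimal sketch of standard item-analysis indices (assumed definitions,
# not the authors' implementation).
import numpy as np

def item_analysis(responses: np.ndarray, group_frac: float = 0.27):
    """responses: binary matrix (n_students x n_items), 1 = correct answer."""
    n_students, n_items = responses.shape
    totals = responses.sum(axis=1)

    # Difficulty: proportion of students answering each item correctly
    difficulty = responses.mean(axis=0)

    # Selectivity: corrected item-total (point-biserial) correlation;
    # items answered identically by all students yield NaN here
    selectivity = np.array([
        np.corrcoef(responses[:, i], totals - responses[:, i])[0, 1]
        for i in range(n_items)
    ])

    # Discrimination index: upper-group minus lower-group proportion correct
    order = np.argsort(totals)
    k = max(1, int(round(group_frac * n_students)))
    lower, upper = order[:k], order[-k:]
    discrimination = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

    return difficulty, selectivity, discrimination
```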
Results:
In total, 29 items (14 human, 15 AI) were analyzed. Human-generated questions had a mean difficulty of 0.81 (SD = 0.19), selectivity of 0.26 (SD = 0.13), and discrimination index of 0.09 (SD = 0.08). AI-generated questions had a mean difficulty of 0.79 (SD = 0.18), selectivity of 0.20 (SD = 0.13), and discrimination index of 0.08 (SD = 0.11). Mann-Whitney U tests revealed no significant differences between human- and AI-generated items for difficulty (p = 0.377), selectivity (p = 0.158), or discrimination index (p = 0.591). Categorical analyses confirmed no significant group differences. Both sets, however, showed only modest psychometric quality.
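For readers unfamiliar with the reported test, the following is a hedged sketch of the group comparison, assuming two-sided Mann-Whitney U tests applied to the per-item indices; the arrays are placeholder values for illustration, not the study data.

```python
# Illustrative Mann-Whitney U comparison of per-item indices (placeholder data)
from scipy.stats import mannwhitneyu

human_items = [0.95, 0.88, 0.74, 0.61, 0.90]  # e.g., difficulty of human-written items
ai_items    = [0.92, 0.85, 0.70, 0.66, 0.81]  # e.g., difficulty of AI-generated items

u_stat, p_value = mannwhitneyu(human_items, ai_items, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```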
Conclusions:
Supervised fine-tuned LLMs are capable of generating multiple-choice questions with psychometric properties comparable to those written by experienced faculty. Given the limitations and cohort-dependency of psychometric indices, automated item generation should be considered a complement rather than a replacement for manual item writing. Further research with larger item sets and multi-institutional validation is needed to confirm generalizability and optimize integration of LLM-based tools into assessment development.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.