Accepted for/Published in: JMIR Formative Research

Date Submitted: Sep 26, 2025
Date Accepted: Dec 24, 2025

The final, peer-reviewed published version of this preprint can be found here:

Hölzing CR, Meynhardt C, Meybohm P, König S, Kranke P

Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: Psychometric Comparison With Faculty-Written Items

JMIR Form Res 2026;10:e84904

DOI: 10.2196/84904

PMID: 41707182

PMCID: 12916093

Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: A Psychometric Comparison with Faculty-Written Items

  • Carlos Ramon Hölzing; 
  • Charlotte Meynhardt; 
  • Patrick Meybohm; 
  • Sarah König; 
  • Peter Kranke

ABSTRACT

Background:

Multiple-choice questions (MCQs) are widely used in medical education to ensure standardized and objective assessment. Developing high-quality items requires both subject expertise and methodological rigor. Large language models (LLMs) offer new opportunities for automated item generation. However, most evaluations rely on general-purpose prompting, and psychometric comparisons with faculty-written items remain scarce.

Objective:

This study aimed to evaluate whether a supervised fine-tuned LLM can generate multiple-choice questions with psychometric properties comparable to those of expert-authored items in a real undergraduate anesthesiology examination. The study further examined whether the two item sets differed in difficulty, selectivity, and discrimination index.

Methods:

The study was embedded in the regular written anesthesiology examination of the eighth-semester medical curriculum, taken by 157 students. The exam comprised 30 single-best-answer MCQs, of which 15 were written by senior faculty and 15 were generated by a fine-tuned GPT-based model. The model was adapted with anesthesiology lecture slides, the National Competence-Based Learning Objectives Catalogue (NKLM 2.0), past exam questions, and faculty publications, using supervised instruction tuning with standardized prompt–response pairs. Item analysis followed established psychometric standards.
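As a concrete illustration of what "standardized prompt–response pairs" for supervised instruction tuning can look like, here is a minimal sketch in the chat-style JSONL format commonly used to fine-tune GPT-based models. The schema, instruction wording, and file name are assumptions for illustration; the abstract does not disclose the authors' actual training-data format.

```python
import json

# Hypothetical example of one standardized prompt-response pair in the
# chat-style JSONL format commonly used for supervised fine-tuning of
# GPT-based models. Field contents are illustrative assumptions only.
training_example = {
    "messages": [
        {
            "role": "system",
            "content": "You are an anesthesiology faculty member writing "
                       "single-best-answer MCQs aligned with the NKLM 2.0.",
        },
        {
            "role": "user",
            "content": "Write one exam-level MCQ on rapid sequence induction "
                       "with five options (A-E) and mark the correct answer.",
        },
        {
            "role": "assistant",
            "content": "Which drug is most appropriate for ... A) ... B) ... "
                       "Correct answer: B",
        },
    ]
}

# Fine-tuning datasets of this kind are typically stored one JSON object
# per line (JSONL).
with open("mcq_finetune.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(training_example, ensure_ascii=False) + "\n")
```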

Results:

In total, 29 items (14 human, 15 AI) were analyzed. Human-generated questions had a mean difficulty of 0.81 (SD = 0.19), selectivity of 0.26 (SD = 0.13), and discrimination index of 0.09 (SD = 0.08). AI-generated questions had a mean difficulty of 0.79 (SD = 0.18), selectivity of 0.20 (SD = 0.13), and discrimination index of 0.08 (SD = 0.11). Mann-Whitney U tests revealed no significant differences between human- and AI-generated items for difficulty (p = 0.377), selectivity (p = 0.158), or discrimination index (p = 0.591). Categorical analyses confirmed no significant group differences. Both sets, however, showed only modest psychometric quality.
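The abstract does not spell out how these indices are defined, so the following is a hedged sketch of the conventional item-analysis formulas: difficulty as the proportion of correct responses, the discrimination index as the difference in proportion correct between high- and low-scoring subgroups, and selectivity as the part-whole corrected item-total correlation. The exact variants used in the study may differ.

```latex
% Conventional item-analysis definitions (assumed; the exact variants
% used in the study are not stated in the abstract).
\[
P_i = \frac{n_i^{\text{correct}}}{N}
\quad \text{(difficulty: proportion of the } N \text{ examinees answering item } i \text{ correctly)}
\]
\[
D_i = P_i^{\,\text{upper}} - P_i^{\,\text{lower}}
\quad \text{(discrimination index: proportion correct in the top minus the bottom score group, often the 27\% extremes)}
\]
\[
r_{it} = \operatorname{corr}\!\left(X_i,\; T - X_i\right)
\quad \text{(selectivity: part-whole corrected item-total correlation)}
\]
```

Read this way, the reported mean difficulty of 0.81 for human-written items means that, on average, about 81% of the 157 students answered such an item correctly. The group comparison itself can be reproduced with a standard two-sided Mann-Whitney U test, sketched below; the per-item values are made up for illustration, since the abstract reports only group means and SDs.

```python
from scipy.stats import mannwhitneyu

# Illustrative, made-up per-item difficulty values (the abstract does not
# report item-level data): 14 human-written and 15 AI-generated items.
human_difficulty = [0.95, 0.88, 0.81, 0.79, 0.74, 0.70, 0.92,
                    0.85, 0.98, 0.63, 0.55, 0.90, 0.87, 0.77]
ai_difficulty = [0.93, 0.86, 0.80, 0.78, 0.71, 0.66, 0.90, 0.84,
                 0.97, 0.60, 0.52, 0.88, 0.85, 0.75, 0.69]

# Two-sided Mann-Whitney U test, as used in the abstract for the
# human-vs-AI group comparison.
u_stat, p_value = mannwhitneyu(human_difficulty, ai_difficulty,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```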

Conclusions:

Supervised fine-tuned LLMs are capable of generating multiple-choice questions with psychometric properties comparable to those written by experienced faculty. Given the limitations and cohort-dependency of psychometric indices, automated item generation should be considered a complement rather than a replacement for manual item writing. Further research with larger item sets and multi-institutional validation is needed to confirm generalizability and optimize integration of LLM-based tools into assessment development.


Citation

Please cite as:

Hölzing CR, Meynhardt C, Meybohm P, König S, Kranke P

Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: Psychometric Comparison With Faculty-Written Items

JMIR Form Res 2026;10:e84904

DOI: 10.2196/84904

PMID: 41707182

PMCID: 12916093

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.