Accepted for/Published in: JMIR Formative Research
Date Submitted: Apr 27, 2025
Open Peer Review Period: Nov 6, 2025 - Jan 1, 2026
Date Accepted: Nov 10, 2025
Teaching Clinical Reasoning in the Age of AI: A Mixed-Methods Formative Evaluation of AI-Generated Script Concordance Tests and Expert Embodiment
ABSTRACT
Background:
The integration of artificial intelligence (AI) in medical education is evolving, offering new tools to enhance teaching and assessment. Among these, script concordance tests (SCTs) are well suited to evaluating clinical reasoning under uncertainty. Traditionally, SCTs require expert panels for scoring and feedback, which can be resource intensive. Recent advances in generative AI, particularly large language models (LLMs), suggest the possibility of replacing human experts with simulated ones, though this potential remains underexplored.
Objective:
This study aimed to evaluate whether LLMs can effectively simulate expert judgment in SCTs by using generative AI to author, score, and provide feedback for SCTs in cardiology and pneumology. A secondary goal was to assess students’ perceptions of the test’s difficulty and the pedagogical value of AI-generated feedback.
Methods:
A cross-sectional, mixed-methods study was conducted with 25 second-year medical students who completed a 32-item SCT authored by ChatGPT-4o. Six LLMs (three trained on course material and three untrained) served as simulated experts to generate scoring keys and feedback. Students answered SCT questions, rated perceived difficulty, and selected the most helpful feedback explanation for each item. Quantitative analysis included scoring, difficulty ratings, and correlation between student and AI responses. Qualitative comments were thematically analyzed.
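The scoring described here follows the standard SCT aggregate scoring method, in which each item is scored against the spread of panel answers rather than a single correct response: the modal panel answer earns full credit and less frequent panel answers earn proportional partial credit. The minimal Python sketch below illustrates that method with the six simulated-expert (LLM) responses acting as the panel; the function names and example values are hypothetical and are not taken from the authors' actual pipeline.

```python
from collections import Counter

def sct_scoring_key(panel_answers):
    """Build an aggregate scoring key for one SCT item from a panel's
    Likert-scale answers (here, the six simulated-expert responses).
    The modal answer earns full credit; other answers chosen by some
    panellists earn credit proportional to how often they were chosen."""
    counts = Counter(panel_answers)
    modal_count = max(counts.values())
    return {answer: n / modal_count for answer, n in counts.items()}

def score_student(student_answers, panels):
    """Sum the credit a student earns across all SCT items."""
    total = 0.0
    for answer, panel in zip(student_answers, panels):
        key = sct_scoring_key(panel)
        total += key.get(answer, 0.0)  # answers no panellist chose score 0
    return total

# Hypothetical example: one item answered on a 5-point Likert scale
# (-2 .. +2) by six simulated experts.
panels = [[1, 1, 2, 1, 0, 1]]
print(score_student([1], panels))  # modal answer -> full credit: 1.0
print(score_student([0], panels))  # minority answer -> partial credit: 0.25
```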
Results:
The average student score was 22.8 out of 32 (SD = 1.6), with scores ranging from 19.75 to 26.75. Trained AI systems showed significantly higher concordance with student responses (ρ = 0.64) than untrained models (ρ = 0.41). AI-generated feedback was rated as most helpful in 62.5% of cases, especially when provided by trained models. The SCT demonstrated good internal consistency (Cronbach’s α = 0.76), and students reported moderate perceived difficulty (mean = 3.7/7). Qualitative feedback highlighted appreciation for SCTs as reflective tools, while recommending clearer guidance on Likert-scale use and more contextual detail in vignettes.
Conclusions:
This is among the first studies to demonstrate that trained generative AI models can reliably simulate expert clinical reasoning in a script concordance framework. The findings suggest that AI can both streamline SCT design and offer educationally valuable feedback without compromising authenticity. Future studies should explore longitudinal effects on learning and assess how hybrid (human and AI) models can optimize reasoning instruction in medical education.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.