Accepted for/Published in: JMIR Formative Research
Date Submitted: May 15, 2025
Date Accepted: Nov 10, 2025
Feasibility of a Specialized Large Language Model for Postgraduate Medical Exam Preparation: A Single-Center Proof-of-Concept Study
ABSTRACT
Background:
Large Language Models (LLMs) are increasingly used in medical education for feedback and grading, yet their role in postgraduate examination preparation remains uncertain due to inconsistent grading, hallucinations, and user acceptance.
Objective:
This study evaluated the Personalized Anesthesia Study Support (PASS), a specialized GPT-4 model developed to assist candidates preparing for Singapore’s postgraduate specialist Anesthesiology examination. We assessed user acceptance, grading inter-rater reliability (IRR), and hallucination detection rates to determine the feasibility of integrating specialized LLMs into high-stakes exam preparation.
Methods:
PASS was built on OpenAI’s GPT-4 and adapted with domain-specific prompts and references. Twenty-one senior anesthesiology residents completed a mock Short Answer Question (SAQ) examination, which was independently graded by three human examiners and three PASS iterations. Participants reviewed feedback from both PASS and standard GPT-4 and completed a Technology Acceptance Model (TAM) survey. Grading reliability was evaluated using Cohen’s and Fleiss’ Kappa. Hallucination rates were assessed by participants and examiners.
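The agreement statistic named in the Methods can be illustrated with a minimal sketch. The snippet below is not the authors' analysis code; it is a self-contained implementation of Cohen's kappa for two raters assigning categorical grades to the same set of answers, computing observed agreement against chance agreement derived from each rater's marginal label frequencies.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters grading the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance from marginal frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a, "paired ratings required"
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label probabilities.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail grades from two examiners on eight answers.
examiner_1 = [1, 1, 0, 1, 0, 0, 1, 1]
examiner_2 = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohen_kappa(examiner_1, examiner_2), 3))  # moderate agreement
```

Fleiss' kappa, also cited in the Methods, generalizes this idea to more than two raters; in practice both are available in standard statistics packages rather than hand-rolled as above.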
Results:
Seventeen participants completed the TAM survey, totaling 136 responses. PASS scored significantly higher than standard GPT-4 in usefulness (mean = 4.25, p < 0.001), efficiency (p < 0.001), and likelihood of future use (p < 0.001), with no significant difference in ease of use (p = 0.35). Internal grading agreement among PASS instances was moderate (κ = 0.522) and higher than among human examiners (κ = 0.275). Agreement with a reference human examiner (Examiner 1) was comparable between PASS and human graders. Among the 316 PASS-generated responses, 67 hallucinations and 189 deviations were identified. Hallucination detection rates were comparable between candidates (14.9%) and examiners (22.9%, p = 0.212), but deviation detection was higher among examiners (67.5% vs. 31.3%, p < 0.0001).
Conclusions:
PASS demonstrated strong user acceptance and grading reliability, suggesting feasibility in high-stakes exam preparation. Experienced learners could identify major hallucinations, suggesting its viability in self-directed learning. Further research should refine grading accuracy and explore multicenter evaluation of specialized LLMs for postgraduate medical education.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.