Accepted for/Published in: JMIR Formative Research
Date Submitted: May 15, 2025
Date Accepted: Nov 10, 2025
Feasibility of a Specialized Large Language Model for Postgraduate Medical Exam Preparation: A Single-Center Proof-of-Concept Study
ABSTRACT
Background:
Large Language Models (LLMs) are increasingly used in medical education for feedback and grading, yet their role in postgraduate examination preparation remains uncertain due to inconsistent grading, hallucinations, and user acceptance.
Objective:
This study evaluated the Personalized Anesthesia Study Support (PASS), a specialized GPT-4 model developed to assist candidates preparing for Singapore’s postgraduate specialist Anesthesiology examination. We assessed user acceptance, grading inter-rater reliability (IRR), and hallucination detection rates to determine the feasibility of integrating specialized LLMs into high-stakes exam preparation.
Methods:
PASS was built on OpenAI’s GPT-4 and adapted with domain-specific prompts and references. Twenty-one senior anesthesiology residents completed a mock Short Answer Question (SAQ) examination, which was independently graded by three human examiners and three PASS iterations. Participants reviewed feedback from both PASS and standard GPT-4 and completed a Technology Acceptance Model (TAM) survey. Grading reliability was evaluated using Cohen’s and Fleiss’ Kappa. Hallucination rates were assessed by participants and examiners.
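The agreement statistic named in the Methods can be illustrated with a minimal sketch. The snippet below is not the authors' analysis code; it is a self-contained implementation of Cohen's kappa for two raters assigning categorical grades to the same set of answers, computing observed agreement against chance agreement derived from each rater's marginal label frequencies.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters grading the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected by chance from marginal frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a, "paired ratings required"
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label probabilities.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pass/fail grades from two examiners on eight answers.
examiner_1 = [1, 1, 0, 1, 0, 0, 1, 1]
examiner_2 = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohen_kappa(examiner_1, examiner_2), 3))  # moderate agreement
```

Fleiss' kappa, also cited in the Methods, generalizes this idea to more than two raters; in practice both are available in standard statistics packages rather than hand-rolled as above.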
Results:
Seventeen participants completed the TAM survey, totaling 136 responses. PASS scored significantly higher than standard GPT-4 in usefulness (mean = 4.25, p < 0.001), efficiency (p < 0.001), and likelihood of future use (p < 0.001), with no significant difference in ease of use (p = 0.35). Internal grading agreement among PASS instances was moderate (κ = 0.522) and higher than among human examiners (κ = 0.275). Agreement with a reference human examiner (Examiner 1) was comparable between PASS and human graders. Among the 316 PASS-generated responses, 67 hallucinations and 189 deviations were identified. Hallucination detection rates were comparable between candidates (14.9%) and examiners (22.9%, p = 0.212), but deviation detection was higher among examiners (67.5% vs. 31.3%, p < 0.0001).
Conclusions:
PASS demonstrated strong user acceptance and grading reliability, suggesting feasibility in high-stakes exam preparation. Experienced learners could identify major hallucinations, suggesting its viability in self-directed learning. Further research should refine grading accuracy and explore multicenter evaluation of specialized LLMs for postgraduate medical education.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.