Accepted for/Published in: JMIR Formative Research

Date Submitted: May 15, 2025
Date Accepted: Nov 10, 2025

The final, peer-reviewed published version of this preprint can be found here:

Feasibility of a Specialized Large Language Model for Postgraduate Medical Examination Preparation: Single-Center Proof-Of-Concept Study

Leong YH, Nambiar L, Tay VY, Lie SA, Yuhe K

JMIR Form Res 2025;9:e77580

DOI: 10.2196/77580

PMID: 41337739

PMCID: 12712563

Feasibility of a Specialized Large Language Model for Postgraduate Medical Exam Preparation: A Single-Center Proof-of-Concept Study

  • Yun Hao Leong; 
  • Lathiga Nambiar; 
  • Victoria Yu Tay; 
  • Sui An Lie; 
  • Ke Yuhe

ABSTRACT

Background:

Large language models (LLMs) are increasingly used in medical education for feedback and grading, yet their role in postgraduate examination preparation remains uncertain because of concerns about inconsistent grading, hallucinations, and user acceptance.

Objective:

This study evaluated Personalized Anesthesia Study Support (PASS), a specialized GPT-4-based model developed to assist candidates preparing for Singapore's postgraduate specialist anesthesiology examination. We assessed user acceptance, grading inter-rater reliability (IRR), and hallucination detection rates to determine the feasibility of integrating specialized LLMs into high-stakes examination preparation.
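
As a concrete illustration of what such a domain-adapted grader can look like, the sketch below assumes the OpenAI Chat Completions API (Python client); the system prompt, function name, and rubric-passing scheme are hypothetical illustrations, not the published PASS configuration.

```python
# Sketch only: one way a domain-adapted GPT-4 grader could be wired up
# with the OpenAI Chat Completions API. The system prompt, function
# name, and rubric-passing scheme are hypothetical illustrations, not
# the published PASS configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an examiner for a postgraduate anesthesiology short answer "
    "question (SAQ) examination. Grade the candidate's answer against "
    "the supplied model answer and rubric, then give structured "
    "feedback that cites the rubric points."
)

def grade_answer(question: str, rubric: str, candidate_answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep repeated grading runs as consistent as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                f"Question:\n{question}\n\n"
                f"Model answer and rubric:\n{rubric}\n\n"
                f"Candidate answer:\n{candidate_answer}"
            )},
        ],
    )
    return response.choices[0].message.content
```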

Methods:

PASS was built on OpenAI's GPT-4 and adapted with domain-specific prompts and references. Twenty-one senior anesthesiology residents completed a mock Short Answer Question (SAQ) examination, which was graded independently by three human examiners and three PASS instances. Participants reviewed feedback from both PASS and standard GPT-4 and completed a Technology Acceptance Model (TAM) survey. Grading reliability was evaluated using Cohen's and Fleiss' kappa. Hallucination rates were assessed by both participants and examiners.
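
For readers unfamiliar with the agreement statistics named above, this is a minimal sketch of computing pairwise Cohen's kappa and overall Fleiss' kappa for three graders; the rating matrix and grade bands are toy values, not study data.

```python
# Sketch only: pairwise Cohen's kappa and overall Fleiss' kappa for
# three graders. The rating matrix and grade bands are toy values,
# not study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats import inter_rater as irr

# ratings[i, j] = grade band assigned to answer i by grader j
# (e.g., three human examiners, or three PASS instances)
ratings = np.array([
    [2, 2, 1],
    [0, 1, 0],
    [2, 2, 2],
    [1, 1, 2],
    [0, 0, 1],
])

# Pairwise agreement (Cohen's kappa) between each pair of graders
n_graders = ratings.shape[1]
for a in range(n_graders):
    for b in range(a + 1, n_graders):
        k = cohen_kappa_score(ratings[:, a], ratings[:, b])
        print(f"Cohen's kappa, grader {a + 1} vs {b + 1}: {k:.3f}")

# Overall agreement (Fleiss' kappa) across all three graders
counts, _ = irr.aggregate_raters(ratings)  # rows -> per-category counts
print(f"Fleiss' kappa: {irr.fleiss_kappa(counts):.3f}")
```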

Results:

Seventeen participants completed the TAM survey, yielding 136 responses. PASS scored significantly higher than standard GPT-4 in usefulness (mean = 4.25, p < 0.001), efficiency (p < 0.001), and likelihood of future use (p < 0.001), with no difference in ease of use (p = 0.35). Internal grading agreement among PASS instances was moderate (κ = 0.522) and higher than among human examiners (κ = 0.275). Agreement with a reference human examiner (Examiner 1) was comparable between PASS and the human graders. Among the 316 PASS-generated responses, 67 hallucinations and 189 deviations were identified. Hallucination detection rates were comparable between candidates (14.9%) and examiners (22.9%; p = 0.212), but deviation detection was higher among examiners (67.5% vs. 31.3%; p < 0.0001).
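
The abstract does not name the statistical test behind these rate comparisons; as a sketch of one plausible approach, Fisher's exact test on 2×2 detection tables is shown below, with counts approximately back-calculated from the reported percentages (these counts are a reconstruction, not the study's raw data).

```python
# Sketch only: comparing candidate vs examiner detection rates with
# Fisher's exact test. Counts are back-calculated from the reported
# percentages (67 hallucinations, 189 deviations) and are therefore
# approximations, not the study's raw data.
from scipy.stats import fisher_exact

# Hallucinations (n = 67): [detected, missed]
hallucinations = [[10, 57],    # candidates, ~14.9% detected
                  [15, 52]]    # examiners,  ~22.4% detected
odds, p = fisher_exact(hallucinations)
print(f"Hallucination detection: p = {p:.3f}")

# Deviations (n = 189): [detected, missed]
deviations = [[59, 130],       # candidates, ~31.2% detected
              [128, 61]]       # examiners,  ~67.7% detected
odds, p = fisher_exact(deviations)
print(f"Deviation detection: p = {p:.2g}")
```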

Conclusions:

PASS demonstrated strong user acceptance and grading reliability, suggesting feasibility in high-stakes exam preparation. Experienced learners could identify major hallucinations, suggesting its viability in self-directed learning. Further research should refine grading accuracy and explore multicenter evaluation of specialized LLMs for postgraduate medical education.


Citation

Please cite as:

Leong YH, Nambiar L, Tay VY, Lie SA, Yuhe K

Feasibility of a Specialized Large Language Model for Postgraduate Medical Examination Preparation: Single-Center Proof-Of-Concept Study

JMIR Form Res 2025;9:e77580

DOI: 10.2196/77580

PMID: 41337739

PMCID: 12712563


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.