Accepted for/Published in: JMIR Formative Research
Date Submitted: Jun 21, 2025
Open Peer Review Period: Jun 23, 2025 - Aug 18, 2025
Date Accepted: Nov 25, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
The Validity of Generative Artificial Intelligence in Evaluating Medical Students in Objective Structured Clinical Examination: A Pilot Study
ABSTRACT
Background:
The Objective Structured Clinical Examination (OSCE) is widely used to evaluate students in medical education. However, the OSCE requires extensive human resources, which poses significant challenges to its implementation. We hypothesized that generative artificial intelligence (AI), such as ChatGPT-4, could serve as a supplemental assessor and alleviate the burden on physicians in the OSCE.
Objective:
This experimental study aims to evaluate the validity of generative AI in evaluating medical students in the OSCE.
Methods:
This study was conducted at a medical university in Japan. We recruited 11 fifth-year medical students during the general internal medicine clerkship from April 2023 to December 2023. Participants conducted mock medical interviews with a patient presenting with abdominal pain and wrote patient notes. Four physicians independently evaluated the participants by reviewing the recorded interview videos and patient notes according to a prespecified evaluation format. We manually transcribed the recorded interviews and input the transcriptions, along with the patient notes and structured prompts, into ChatGPT-4 for evaluation. All inputs and outputs from ChatGPT-4 were in Japanese. The evaluation format consisted of six items: patient care and communication, history taking, physical examination, patient notes, clinical reasoning, and management. Each item was scored on a 6-point Likert scale, ranging from 1 (very poor) to 6 (excellent). Physician and ChatGPT-4 scores were presented as medians with 25th and 75th percentiles and compared using the Wilcoxon signed-rank test, with P<.05 considered statistically significant.
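As a rough illustration of the pipeline described above, the following Python sketch submits one transcript and patient note to a GPT-4-class model through the OpenAI API and requests item-level scores. The prompt wording, model name, and output convention are illustrative assumptions only; the study's actual structured prompts were in Japanese and are not reproduced here.

# Hypothetical sketch, not the authors' code: prompt text, model name, and
# output format are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ITEMS = [
    "patient care and communication",
    "history taking",
    "physical examination",
    "patient notes",
    "clinical reasoning",
    "management",
]

def score_station(transcript: str, patient_note: str) -> str:
    """Ask the model to rate one OSCE encounter on the six items (1-6)."""
    prompt = (
        "You are an OSCE examiner. Using the interview transcript and "
        "patient note below, rate the student on each item from 1 (very "
        "poor) to 6 (excellent). Return one 'item: score' line per item.\n"
        f"Items: {', '.join(ITEMS)}\n\n"
        f"Transcript:\n{transcript}\n\nPatient note:\n{patient_note}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the ChatGPT-4 version used in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return response.choices[0].message.content

In practice, each of the 11 encounters would be scored this way and the returned item scores parsed into a table alongside the physicians' ratings.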
Results:
ChatGPT-4 assigned significantly higher scores than physicians for physical examination (median 4.0 [25th-75th percentile 4.0-5.0] vs 4.0 [3.0-4.0]; P=.015), patient notes (6.0 [5.0-6.0] vs 4.0 [4.0-4.0]; P=.002), clinical reasoning (5.0 [5.0-5.0] vs 4.0 [3.0-4.0]; P<.001), and management (6.0 [5.0-6.0] vs 4.0 [2.5-4.5]; P=.002), whereas there were no significant differences for patient care and communication (5.0 [5.0-5.0] vs 5.0 [4.0-5.0]; P=.062) or history taking (5.0 [4.0-5.0] vs 5.0 [4.0-5.0]; P=1.0).
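The paired comparison reported above can be reproduced in outline with SciPy: per-item scores from ChatGPT-4 and the physicians for the same students are summarized as medians with 25th and 75th percentiles and compared with the Wilcoxon signed-rank test. The score arrays below are placeholders for illustration only, not the study data.

# Placeholder scores for n=11 students on one item; not the study data.
import numpy as np
from scipy.stats import wilcoxon

physician = np.array([4, 3, 4, 4, 3, 4, 4, 5, 3, 4, 4])
chatgpt4 = np.array([5, 5, 4, 6, 5, 5, 4, 6, 5, 5, 5])

def summarize(scores: np.ndarray) -> str:
    """Median with 25th-75th percentiles, as reported in the abstract."""
    q25, med, q75 = np.percentile(scores, [25, 50, 75])
    return f"{med:.1f} ({q25:.1f}-{q75:.1f})"

stat, p = wilcoxon(physician, chatgpt4)  # paired, nonparametric test
print(f"ChatGPT-4 {summarize(chatgpt4)} vs physicians {summarize(physician)}, P={p:.3f}")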
Conclusions:
This study demonstrated that ChatGPT-4 may evaluate students' competencies in patient care and communication and in history taking as accurately as physicians do in the OSCE. Meanwhile, with the support of ChatGPT-4, physicians could enhance their evaluations by focusing more on physical examination, patient notes, clinical reasoning, and management. Generative AI, such as ChatGPT-4, shows potential as a complementary assessor in the OSCE. Clinical Trial: UMIN Clinical Trials Registry (UMIN000050489).
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.