Accepted for/Published in: JMIR Formative Research
Date Submitted: Jun 21, 2025
Open Peer Review Period: Jun 23, 2025 - Aug 18, 2025
Date Accepted: Nov 25, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
The Validity of Generative Artificial Intelligence in Evaluating Medical Students in Objective Structured Clinical Examination: A Pilot Study
ABSTRACT
Background:
The Objective Structured Clinical Examination (OSCE) is widely used to evaluate students in medical education. However, the OSCE requires extensive human resources, which poses significant challenges to its implementation. We hypothesized that generative artificial intelligence (AI), such as ChatGPT-4, could serve as a supplemental assessor and alleviate the burden on physicians in the OSCE.
Objective:
This experimental study aims to evaluate the validity of generative AI in evaluating medical students in the OSCE.
Methods:
This study was conducted at a medical university in Japan. We recruited 11 fifth-year medical students during the general internal medicine clerkship from April 2023 to December 2023. Participants conducted mock medical interviews with a patient presenting with abdominal pain and wrote patient notes. Four physicians independently evaluated the participants by reviewing the recorded interview videos and patient notes according to a prespecified evaluation format. We manually transcribed the recorded interviews and input the transcriptions, along with the patient notes and structured prompts, into ChatGPT-4 for evaluation. All inputs and outputs from ChatGPT-4 were in Japanese. The evaluation format consisted of six items: patient care and communication, history taking, physical examination, patient notes, clinical reasoning, and management. Each item was scored on a 6-point Likert scale, ranging from 1 (very poor) to 6 (excellent). Physician and ChatGPT-4 scores were presented as medians with 25th and 75th percentiles and compared using the Wilcoxon signed-rank test, with P<.05 considered statistically significant.
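As a rough illustration of the pipeline described above, the following Python sketch submits one transcript and patient note to a GPT-4-class model through the OpenAI API and requests item-level scores. The prompt wording, model name, and output convention are illustrative assumptions only; the study's actual structured prompts were in Japanese and are not reproduced here.

# Hypothetical sketch, not the authors' code: prompt text, model name, and
# output format are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ITEMS = [
    "patient care and communication",
    "history taking",
    "physical examination",
    "patient notes",
    "clinical reasoning",
    "management",
]

def score_station(transcript: str, patient_note: str) -> str:
    """Ask the model to rate one OSCE encounter on the six items (1-6)."""
    prompt = (
        "You are an OSCE examiner. Using the interview transcript and "
        "patient note below, rate the student on each item from 1 (very "
        "poor) to 6 (excellent). Return one 'item: score' line per item.\n"
        f"Items: {', '.join(ITEMS)}\n\n"
        f"Transcript:\n{transcript}\n\nPatient note:\n{patient_note}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the ChatGPT-4 version used in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return response.choices[0].message.content

In practice, each of the 11 encounters would be scored this way and the returned item scores parsed into a table alongside the physicians' ratings.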
Results:
ChatGPT-4 assigned significantly higher scores than physicians for physical examination (median 4.0 [25th-75th percentile 4.0-5.0] vs 4.0 [3.0-4.0]; P=.015), patient notes (6.0 [5.0-6.0] vs 4.0 [4.0-4.0]; P=.002), clinical reasoning (5.0 [5.0-5.0] vs 4.0 [3.0-4.0]; P<.001), and management (6.0 [5.0-6.0] vs 4.0 [2.5-4.5]; P=.002), whereas there were no significant differences for patient care and communication (5.0 [5.0-5.0] vs 5.0 [4.0-5.0]; P=.062) or history taking (5.0 [4.0-5.0] vs 5.0 [4.0-5.0]; P=1.0).
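The paired comparison reported above can be reproduced in outline with SciPy: per-item scores from ChatGPT-4 and the physicians for the same students are summarized as medians with 25th and 75th percentiles and compared with the Wilcoxon signed-rank test. The score arrays below are placeholders for illustration only, not the study data.

# Placeholder scores for n=11 students on one item; not the study data.
import numpy as np
from scipy.stats import wilcoxon

physician = np.array([4, 3, 4, 4, 3, 4, 4, 5, 3, 4, 4])
chatgpt4 = np.array([5, 5, 4, 6, 5, 5, 4, 6, 5, 5, 5])

def summarize(scores: np.ndarray) -> str:
    """Median with 25th-75th percentiles, as reported in the abstract."""
    q25, med, q75 = np.percentile(scores, [25, 50, 75])
    return f"{med:.1f} ({q25:.1f}-{q75:.1f})"

stat, p = wilcoxon(physician, chatgpt4)  # paired, nonparametric test
print(f"ChatGPT-4 {summarize(chatgpt4)} vs physicians {summarize(physician)}, P={p:.3f}")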
Conclusions:
This study demonstrated that ChatGPT-4 may evaluate students' competencies in patient care and communication and in history taking as accurately as physicians do in the OSCE. Meanwhile, with the support of ChatGPT-4, physicians could enhance their evaluations by focusing more on physical examination, patient notes, clinical reasoning, and management. Generative AI, such as ChatGPT-4, shows potential as a complementary assessor in the OSCE. Clinical Trial: UMIN Clinical Trials Registry (UMIN000050489).
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.