
Accepted for/Published in: JMIR Formative Research

Date Submitted: Jun 21, 2025
Open Peer Review Period: Jun 23, 2025 - Aug 18, 2025
Date Accepted: Nov 25, 2025

The final, peer-reviewed published version of this preprint can be found here:

The Validity of Generative Artificial Intelligence in Evaluating Medical Students in Objective Structured Clinical Examination: Experimental Study

Yokose M, Hirosawa T, Sakamoto T, Kawamura R, Suzuki Y, Harada Y, Shimizu T


JMIR Form Res 2025;9:e79465

DOI: 10.2196/79465

PMID: 41343812

PMCID: 12715467

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

The Validity of Generative Artificial Intelligence in Evaluating Medical Students in Objective Structured Clinical Examination: A Pilot Study

  • Masashi Yokose; 
  • Takanobu Hirosawa; 
  • Tetsu Sakamoto; 
  • Ren Kawamura; 
  • Yudai Suzuki; 
  • Yukinori Harada; 
  • Taro Shimizu

ABSTRACT

Background:

The Objective Structured Clinical Examination (OSCE) is widely used to evaluate students in medical education. However, the OSCE requires extensive human resources, posing significant challenges to its implementation. We hypothesized that generative artificial intelligence (AI), such as ChatGPT-4, could serve as a supplemental assessor and alleviate the burden on physicians in the OSCE.

Objective:

This experimental study aims to evaluate the validity of generative AI in evaluating medical students in the OSCE.

Methods:

This study was conducted at a medical university in Japan. We recruited 11 fifth-year medical students during the general internal medicine clerkship from April 2023 to December 2023. Participants conducted mock medical interviews with a patient presenting with abdominal pain and wrote patient notes. Four physicians independently evaluated participants by reviewing the medical interview videos and patient notes according to a prespecified evaluation format. We manually transcribed the recorded interviews and input the transcriptions, along with the patient notes and structured prompts, into ChatGPT-4 for evaluation. All inputs and outputs from ChatGPT-4 were in Japanese. The evaluation format consisted of the following six items: patient care and communication, history taking, physical examination, patient notes, clinical reasoning, and management. Each item was scored on a six-point Likert scale, ranging from 1 (very poor) to 6 (excellent). The evaluation scores from physicians and ChatGPT-4 were presented as medians with 25th and 75th percentiles and were compared using the Wilcoxon signed-rank test. P values less than .05 were considered statistically significant.
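As a minimal standard-library sketch of the analysis described above, the paired physician vs ChatGPT-4 ratings for one item can be summarized as median (25th-75th percentile) and compared with a Wilcoxon signed-rank test. The scores below are hypothetical (not the study's data), and the test is hand-rolled using the large-sample normal approximation without tie or continuity corrections; in practice one would use a statistical package such as `scipy.stats.wilcoxon`.

```python
# Hypothetical paired OSCE item scores for 11 students on a 6-point
# Likert scale -- illustrative only, not the study's results.
import math
from statistics import median, quantiles

physician = [4, 3, 4, 4, 3, 4, 5, 4, 3, 4, 4]  # hypothetical ratings
chatgpt   = [5, 4, 5, 4, 4, 5, 5, 5, 4, 5, 4]  # hypothetical ratings

def summarize(scores):
    """Median (25th-75th percentile), the format used in the abstract."""
    q1, _, q3 = quantiles(scores, n=4)
    return f"{median(scores):.1f} ({q1:.1f}-{q3:.1f})"

def wilcoxon_signed_rank(x, y):
    """Two-sided P value via the large-sample normal approximation."""
    diffs = [b - a for a, b in zip(x, y) if b != a]  # drop zero differences
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print("Physicians:", summarize(physician))
print("ChatGPT-4: ", summarize(chatgpt))
p = wilcoxon_signed_rank(physician, chatgpt)
print(f"Wilcoxon signed-rank P = {p:.3f}")
```

With these made-up scores, ChatGPT-4 rates systematically higher than the physicians, and the paired test flags the difference, mirroring the pattern the abstract reports for items such as patient notes and management.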

Results:

ChatGPT-4 assigned significantly higher scores than physicians for physical examination (4.0 (4.0-5.0) vs 4.0 (3.0-4.0), P=.015), patient notes (6.0 (5.0-6.0) vs 4.0 (4.0-4.0), P=.002), clinical reasoning (5.0 (5.0-5.0) vs 4.0 (3.0-4.0), P<.001), and management (6.0 (5.0-6.0) vs 4.0 (2.5-4.5), P=.002). There were no significant differences in the scores for patient care and communication (5.0 (5.0-5.0) vs 5.0 (4.0-5.0), P=.062) or history taking (5.0 (4.0-5.0) vs 5.0 (4.0-5.0), P=1.0).

Conclusions:

This study demonstrated that ChatGPT-4 may evaluate students' competencies in patient care and communication and in history taking as accurately as physicians do in the OSCE. Meanwhile, with the support of ChatGPT-4, physicians could enhance their evaluations by focusing more on physical examination, patient notes, clinical reasoning, and management in the OSCE. Generative AI, like ChatGPT-4, shows potential as a complementary assessor in the OSCE. Clinical Trial: UMIN Clinical Trials Registry (UMIN000050489).




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.