Currently submitted to: JMIR Formative Research
Date Submitted: Jan 22, 2026
Open Peer Review Period: Jan 23, 2026 - Mar 20, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Validity of Reasoning Generative Artificial Intelligence Models in Evaluating Japanese Objective Structured Clinical Examinations: A Preliminary Comparative Study with Clinical Educators
ABSTRACT
Background:
Medical interview training is a cornerstone of clinical education but faces resource limitations in both implementation and evaluation. While Generative Artificial Intelligence (GAI) offers a potential solution for assessment, it remains unclear whether reasoning models improve evaluation validity, particularly in the Japanese language context.
Objective:
To evaluate the validity of state-of-the-art reasoning GAI models as evaluators of Japanese medical interview training, we assessed their scoring patterns and agreement with human clinical educators.
Methods:
This preliminary comparative study was conducted at a medical university in Japan using text data derived from medical interview training, including both chatbot-based and traditional formats, with postgraduate year 1 and 2 residents as participants. Two blinded human clinical educators independently evaluated the transcripts and then reached a consensus score through discussion; this consensus score served as the reference standard. Two GAI models, GPT-5.2 Thinking and Gemini 3.0 Pro, independently evaluated the same transcripts. All evaluations used a standardized 6-domain Objective Structured Clinical Examination rubric (patient care, history taking, physical examination, accuracy and organization of clinical information, clinical reasoning, and management), with each domain scored on a 1–6 Likert scale (1 = inferior, 6 = excellent). We compared mean evaluation scores using the Wilcoxon signed-rank test and assessed inter-rater reliability between each GAI model and the clinical educators using Intraclass Correlation Coefficients (ICCs).
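As a minimal sketch of this analysis pipeline (not the authors' actual code: the simulated scores, variable names, and the choice of ICC(2,1) as the agreement form are illustrative assumptions), the paired comparison and agreement statistics could be computed in Python as follows:

```python
# Sketch of the statistical comparison described above.
# Assumptions (not from the study): simulated per-transcript mean
# scores on the 1-6 rubric scale, and ICC(2,1) (two-way random
# effects, absolute agreement, single rater) as the agreement index.
import numpy as np
import pandas as pd
from scipy.stats import wilcoxon
import pingouin as pg

rng = np.random.default_rng(0)
n = 40  # number of transcripts

# Hypothetical paired scores: educator reference vs. one GAI model.
educator = np.clip(rng.normal(5.2, 0.4, n), 1, 6)
gai = np.clip(rng.normal(3.7, 0.3, n), 1, 6)

# Paired, non-parametric comparison of mean evaluation scores.
stat, p = wilcoxon(educator, gai)
print(f"Wilcoxon signed-rank: W={stat:.1f}, P={p:.3g}")

# Inter-rater reliability: long format, one row per (transcript, rater).
df = pd.DataFrame({
    "transcript": np.tile(np.arange(n), 2),
    "rater": ["educator"] * n + ["gai"] * n,
    "score": np.concatenate([educator, gai]),
})
icc = pg.intraclass_corr(data=df, targets="transcript",
                         raters="rater", ratings="score")
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```

In practice the same procedure would be run once per GAI model against the educators' consensus scores, overall and per rubric domain.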
Results:
Clinical educators and both GAI models rated all 40 transcripts from the 20 included residents. Clinical educators assigned the highest overall mean scores (5.18, 95% CI 5.06-5.30). Compared with the clinical educators, both GAI models demonstrated significant score deflation: GPT-5.2 Thinking assigned the lowest overall score (3.68, 95% CI 3.62-3.72; P<.001), followed by Gemini 3.0 Pro (4.09, 95% CI 3.97-4.21; P<.001). This discrepancy was most pronounced in the management domain, where GPT-5.2 Thinking assigned 2.93 (95% CI 2.79-3.06) compared with the clinical educators' 5.20 (95% CI 4.91-5.49). Agreement between the GAI models and human raters was poor across all domains, with overall ICCs of 0.04 (95% CI 0.00-0.09) for GPT-5.2 Thinking and 0.22 (95% CI 0.10-0.35) for Gemini 3.0 Pro.
Conclusions:
Unlike previous iterations of GAI, which tended to overestimate student performance, GPT-5.2 Thinking and Gemini 3.0 Pro graded more strictly than human experts. Given the significant score discrepancies and poor inter-rater agreement, these models currently lack the validity to serve as standalone summative evaluators for Japanese Objective Structured Clinical Examinations, although their rigorous detection of deficiencies may offer value for formative feedback. Trial Registration: UMIN-CTR UMIN000053747; https://center6.umin.ac.jp/cgi-open-bin/ctr_e/ctr_view.cgi?recptno=R000061336.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.