Currently submitted to: JMIR Formative Research
Date Submitted: Jan 22, 2026
Open Peer Review Period: Jan 23, 2026 - Mar 20, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Validity of Reasoning Generative Artificial Intelligence Models in Evaluating Japanese Objective Structured Clinical Examinations: A Preliminary Comparative Study with Clinical Educators
ABSTRACT
Background:
Medical interview training is a cornerstone of clinical education but faces resource limitations in both implementation and evaluation. While Generative Artificial Intelligence (GAI) offers a potential solution for assessment, it remains unclear whether reasoning models improve evaluation validity, particularly in the Japanese language context.
Objective:
To evaluate the validity of state-of-the-art reasoning GAI models as evaluators of Japanese medical interview training, we assessed their scoring patterns and agreement with human clinical educators.
Methods:
This preliminary comparative study was conducted at a medical university in Japan using text data derived from medical interview training, including both chatbot-based and traditional formats, with postgraduate year 1 and 2 residents as participants. Two blinded human clinical educators independently evaluated the transcripts and then reached a consensus score through discussion; this consensus score served as the reference standard. Two GAI models, GPT-5.2 Thinking and Gemini 3.0 Pro, independently evaluated the same transcripts. All evaluations used a standardized 6-domain Objective Structured Clinical Examination rubric (patient care, history taking, physical examination, accuracy and organization of clinical information, clinical reasoning, and management), with each domain scored on a 1–6 Likert scale (1 = inferior, 6 = excellent). We compared mean evaluation scores using the Wilcoxon signed-rank test and assessed inter-rater reliability between each GAI model and the clinical educators using Intraclass Correlation Coefficients (ICCs).
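As a minimal sketch of this analysis pipeline (not the authors' actual code: the simulated scores, variable names, and the choice of ICC(2,1) as the agreement form are illustrative assumptions), the paired comparison and agreement statistics could be computed in Python as follows:

```python
# Sketch of the statistical comparison described above.
# Assumptions (not from the study): simulated per-transcript mean
# scores on the 1-6 rubric scale, and ICC(2,1) (two-way random
# effects, absolute agreement, single rater) as the agreement index.
import numpy as np
import pandas as pd
from scipy.stats import wilcoxon
import pingouin as pg

rng = np.random.default_rng(0)
n = 40  # number of transcripts

# Hypothetical paired scores: educator reference vs. one GAI model.
educator = np.clip(rng.normal(5.2, 0.4, n), 1, 6)
gai = np.clip(rng.normal(3.7, 0.3, n), 1, 6)

# Paired, non-parametric comparison of mean evaluation scores.
stat, p = wilcoxon(educator, gai)
print(f"Wilcoxon signed-rank: W={stat:.1f}, P={p:.3g}")

# Inter-rater reliability: long format, one row per (transcript, rater).
df = pd.DataFrame({
    "transcript": np.tile(np.arange(n), 2),
    "rater": ["educator"] * n + ["gai"] * n,
    "score": np.concatenate([educator, gai]),
})
icc = pg.intraclass_corr(data=df, targets="transcript",
                         raters="rater", ratings="score")
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```

In practice the same procedure would be run once per GAI model against the educators' consensus scores, overall and per rubric domain.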
Results:
Clinical educators and both GAI models rated all 40 transcripts from the 20 included residents. Clinical educators assigned the highest overall mean scores (5.18, 95% CI 5.06-5.30). Compared with the clinical educators, both GAI models demonstrated significant score deflation: GPT-5.2 Thinking assigned the lowest overall score (3.68, 95% CI 3.62-3.72; P<.001), followed by Gemini 3.0 Pro (4.09, 95% CI 3.97-4.21; P<.001). This discrepancy was most pronounced in the management domain, where GPT-5.2 Thinking assigned 2.93 (95% CI 2.79-3.06) compared with the clinical educators' 5.20 (95% CI 4.91-5.49). Agreement between the GAI models and human raters was poor across all domains, with overall ICCs of 0.04 (95% CI 0.00-0.09) for GPT-5.2 Thinking and 0.22 (95% CI 0.10-0.35) for Gemini 3.0 Pro.
Conclusions:
Unlike previous iterations of GAI, which tended to overestimate student performance, GPT-5.2 Thinking and Gemini 3.0 Pro graded more strictly than human experts. Given the significant score discrepancies and poor inter-rater agreement, these models currently lack the validity to serve as standalone summative evaluators for Japanese Objective Structured Clinical Examinations, although their rigorous detection of deficiencies may offer value for formative feedback. Trial Registration: UMIN-CTR UMIN000053747; https://center6.umin.ac.jp/cgi-open-bin/ctr_e/ctr_view.cgi?recptno=R000061336.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.