Accepted for/Published in: JMIR Medical Education

Date Submitted: Aug 21, 2025
Open Peer Review Period: Aug 23, 2025 - Oct 18, 2025
Date Accepted: Dec 14, 2025

The final, peer-reviewed published version of this preprint can be found here:

AI- vs Human-Based Assessment of Medical Interview Transcripts in a Generative AI–Simulated Patient System: Cross-Sectional Validation Study

Takahashi H, Shikino K, Kondo T, Yamada Y, Tomoda Y, Kishi M, Aiyama Y, Nagai S, Enomoto A, Tokushima Y, Shinohara T, Sano F, Matsuura T, Watanabe R, Naito T

JMIR Med Educ 2026;12:e81673

DOI: 10.2196/81673

PMID: 41701946

PMCID: 12912650

Warning: This is an author submission that has not been peer reviewed or edited. Preprints, unless they show as "accepted," should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Comparing AI- and Human-Based Assessments of Medical Interview Transcripts Using a Generative AI Simulated Patient System

  • Hiromizu Takahashi; 
  • Kiyoshi Shikino; 
  • Takeshi Kondo; 
  • Yuji Yamada; 
  • Yoshitaka Tomoda; 
  • Minoru Kishi; 
  • Yuki Aiyama; 
  • Sho Nagai; 
  • Akiko Enomoto; 
  • Yoshinori Tokushima; 
  • Takahiro Shinohara; 
  • Fumiaki Sano; 
  • Takeshi Matsuura; 
  • Rikiya Watanabe; 
  • Toshio Naito

ABSTRACT

Background:

Generative AI is increasingly used in medical education, including AI-based virtual patients for improving interview skills. However, it remains unclear how much AI-based assessment (ABA) differs from human-based assessment (HBA).

Objective:

This study aimed to compare the quality of clinical interview assessments generated by ABA using a virtual patient with those provided by HBA conducted by clinical instructors. Additionally, it evaluated whether the use of AI could measurably reduce evaluation time and examined the level of agreement across participants with differing levels of clinical experience.

Methods:

A standardized leg-weakness case was implemented in an AI-based virtual patient. Seven participants (two medical students, three resident physicians, and two attending physicians) each conducted an interview, and transcripts were scored with the Master Interview Rating Scale (MIRS; 25 items, 0–5 scale; total 0–125). Two evaluation strategies were compared. (1) ChatGPT o1-Pro scored each transcript five times with different random seeds to assess case specificity; total runtime for the five scores was automatically logged. (2) Five blinded clinical instructors, after a preparatory webinar reviewing the rubric and practicing on sample transcripts, each rated every transcript once and recorded clock time per rating. Because the five AI outputs are replicates of the same algorithm, intraclass correlation coefficients (ICC) were used to quantify repeatability rather than interrater reliability. For human raters, we calculated ICC(2,1). Mean scores from both methods were compared, and agreement was quantified with Pearson's r, Lin's concordance correlation coefficient (ρc), Bland–Altman limits of agreement (LoA), internal consistency (Cronbach's α), and ICC. Time efficiency was expressed as mean minutes per transcript and the relative percentage reduction achieved by AI scoring.
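
As an illustration only (not the authors' code), the agreement statistics named above can be computed along the following lines in Python. The score arrays, the five-rater layout, and the choice of the pingouin library are our assumptions; all numbers are hypothetical placeholders, not study data.

```python
# Sketch of the agreement statistics from the Methods; hypothetical data only.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mean MIRS totals for the 7 transcripts under each method
aba = np.array([48.0, 55.0, 44.0, 57.0, 52.0, 60.0, 49.0])  # AI-based
hba = aba + rng.normal(1.0, 2.0, size=aba.size)             # human-based

# Pearson correlation
r, _ = stats.pearsonr(aba, hba)

# Lin's concordance correlation coefficient (rho_c)
cov = np.cov(aba, hba, ddof=1)[0, 1]
ccc = 2 * cov / (aba.var(ddof=1) + hba.var(ddof=1)
                 + (aba.mean() - hba.mean()) ** 2)

# Bland-Altman bias and 95% limits of agreement
diff = hba - aba
bias, sd = diff.mean(), diff.std(ddof=1)
loa = (bias - 1.96 * sd, bias + 1.96 * sd)

# Coefficient of variation (%) per method
def cv(x):
    return 100 * x.std(ddof=1) / x.mean()

# ICC(2,1): two-way random effects, single rater, absolute agreement.
# Long format: one row per (transcript, rater, score), five mock raters.
long = pd.DataFrame(
    [(t, j, aba[t] + rng.normal(0, 3)) for t in range(7) for j in range(5)],
    columns=["transcript", "rater", "score"],
)
icc = pg.intraclass_corr(data=long, targets="transcript",
                         raters="rater", ratings="score")
icc21 = icc.set_index("Type").loc["ICC2", "ICC"]  # "ICC2" row = ICC(2,1)

# Cronbach's alpha would need the 25 item-level scores per transcript
# (wide format, items as columns), e.g. pg.cronbach_alpha(data=item_df).

print(f"r={r:.2f} CCC={ccc:.2f} bias={bias:+.1f} "
      f"LoA=({loa[0]:.1f}, {loa[1]:.1f}) ICC(2,1)={icc21:.2f}")
print(f"CV: ABA {cv(aba):.1f}%  HBA {cv(hba):.1f}%")
```

Note that pingouin reports the full ICC table, so the two-way random, single-rater row must be selected explicitly to match ICC(2,1).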

Results:

Mean interview scores were similar for ABA and HBA (52.1 ± 6.9 vs 53.7 ± 6.8). Agreement was strong (r = 0.92; ρc = 0.92) with minimal bias (+0.4 points; LoA −4.9 to +5.7). ABA showed higher internal consistency (α = 0.936 vs 0.863) and greater interrater reliability (ICC = 0.77 vs 0.38). The coefficient of variation for ABA scores was roughly half that of HBA scores (6.6% vs 13.9%). In addition, ChatGPT completed each five-run analysis in 4.3 ± 1.7 minutes compared with 10.3 ± 3.3 minutes for physicians, representing a 58% reduction in assessment time.
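
As a quick arithmetic check (ours, not the authors'), the 58% figure follows directly from the two reported means:

```python
# Relative reduction in assessment time, from the reported means
ai_min, human_min = 4.3, 10.3                # minutes per transcript
reduction = (human_min - ai_min) / human_min
print(f"{reduction:.0%}")                    # -> 58%, matching the abstract
```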

Conclusions:

ABA scores closely matched HBA scores while demonstrating superior consistency and reliability. In the setting of virtual clinical interview transcripts, these preliminary findings suggest that ABA shows potential as a valid, rapid, and scalable alternative to HBA. When applied strategically, it could furnish timely formative feedback, quantify efficiency gains, and reduce faculty workload. Further research is needed to determine whether this can be achieved without compromising assessment quality.


Citation

Please cite as:

Takahashi H, Shikino K, Kondo T, Yamada Y, Tomoda Y, Kishi M, Aiyama Y, Nagai S, Enomoto A, Tokushima Y, Shinohara T, Sano F, Matsuura T, Watanabe R, Naito T

AI- vs Human-Based Assessment of Medical Interview Transcripts in a Generative AI–Simulated Patient System: Cross-Sectional Validation Study

JMIR Med Educ 2026;12:e81673

DOI: 10.2196/81673

PMID: 41701946

PMCID: 12912650

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.