
Accepted for/Published in: JMIR Medical Education

Date Submitted: Aug 21, 2025
Open Peer Review Period: Aug 23, 2025 - Oct 18, 2025
Date Accepted: Dec 14, 2025

The final, peer-reviewed published version of this preprint can be found here:

Takahashi H, Shikino K, Kondo T, Yamada Y, Tomoda Y, Kishi M, Aiyama Y, Nagai S, Enomoto A, Tokushima Y, Shinohara T, Sano F, Matsuura T, Watanabe R, Naito T

AI- vs Human-Based Assessment of Medical Interview Transcripts in a Generative AI–Simulated Patient System: Cross-Sectional Validation Study

JMIR Med Educ 2026;12:e81673

DOI: 10.2196/81673

PMID: 41701946

PMCID: 12912650

AI- Versus Human-Based Assessment of Medical Interview Transcripts in a Generative AI Simulated Patient System: Validation Study

  • Hiromizu Takahashi; 
  • Kiyoshi Shikino; 
  • Takeshi Kondo; 
  • Yuji Yamada; 
  • Yoshitaka Tomoda; 
  • Minoru Kishi; 
  • Yuki Aiyama; 
  • Sho Nagai; 
  • Akiko Enomoto; 
  • Yoshinori Tokushima; 
  • Takahiro Shinohara; 
  • Fumiaki Sano; 
  • Takeshi Matsuura; 
  • Rikiya Watanabe; 
  • Toshio Naito

ABSTRACT

Background:

Generative artificial intelligence (AI) is increasingly used in medical education, including AI-based virtual patients for improving interview skills. However, the extent to which AI-based assessment (ABA) differs from human-based assessment (HBA) remains unclear.

Objective:

This study aimed to compare the quality of clinical interview assessments generated by ABA, using ChatGPT-o1 Pro (ABA-o1) and ChatGPT-5 Pro (ABA-5), with HBA provided by clinical instructors in an AI-based virtual patient setting. We also examined whether AI reduced evaluation time and assessed agreement across participants with different levels of clinical experience.

Methods:

A standardized case of leg weakness was implemented in an AI-based virtual patient. Seven participants (two medical students, three residents, and two attending physicians) each conducted an interview with the AI patient, and transcripts were scored using the 25-item Master Interview Rating Scale (0–125). Three evaluation strategies were compared. First, ChatGPT-o1 Pro and ChatGPT-5 Pro scored each transcript five times with different random seeds to test case specificity; processing time was logged automatically. Second, five blinded clinical instructors independently rated each transcript once using the same rubric, after completing a webinar to standardize scoring. Third, reliability metrics were applied: for AI, intraclass correlation coefficients (ICCs) quantified repeatability, and for humans, ICC(2,1) was calculated. Agreement was quantified with the Pearson r, Lin concordance correlation coefficient (CCC), Bland-Altman limits of agreement (LoA), Cronbach α, and ICC. Time efficiency was expressed as mean minutes per transcript and relative percentage reduction.
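The agreement metrics named above (Pearson r, Lin CCC, and Bland-Altman LoA) can be illustrated with a short sketch. This is not the authors' analysis code; the score vectors below are hypothetical values on the same 0–125 rubric, used only to show how each statistic is computed from paired AI and human scores.

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for two raters' paired scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()  # population covariance
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def bland_altman_loa(x, y):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 SD of differences)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical transcript scores, one pair per participant (not study data)
aba = [48, 55, 60, 44, 52, 58, 50]  # AI-based assessment
hba = [50, 54, 62, 45, 51, 59, 49]  # human-based assessment (mean of raters)

r = np.corrcoef(aba, hba)[0, 1]      # Pearson r
ccc = lin_ccc(aba, hba)              # Lin CCC
bias, lo, hi = bland_altman_loa(aba, hba)
```

CCC penalizes both scatter around the identity line and any systematic offset between raters, which is why it is reported alongside Pearson r; the Bland-Altman bias and LoA express the same offset in rubric points.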

Results:

Mean interview scores were similar across methods: ABA-o1 52.1 (SD 6.9), ABA-5 53.2 (SD 6.8), HBA 53.7 (SD 6.8). Agreement with HBA was strong (r=0.90; CCC 0.88) with minimal bias (mean bias: ABA-o1 0.4, ABA-5 1.5; LoA: ABA-o1 –4.9 to 5.7, ABA-5 –8.6 to 11.7). Cronbach α was 0.81 (ABA-o1), 0.83 (ABA-5), and 0.80 (HBA); ICC(3,1) 0.77 (ABA-o1) and 0.82 (ABA-5); ICC(2,1) 0.38 (HBA). The coefficient of variation for ABA was about half that of HBA (6.6% vs 13.9%). Processing time for five runs was 4 min 19 s (ABA-o1) and 3 min 20 s (ABA-5) vs 10 min 16 s for physicians.

Conclusions:

ABA-o1 and ABA-5 produced scores closely matching HBA while demonstrating superior consistency and reliability. In the setting of virtual interview transcripts, these findings suggest that ABA may serve as a valid, rapid, and scalable alternative to HBA, reducing per-assessment time by over half. Applied strategically, AI-based scoring could enable timely feedback, improve efficiency, and reduce faculty workload. Further research is needed to confirm generalizability across broader settings.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.