Accepted for/Published in: JMIR Medical Education
Date Submitted: Aug 21, 2025
Open Peer Review Period: Aug 23, 2025 - Oct 18, 2025
Date Accepted: Dec 14, 2025
AI- Versus Human-Based Assessment of Medical Interview Transcripts in a Generative AI Simulated Patient System: Validation Study
ABSTRACT
Background:
Generative artificial intelligence (AI) is increasingly used in medical education, including AI-based virtual patients for improving interview skills. However, the extent to which AI-based assessment (ABA) differs from human-based assessment (HBA) remains unclear.
Objective:
This study aimed to compare the quality of clinical interview assessments generated by ABA, using ChatGPT-o1 Pro (ABA-o1) and ChatGPT-5 Pro (ABA-5), with HBA performed by clinical instructors in an AI-based virtual patient setting. We also examined whether AI reduced evaluation time and assessed agreement across participants with different levels of clinical experience.
Methods:
A standardized case of leg weakness was implemented in an AI-based virtual patient. Seven participants (two medical students, three residents, and two attending physicians) each conducted an interview with the AI patient, and the transcripts were scored using the 25-item Master Interview Rating Scale (score range 0-125). Three evaluation strategies were compared. First, ChatGPT-o1 Pro and ChatGPT-5 Pro each scored every transcript five times with different random seeds to test repeatability; processing time was logged automatically. Second, five blinded clinical instructors, after completing a webinar to standardize scoring, independently rated each transcript once using the same rubric. Third, reliability and agreement metrics were applied: for AI, intraclass correlation coefficients (ICCs) quantified repeatability; for humans, ICC(2,1) was calculated. Agreement between methods was quantified with the Pearson r, Lin concordance correlation coefficient (CCC), Bland-Altman limits of agreement (LoA), Cronbach α, and ICC. Time efficiency was expressed as mean minutes per transcript and relative percentage reduction.
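For readers wishing to reproduce these agreement statistics, the following is a minimal sketch, not the authors' analysis code: it computes the Pearson r, Lin CCC, Bland-Altman bias and LoA, Cronbach α, and ICC on placeholder scores using numpy, pandas, scipy, and pingouin. All data values and variable names are illustrative.

```python
# Minimal sketch (not the study's analysis code) of the agreement metrics
# named above, computed on hypothetical paired scores. Assumes numpy,
# pandas, scipy, and pingouin are installed; all score values are placeholders.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

# Hypothetical mean MIRS totals per transcript (7 interviews, 0-125 scale)
aba = np.array([48.0, 55.0, 60.5, 45.5, 52.0, 58.0, 46.0])  # AI-based scores
hba = np.array([49.0, 56.5, 59.0, 47.0, 53.5, 57.0, 48.5])  # human-based scores

# Pearson correlation
r, _ = pearsonr(aba, hba)

# Lin CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)
s_xy = np.cov(aba, hba, ddof=1)[0, 1]
ccc = 2 * s_xy / (aba.var(ddof=1) + hba.var(ddof=1) + (aba.mean() - hba.mean()) ** 2)

# Bland-Altman bias and 95% limits of agreement
diff = aba - hba
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

# Cronbach alpha, treating the two methods as "items"
alpha, _ = pg.cronbach_alpha(data=pd.DataFrame({"ABA": aba, "HBA": hba}))

# ICC from a long-format table of method x transcript scores
long = pd.DataFrame({
    "transcript": np.tile(np.arange(len(aba)), 2),
    "method": ["ABA"] * len(aba) + ["HBA"] * len(hba),
    "score": np.concatenate([aba, hba]),
})
icc = pg.intraclass_corr(data=long, targets="transcript",
                         raters="method", ratings="score")

print(f"r={r:.2f}  CCC={ccc:.2f}  bias={bias:.2f}  "
      f"LoA=({loa[0]:.1f}, {loa[1]:.1f})  alpha={alpha:.2f}")
print(icc[["Type", "ICC"]])
```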
Results:
Mean interview scores were similar across methods: ABA-o1 52.1 (SD 6.9), ABA-5 53.2 (SD 6.8), HBA 53.7 (SD 6.8). Agreement with HBA was strong (r=0.90; CCC 0.88) with minimal bias (mean bias: ABA-o1 0.4, ABA-5 1.5; LoA: ABA-o1 –4.9 to 5.7, ABA-5 –8.6 to 11.7). Cronbach α was 0.81 (ABA-o1), 0.83 (ABA-5), and 0.80 (HBA); ICC(3,1) 0.77 (ABA-o1) and 0.82 (ABA-5); ICC(2,1) 0.38 (HBA). The coefficient of variation for ABA was about half that of HBA (6.6% vs 13.9%). Processing time for five runs was 4 min 19 s (ABA-o1) and 3 min 20 s (ABA-5) vs 10 min 16 s for physicians.
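On the consistency comparison, a per-transcript coefficient of variation (CV = SD/mean) can be computed across the five AI runs and, separately, across the five human raters. The sketch below uses randomly generated placeholder matrices, not study data:

```python
# Hypothetical 5 x 7 score matrices (rows = repeated AI runs or human raters,
# columns = the seven transcripts); all numbers are placeholders.
import numpy as np

aba_runs = np.random.default_rng(0).normal(52, 3, size=(5, 7))    # five AI runs
hba_raters = np.random.default_rng(1).normal(54, 7, size=(5, 7))  # five raters

def mean_cv(scores: np.ndarray) -> float:
    """Mean per-transcript coefficient of variation, in percent."""
    cv = scores.std(axis=0, ddof=1) / scores.mean(axis=0) * 100
    return float(cv.mean())

print(f"ABA CV: {mean_cv(aba_runs):.1f}%  HBA CV: {mean_cv(hba_raters):.1f}%")
```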
Conclusions:
ABA-o1 and ABA-5 produced scores closely matching HBA while demonstrating superior consistency and reliability. For assessments of virtual patient interview transcripts, these findings suggest that ABA may serve as a valid, rapid, and scalable alternative to HBA, reducing per-assessment time by more than half. Applied strategically, AI-based scoring could enable timely feedback, improve efficiency, and reduce faculty workload. Further research is needed to confirm generalizability across broader settings.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.