Accepted for/Published in: JMIR Medical Education
Date Submitted: Aug 21, 2025
Open Peer Review Period: Aug 23, 2025 - Oct 18, 2025
Date Accepted: Dec 14, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Comparing AI- and Human-Based Assessments of Medical Interview Transcripts Using a Generative AI Simulated Patient System.
ABSTRACT
Background:
Generative AI is increasingly used in medical education, including AI-based virtual patients for improving interview skills. However, it remains unclear how much AI-based assessment (ABA) differs from human-based assessment (HBA).
Objective:
This study aimed to compare the quality of clinical interview assessments generated by ABA using a virtual patient with HBA conducted by clinical instructors. It also evaluated whether the use of AI could measurably reduce evaluation time, and examined the level of agreement across participants with differing levels of clinical experience.
Methods:
A standardized leg-weakness case was implemented in an AI-based virtual patient. Seven participants—two medical students, three resident physicians, and two attending physicians—each conducted an interview, and transcripts were scored with the Master Interview Rating Scale (MIRS; 25 items, 0–5 scale; total 0–125). Two evaluation strategies were compared. (1) ChatGPT o1-Pro scored each transcript five times with different random seeds to assess case specificity; total runtime for the five scores was automatically logged. (2) Five blinded clinical instructors, after a preparatory webinar reviewing the rubric and practicing on sample transcripts, each rated every transcript once and recorded clock time per rating. Because the five AI outputs are replicates of the same algorithm, intraclass correlation coefficients (ICC) were used to quantify repeatability rather than inter-rater reliability. For human raters, we calculated ICC(2,1). Mean scores from both methods were compared, and agreement was quantified with Pearson's r, Lin's concordance correlation coefficient (ρc), Bland–Altman limits of agreement (LoA), internal consistency (Cronbach's α), and ICC. Time efficiency was expressed as mean minutes per transcript and the relative percentage reduction achieved by AI scoring.
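The agreement statistics named above can be computed directly from paired mean scores. The following is a minimal illustrative sketch (not the authors' actual analysis pipeline) of Lin's concordance correlation coefficient and Bland–Altman limits of agreement; the example score arrays are hypothetical.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired scores.

    ρc = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²),
    using population (biased) variances and covariance.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

def bland_altman_loa(x, y):
    """Bias (mean difference) and 95% limits of agreement (bias ± 1.96·SD)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    bias = d.mean()
    sd = d.std(ddof=1)  # sample SD of the paired differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical per-transcript mean scores for two rating methods
aba = np.array([50.0, 55.0, 60.0, 45.0])
hba = np.array([52.0, 54.0, 58.0, 47.0])
ccc = lins_ccc(aba, hba)
bias, (lo, hi) = bland_altman_loa(aba, hba)
```

Perfect agreement gives ρc = 1 and zero bias with collapsed limits; systematic offsets between raters shrink ρc even when Pearson's r stays high, which is why the abstract reports both.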
Results:
Mean interview scores were similar for ABA and HBA (52.1 ± 6.9 vs 53.7 ± 6.8). Agreement was strong (r = 0.92; ρc = 0.92) with minimal bias (+0.4 points; LoA −4.9 to +5.7). ABA showed higher internal consistency (α = 0.936 vs 0.863) and greater inter-rater reliability (ICC = 0.77 vs 0.38). The coefficient of variation for ABA scores was roughly half that of HBA scores (6.6% vs 13.9%). In addition, ChatGPT completed each five-run analysis in 4.3 ± 1.7 minutes compared with 10.3 ± 3.3 minutes for physicians, representing a 58% reduction in assessment time.
Conclusions:
ABA scores closely matched HBA scores while demonstrating superior consistency and reliability. In the setting of virtual clinical interview transcripts, these preliminary findings suggest that ABA has potential as a valid, rapid, and scalable alternative to HBA. Applied strategically, it could furnish timely formative feedback, quantify efficiency gains, and reduce faculty workload; further research is needed to determine whether this can be achieved without compromising assessment quality.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.