Accepted for/Published in: JMIR Medical Education
Date Submitted: Sep 18, 2025
Open Peer Review Period: Sep 24, 2025 - Nov 19, 2025
Date Accepted: Mar 9, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessing Students’ Clinical Reasoning Skills in History Taking with Large Language Model–Based Virtual Patients: Development and Validation of a Structured Coding Scheme Using Systematic Text Condensation
ABSTRACT
Background:
Large language model (LLM)–driven virtual patients (VPs) are increasingly used to simulate history taking. However, there is currently no straightforward methodological approach for identifying students’ clinical reasoning activities during these interactions, which limits the ability to provide personalised feedback.
Objective:
This study aims to develop a structured coding scheme to characterise medical students’ behaviours during interactions with LLM-driven VPs.
Methods:
Second-year medical students (N=210) completed text-based history-taking sessions across five simulated chest pain cases, yielding 1,030 dialogues. Dialogues from Cases 1–4 were analysed using systematic text condensation (STC) to develop a coding scheme inductively. Two raters independently coded a subset of dialogues, and inter-coder reliability was assessed using Cohen’s kappa. The established scheme was then applied to the dialogues from Case 5, and Pearson correlation coefficients (r) were used to assess associations between code frequencies and external performance outcomes: diagnostic accuracy, history-taking checklist scores, clinical knowledge test scores, and post-encounter form (PEF) scores.
Results:
The STC analysis produced a 12-code scheme comprising four clinical reasoning codes (Pathophysiologic Question, Relevant Response, Summarising & Integrating, Logical Organisation), six information-gathering codes, and two communication codes. Inter-coder reliability was high across all dimensions: clinical reasoning (κ=1.00), information gathering (κ=0.95-0.98), and communication (κ=1.00). In Case 5, Summarising & Integrating was the most predictive code, correlating with diagnostic accuracy (χ²=6.019, P=.014), checklist scores (r=0.208, P=.003), knowledge test scores (r=0.225, P=.002), and PEF scores (r=0.191, P=.009). Logical Organisation also correlated with diagnostic accuracy (χ²=0.188, P=.008), checklist scores (r=0.592, P<.001), and knowledge test scores (r=0.170, P=.013). Pathophysiologic Question showed weaker but significant associations with checklist scores (r=0.177, P=.013) and knowledge test scores (r=0.145, P=.042). Only two information-gathering codes demonstrated weak-to-moderate associations with checklist and knowledge test scores, and only one communication code showed a weak association with knowledge test scores.
Conclusions:
This study developed a theory-informed coding scheme that reliably distinguishes information-gathering and reasoning behaviours in history taking with virtual patients. By enabling the identification of these diverse behaviours, the scheme provides a foundation for formative assessment and personalised feedback, offering a scalable approach to supporting the development of clinical reasoning in medical students.
Clinical Trial: NO
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.