Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 25, 2025
Date Accepted: Nov 30, 2025
Large language model-based patient simulation to foster communication skills in healthcare professionals: User-centered development and usability study
ABSTRACT
Background:
Case-based learning using standardized patients is a key method for teaching communication skills in medicine, but it faces logistical and financial hurdles. While Large Language Models (LLMs) show promise for creating scalable patient simulations, current research often overlooks user-centered design and direct comparison of different LLMs.
Objective:
To describe the user-centered design process and system architecture of a digital tool that leverages LLMs to simulate patient conversations for medical education, focusing specifically on taking a medical history. A further objective is to examine differences between various LLMs in their ability to simulate patient encounters.
Methods:
We followed a user-centered design process, gathering initial requirements from two medical students. We then developed a fully functional web prototype using a Python Flask backend and a PostgreSQL database, integrating five LLMs from OpenAI, Anthropic, and xAI. The system consists of an AI-assisted case vignette generator and a dynamic patient simulator. To evaluate the system, we first conducted a task-based usability test with five medical students, measuring their experience with the standardized System Usability Scale (SUS) and qualitative questions. Second, we conducted a comparative analysis where four practicing physicians evaluated the simulation quality of three models (Grok 3, GPT-4, and Claude 3 Opus) across seven criteria on a 5-point scale.
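To illustrate the dynamic patient simulator described above, the following is a minimal, hypothetical sketch in Python: a case vignette is turned into a system prompt, conversation turns are accumulated, and the actual model call is a pluggable stub. All names here are illustrative assumptions; the actual system uses a Flask backend, a PostgreSQL database, and the providers' real chat APIs.

```python
# Illustrative sketch of the patient-simulator core (hypothetical names).
# The real system forwards the message history to one of the integrated
# LLM APIs (OpenAI, Anthropic, xAI); here the call is stubbed.

class PatientSimulator:
    def __init__(self, vignette: str, llm_call):
        # System prompt instructs the model to stay in the patient role.
        self.messages = [{
            "role": "system",
            "content": ("You are a patient in a medical history-taking "
                        "exercise. Answer only from the patient's "
                        f"perspective. Case vignette: {vignette}")
        }]
        self.llm_call = llm_call  # wrapper around a provider's chat API

    def ask(self, question: str) -> str:
        # Append the student's question, get the simulated patient's
        # reply, and keep both in the running conversation history.
        self.messages.append({"role": "user", "content": question})
        reply = self.llm_call(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Stubbed model call for illustration only.
def fake_llm(messages):
    return f"(simulated patient reply to: {messages[-1]['content']})"
```

In this sketch, swapping `fake_llm` for different provider wrappers is what would allow the comparative evaluation of models described below.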
Results:
Our usability testing yielded a mean SUS score of M = 91.5 (SD = 8.40), indicating "excellent" usability. The students unanimously praised the system's simplicity and intuitive design. However, they consistently identified the lack of a formal conclusion and feedback on their performance as a key weakness, expressing a desire for a "didactic loop" to maximize the learning effect. In our LLM comparison, Grok 3 achieved the highest overall rating (M = 4.25, SD = 0.75), excelling at depicting realistic timelines and responding to follow-up questions. GPT-4 followed with a mean score of M = 4.14 (SD = 0.80), showing strength in symptom coherence but weakness in portraying realistic uncertainty. Claude 3 Opus was rated lowest (M = 3.86, SD = 0.97) and exhibited the most performance variability.
Conclusions:
We successfully developed a highly usable patient simulation tool that serves as a foundation for further development. Our results show that while the tool is effective for communication training, its full potential will only be realized by integrating an automated feedback mechanism to create a complete didactic loop, as requested by users. Based on our evaluation, we recommend Grok 3 as the primary model for medical patient simulations, with GPT-4 as a reliable alternative.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.