Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessment of Large Language Model Performance on Virtual Patient Scenarios: Mixed Methods Study
ABSTRACT
Background:
Generative AI is increasingly being explored in medical education as a tool to enhance clinical reasoning and support interactive learning. However, few studies to date have evaluated how such models perform in educational settings. This study examines the effectiveness of ChatGPT in simulating clinical decision-making through virtual patient (VP) interactions.
Objective:
This study aims to evaluate and compare the accuracy of ChatGPT-3.5 and ChatGPT-4 in solving VP scenarios across medical specialties, and to explore their strengths, limitations, and educational implications.
Methods:
A total of 64 VP scenarios covering paediatric, adult, disease management, and oncology cases were tested using ChatGPT-3.5 and ChatGPT-4 within the MobiViP mobile platform. Responses were classified as correct, incorrect, or inadequate. Success rates were calculated using descriptive and inferential statistics. Inadequate responses were also analyzed thematically.
Results:
ChatGPT-4 significantly outperformed ChatGPT-3.5 across all categories (median success rate: 92.55% vs 78.68%, p<.001). GPT-4 showed higher reliability particularly in complex scenarios such as oncology and disease management. GPT-3.5 generated a greater number and variety of inadequate responses, including navigation errors and irrelevant outputs.
Conclusions:
This study highlights the potential of generative AI to complement traditional medical education, particularly in fostering clinical reasoning through case-based learning. While both models show promise in supporting clinical reasoning education, GPT-4 provides significantly more accurate and contextually appropriate outputs. Nonetheless, both versions can produce erroneous or confusing responses, underscoring the importance of guided implementation. Furthermore, while generative AI holds potential for scenario creation, caution is warranted given the risks of bias, inaccuracy, and pedagogical misalignment. Educators should emphasize AI literacy among users and integrate these tools thoughtfully to support, rather than replace, clinical training.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.