Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessment of Large Language Model Performance on Virtual Patient Scenarios: Mixed Methods Study
ABSTRACT
Background:
Generative AI is increasingly being explored in medical education as a tool to enhance clinical reasoning and support interactive learning. However, few studies to date have evaluated how such models perform in educational settings. This study examines the effectiveness of ChatGPT in simulating clinical decision-making through virtual patient (VP) interactions.
Objective:
This study aims to evaluate and compare the accuracy of ChatGPT-3.5 and ChatGPT-4 in solving VP scenarios across medical specialties, and to explore their strengths, limitations, and educational implications.
Methods:
A total of 64 VP scenarios covering paediatric, adult, disease management, and oncology cases were tested using ChatGPT-3.5 and ChatGPT-4 within the MobiViP mobile platform. Responses were classified as correct, incorrect, or inadequate. Success rates were calculated using descriptive and inferential statistics. Inadequate responses were also analyzed thematically.
Results:
ChatGPT-4 significantly outperformed ChatGPT-3.5 across all categories (median success rate: 92.55% vs 78.68%, p<.001). GPT-4 showed higher reliability particularly in complex scenarios such as oncology and disease management. GPT-3.5 generated a greater number and variety of inadequate responses, including navigation errors and irrelevant outputs.
Conclusions:
This study highlights the potential of generative AI to complement traditional medical education, particularly in fostering clinical reasoning through case-based learning. While both models show promise in supporting clinical reasoning education, GPT-4 provides significantly more accurate and contextually appropriate outputs. Nonetheless, both versions can produce erroneous or confusing responses, underscoring the importance of guided implementation. Furthermore, while generative AI holds potential for scenario creation, caution is warranted given the risks of bias, inaccuracy, and pedagogical misalignment. Educators should emphasize AI literacy among users and integrate these tools thoughtfully to support, rather than replace, clinical training.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.