Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 14, 2025
Open Peer Review Period: Jul 4, 2025 - Aug 29, 2025
Date Accepted: Oct 23, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
LLM-based Virtual Patient Systems for History-Taking in Medical Education: A Comprehensive Systematic Review
ABSTRACT
Background:
Large language models (LLMs) such as GPT-3.5 and GPT-4 are transforming virtual patient systems in medical education, offering scalable, cost-effective alternatives to standardized patients. However, systematic evaluations of their performance and limitations remain scarce.
Objective:
This review evaluates LLM-based virtual patient systems for medical history-taking, focusing on patient types and disease scope (RQ1), techniques enhancing history-taking (RQ2), experimental designs and metrics (RQ3), and public dataset characteristics (RQ4).
Methods:
Following PRISMA guidelines, we analyzed 34 studies (2020–May 2025) from nine databases (PubMed, Scopus, Web of Science, IEEE Xplore, ACM Digital Library, SpringerLink, ERIC, arXiv, Springer) using predefined keywords.
Results:
RQ1: Systems simulate mental health, chronic, neurological, and emergency cases but lack multimorbidity and diverse patient profiles, limiting applicability. RQ2: Techniques rely chiefly on prompt design; few-shot learning and multi-agent frameworks have had limited impact. Knowledge graph (KG) integration boosts accuracy by 16.02%, and fine-tuning helps, but both need further exploration. RQ3: Reported metrics include 81.8% Top-1 accuracy, empathy ratings of 4.5/5, System Usability Scale (SUS) scores of 88.1, and robustness of 0.9412, but evaluations lack standardization and rely on small samples (10–50 students, 3–5 experts). RQ4: Datasets (e.g., MIMIC-II) are restricted by privacy constraints, hindering cross-study comparisons.
Conclusions:
LLM-based virtual patient systems demonstrate significant potential but face several limitations. Current systems predominantly focus on common diseases, lacking adequate simulation of multimorbidity, cultural diversity, and complex drug interactions, thereby reducing clinical realism. Existing datasets such as MIMIC-III are biased toward single-disease scenarios, English language, and critical care, neglecting broader linguistic and cultural contexts. Methodologically, long prompts suffer from primacy and recency effects, while few-shot learning encounters challenges in maintaining dialogue coherence. To address these issues, incorporating LLM-KG embedding methods into model training can enhance contextual understanding, while combining chain-of-thought reasoning with LoRA improves inference efficiency. Multi-agent frameworks with dialogue compression offer further optimization for real-time interactions. Future research should prioritize the development of open-access, multilingual datasets through ethical data augmentation and international collaboration, supported by regular bias audits to ensure fairness. Establishing unified evaluation frameworks with standardized metrics—such as Top-K accuracy, semantic similarity scores above 0.75, and SUS scores exceeding 80—will be essential for advancing realism, accuracy, and fairness in virtual patient systems. Clinical Trial: -
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.