Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 6, 2024
Open Peer Review Period: Nov 7, 2024 - Jan 2, 2025
Date Accepted: Jan 13, 2025
Date Submitted to PubMed: Jan 24, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Virtual patients using large language models: Scalable, contextualized simulation of clinician-patient dialog with feedback
ABSTRACT
Background:
Virtual patients (VPs) are computer screen-based simulations of patient-clinician encounters. VP use is limited by cost and low scalability.
Objective:
To demonstrate proof of concept that VPs powered by large language models (LLMs) can generate authentic dialogs, accurately represent patient preferences, and provide personalized feedback on clinical performance; and to explore the use of LLMs for rating dialog and feedback quality.
Methods:
We conducted an intrinsic evaluation study rating 60 VP-clinician conversations. We used carefully engineered prompts to direct OpenAI Generative Pre-trained Transformer (GPT) to emulate a patient and provide feedback. Using 2 outpatient medicine topics (chronic cough [diagnosis] and diabetes [management]), each with permutations representing different patient preferences, we created 60 conversations (dialogs plus feedback): 48 with a human clinician, and 12 "self-chat" dialogs with GPT role-playing both the VP and clinician. Primary outcomes were dialog authenticity and feedback quality, rated using novel instruments meticulously grounded in empirical and conceptual work. Each conversation was rated by 3 physicians and also by GPT. Secondary outcomes included patient preferences represented in the dialogs, cost, and user experience.
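The patient-emulation setup described above can be sketched in Python using the OpenAI chat-completion message format. The persona wording, preference fields, and function names below are illustrative assumptions, not the study's actual prompts:

```python
# Hypothetical sketch of a prompt-driven virtual patient (VP).
# The prompt text and helper names are assumptions for illustration only.

def build_vp_system_prompt(topic: str, preferences: dict) -> str:
    """Compose a system prompt directing the model to role-play a patient
    with a given presenting topic and explicit preference permutations."""
    pref_lines = "\n".join(f"- {k}: {v}" for k, v in preferences.items())
    return (
        "You are role-playing a patient in an outpatient clinic encounter.\n"
        f"Presenting problem: {topic}.\n"
        "Express the following preferences naturally during the dialog:\n"
        f"{pref_lines}\n"
        "Stay in character and answer only what the clinician asks.\n"
        "When the encounter ends, step out of character and give the "
        "clinician personalized feedback on their performance."
    )

def build_vp_messages(topic: str, preferences: dict,
                      clinician_turns: list) -> list:
    """Assemble the message list for one VP conversation."""
    messages = [
        {"role": "system",
         "content": build_vp_system_prompt(topic, preferences)}
    ]
    for turn in clinician_turns:
        messages.append({"role": "user", "content": turn})
    return messages

messages = build_vp_messages(
    "chronic cough (diagnosis)",
    {"testing": "prefers to avoid invasive tests",
     "visit goal": "mainly wants reassurance"},
    ["Hello, what brings you in today?"],
)
```

The resulting `messages` list could then be sent to a chat-completion endpoint (e.g., with model `gpt-4-turbo` or `gpt-3.5-turbo`); for the "self-chat" condition, a second system prompt role-playing the clinician would generate the user turns instead of a human.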
Results:
The average cost per conversation was $0.51 for GPT-4.0-turbo and $0.02 for GPT-3.5-turbo. Mean (SD) conversation ratings (maximum 6) were: overall authenticity, 4.7 (0.7); overall user experience, 4.9 (0.7); and average feedback, 4.7 (0.6). For dialogs created using GPT-4.0-turbo, physician ratings of patient preferences aligned with the intended preferences in 20-47 of 48 dialogs (42%-98%). Subgroup comparisons revealed higher ratings for dialogs created with GPT-4.0-turbo than with GPT-3.5-turbo, and for human-generated than for self-chat dialogs. GPT-generated ratings of feedback quality were similar to human ratings, whereas GPT-generated authenticity ratings were significantly lower than human ratings.
Conclusions:
LLM-powered VPs can simulate patient-clinician dialogs, demonstrably represent patient preferences, and provide personalized performance feedback. This approach is scalable, globally accessible, and inexpensive. LLM-generated ratings of feedback quality are similar to human ratings.
Clinical Trial: None
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.