Currently submitted to: JMIR Formative Research

Date Submitted: Jun 4, 2026
Open Peer Review Period: Jun 5, 2026 - Jul 31, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Model-Based Virtual Patients to Probe Emergency Medicine Clinician Decision-Making About Pulmonary Embolism Testing: A Multisite Pilot Study

  • Alexander Thomas Janke
  • David Wu
  • William B Stubblefield
  • Lauren M Westafer
  • Florian F Schmitzberger
  • Adam Rodman
  • Keith E Kocher
  • Adrian D Haimovich

ABSTRACT

Background:

Emergency department (ED) clinicians commonly evaluate patients for pulmonary embolism (PE), balancing a stepwise diagnostic algorithm against the potential harms of computed tomography pulmonary angiography (CTPA). Despite widely endorsed bedside decision tools, CTPA use has grown while diagnostic yield has fallen and stabilized at 5 to 10 percent, suggesting persistent gaps in how clinicians integrate specific clinical features into the decision to test. Observational data cannot experimentally isolate the contribution of individual features, and traditional clinical simulation is expensive and difficult to scale. Large language models (LLMs) offer a potential middle ground, enabling interactive, conversational, programmatically controlled clinical scenarios delivered online.

Objective:

To develop and pilot-test an LLM-driven online clinical simulation platform designed to probe emergency medicine clinicians' decision-making about PE testing, and to assess the platform's usability, reliability, and preliminary construct validity.

Methods:

We conducted a multisite pilot study at the University of Michigan and Vanderbilt University Medical Center between October 2025 and January 2026. Emergency medicine residents and fellows completed 6 asynchronous, text-based simulation encounters delivered by a custom web application powered by OpenAI's GPT-4.1 model: 2 fixed control cases (a negative control with an ankle sprain and a positive control with a high pretest probability PE presentation) and 4 treatment cases drawn randomly from a pool of 8. The pool followed a 2 x 2 x 2 factorial structure that varied chief complaint (shortness of breath versus mid-thoracic back pain), chest pain quality (non-pleuritic versus pleuritic), and chest radiograph findings (normal versus left lower lobe consolidation). Outcomes were platform usability (System Usability Scale, SUS), case reliability (transcript review across 6 prespecified domains), and PE-directed testing behavior (D-dimer ordered, CTPA ordered, and a composite of any PE testing), with unadjusted odds ratios (ORs) estimated by univariable logistic regression.
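The case structure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's software: the names `FACTORS`, `CONTROL_CASES`, `build_treatment_pool`, and `assign_cases` are hypothetical, and the sketch assumes the 4 treatment cases are drawn without replacement, which the abstract does not state explicitly.

```python
import itertools
import random

# Factor levels as named in the Methods; the dictionary keys are illustrative labels.
FACTORS = {
    "chief_complaint": ("shortness of breath", "mid-thoracic back pain"),
    "chest_pain_quality": ("non-pleuritic", "pleuritic"),
    "chest_radiograph": ("normal", "left lower lobe consolidation"),
}

# The two fixed control cases every participant completes.
CONTROL_CASES = (
    "negative control: ankle sprain",
    "positive control: high pretest probability PE presentation",
)


def build_treatment_pool():
    """Enumerate the full 2 x 2 x 2 factorial pool (8 treatment cases)."""
    names = list(FACTORS)
    return [
        dict(zip(names, levels))
        for levels in itertools.product(*FACTORS.values())
    ]


def assign_cases(rng, n_treatment=4):
    """Return the fixed controls plus a random draw of treatment cases.

    Assumption: sampling is without replacement, so a participant
    never sees the same treatment case twice.
    """
    treatment = rng.sample(build_treatment_pool(), n_treatment)
    return list(CONTROL_CASES), treatment


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo is reproducible
    controls, treatment = assign_cases(rng)
    for case in controls:
        print("control:", case)
    for case in treatment:
        print("treatment:", case)
```

With 24 completers, this structure yields 24 x 2 = 48 control encounters and 24 x 4 = 96 treatment encounters, matching the totals reported in the Results.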

Results:

Of 28 recruited participants, 24 (85.7%) completed all 6 cases, yielding 144 simulated encounters (96 treatment, 48 control). The platform achieved a median SUS score of 78.8 (SD 13.6), and 23 of 24 participants (95.8%) scored at or above the acceptability threshold of 60. Full concordance with case specifications was observed in 5 of 6 reliability domains; deviations affected 2 of 144 encounters (1.4%). No participant initiated PE-directed testing in the negative control (0/24, 0%), and 22 of 24 (91.7%) ordered CTPA in the positive control. Across treatment cases, D-dimer was ordered in 41/96 encounters (42.7%) and CTPA in 13/96 (13.5%). Pleuritic chest pain was associated with higher odds of any PE testing (OR 3.98, 95% CI 1.70 to 9.32), as was a normal chest radiograph relative to left lower lobe consolidation (OR 4.44, 95% CI 1.88 to 10.47); the association for shortness of breath versus back pain was not statistically distinguishable from the null (OR 0.70, 95% CI 0.31 to 1.57).
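For a single binary predictor, the OR from a univariable logistic regression equals the cross-product ratio of the corresponding 2 x 2 table, and the Wald standard error of the log OR is the square root of the summed reciprocal cell counts. The sketch below shows that calculation. The function name `odds_ratio_wald` is hypothetical, and the counts in the demo are toy values chosen for illustration; they are not the study's data, since the abstract does not report cell counts.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: feature present, PE testing ordered
    b: feature present, no PE testing
    c: feature absent, PE testing ordered
    d: feature absent, no PE testing
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive for a Wald interval")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Toy counts for illustration only; NOT taken from the study.
    odds_ratio, lower, upper = odds_ratio_wald(20, 10, 10, 20)
    print(f"OR {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An interval that excludes 1, as reported for pleuritic chest pain and for chest radiograph findings, indicates an association distinguishable from the null at the conventional level; an interval that spans 1, as reported for chief complaint, does not.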

Conclusions:

An LLM-driven online simulation platform delivered acceptable usability, faithful case reproduction, and clinically coherent variation in PE testing behavior across systematically varied clinical features. The platform supports future larger-scale, adequately powered studies of clinical reasoning and offers a scalable instrument for experimental research on diagnostic decision-making.


 Citation

Please cite as:

Janke AT, Wu D, Stubblefield WB, Westafer LM, Schmitzberger FF, Rodman A, Kocher KE, Haimovich AD

Large Language Model-Based Virtual Patients to Probe Emergency Medicine Clinician Decision-Making About Pulmonary Embolism Testing: A Multisite Pilot Study

JMIR Preprints. 04/06/2026:103522

DOI: 10.2196/preprints.103522

URL: https://preprints.jmir.org/preprint/103522


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.