Accepted for/Published in: JMIR Formative Research

Date Submitted: Sep 14, 2024
Open Peer Review Period: Sep 19, 2024 - Nov 14, 2024
Date Accepted: Jan 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

Novel Evaluation Metric and Quantified Performance of ChatGPT-4 Patient Management Simulations for Early Clinical Education: Experimental Study

Scherr R, Spina A, Dao A, Andalib S, Halaseh FF, Blair S, Rivera RJ

JMIR Form Res 2025;9:e66478

DOI: 10.2196/66478

PMID: 40013991

PMCID: 11884304

Novel Evaluation Metric and Quantified Performance of ChatGPT-4 Patient Management Simulations for Early Clinical Education: Experimental Study

  • Riley Scherr
  • Aidin Spina
  • Allen Dao
  • Saman Andalib
  • Faris F Halaseh
  • Sarah Blair
  • Ronald J Rivera

ABSTRACT

Background:

Case studies have shown that ChatGPT can run clinical simulations at the medical student level. However, no study has assessed ChatGPT’s reliability in meeting desired simulation criteria such as medical accuracy, simulation formatting, and robust feedback mechanisms.

Objective:

To quantify ChatGPT’s ability to consistently follow formatting instructions and create simulations for preclinical medical student learners according to principles of medical simulation and multimedia educational technology.

Methods:

Using ChatGPT-4 and a pre-validated starting prompt, the authors ran 360 separate simulations of an acute asthma exacerbation. Of these, 180 simulations were given correct answers and 180 were given incorrect answers. ChatGPT was evaluated for its ability to adhere to basic simulation parameters (stepwise progression, free response, interactivity), advanced simulation parameters (autonomous conclusion, delayed feedback, comprehensive feedback), and medical accuracy (vignette, treatment updates, feedback). Significance was determined with chi-squared analyses using 95% confidence intervals for odds ratios.
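As an illustration of the kind of analysis described above, the following is a minimal sketch of a chi-squared test and an odds ratio with a 95% confidence interval for a single 2x2 table. It is not the authors' code. The function name `chi2_and_odds_ratio` is hypothetical, the test is assumed to be an uncorrected Pearson chi-squared test, and the interval is assumed to be a Wald interval on the log odds ratio; the abstract does not state either choice. The example counts are hypothetical values back-calculated from rounded percentages (87% and 24% of 180) and are not the study's raw data.

```python
import math


def chi2_and_odds_ratio(a, b, c, d, z=1.96):
    """Analyse a 2x2 table (hypothetical helper, not the authors' code).

    a, b: simulations in arm 1 that met / did not meet a criterion
    c, d: simulations in arm 2 that met / did not meet the same criterion
    All four cells must be non-zero for the odds ratio and its interval.
    """
    n = a + b + c + d
    # Pearson chi-squared statistic for a 2x2 table, no continuity correction
    # (assumption: the abstract does not say whether a correction was used).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail p-value is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Odds ratio with a Wald interval on the log scale (assumed method).
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return chi2, p_value, odds_ratio, (lower, upper)


# Hypothetical counts approximating 87% and 24% of 180 simulations per arm.
chi2, p_value, odds_ratio, ci = chi2_and_odds_ratio(157, 23, 43, 137)
print(f"chi2={chi2:.1f}, p={p_value:.3g}, OR={odds_ratio:.1f}, "
      f"95% CI=({ci[0]:.1f}, {ci[1]:.1f})")
```

A confidence interval for the odds ratio that excludes 1 corresponds to a significant difference between the two arms at the 5% level.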

Results:

All simulations (100%) met basic simulation parameters and were medically accurate. For advanced parameters, 55% of all simulations delayed feedback, and the Correct arm delayed feedback significantly more often than the Incorrect arm (87% vs 24%; p<0.001). Overall, 79% of simulations concluded autonomously, with no significant difference between the Correct and Incorrect arms (81% vs 77%; p=0.364). Likewise, 78% of simulations gave comprehensive feedback, with no significant difference between the Correct and Incorrect arms (76% vs 81%; p=0.306). ChatGPT-4 was significantly more likely to conclude simulations autonomously (p<0.001) and to provide comprehensive feedback (p<0.001) when feedback was delayed than when it was not.

Conclusions:

ChatGPT simulations performed perfectly on medical accuracy and basic simulation parameters, and performed well on comprehensive feedback and autonomous conclusion. Whether feedback was delayed depended on the accuracy of user inputs. A simulation meeting one advanced parameter was more likely to meet all advanced parameters. These simulations have the potential to be a reliable educational tool for simple scenarios and can be evaluated with a novel nine-part metric. Further work is needed to ensure consistent performance across a broader range of simulation scenarios.
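The nine criteria named in the Methods (three basic, three advanced, three for medical accuracy) could be recorded per simulation roughly as follows. This is a sketch under stated assumptions, not the authors' scoring instrument: the class `SimulationScore`, its field names, the `GROUPS` mapping, and both helper functions are hypothetical, and each criterion is assumed to be scored as simply met or not met.

```python
from dataclasses import dataclass, fields


@dataclass
class SimulationScore:
    """One simulation scored on nine criteria (field names are hypothetical)."""
    # Basic simulation parameters
    stepwise_progression: bool
    free_response: bool
    interactivity: bool
    # Advanced simulation parameters
    autonomous_conclusion: bool
    delayed_feedback: bool
    comprehensive_feedback: bool
    # Medical accuracy
    accurate_vignette: bool
    accurate_treatment_updates: bool
    accurate_feedback: bool


# Grouping of the nine criteria into the three categories named in the abstract.
GROUPS = {
    "basic": ("stepwise_progression", "free_response", "interactivity"),
    "advanced": ("autonomous_conclusion", "delayed_feedback",
                 "comprehensive_feedback"),
    "medical_accuracy": ("accurate_vignette", "accurate_treatment_updates",
                         "accurate_feedback"),
}


def summarize(score):
    """Return, per category, whether every criterion in it was met."""
    return {group: all(getattr(score, name) for name in names)
            for group, names in GROUPS.items()}


def pass_rates(scores):
    """Return, per criterion, the fraction of simulations that met it."""
    total = len(scores)
    return {f.name: sum(getattr(s, f.name) for s in scores) / total
            for f in fields(SimulationScore)}


# Hypothetical example: one simulation that gave feedback too early.
example = SimulationScore(True, True, True, True, False, True, True, True, True)
print(summarize(example))
```

Aggregating `pass_rates` over all simulations in an arm would give per-criterion percentages of the kind reported in the Results.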


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.