
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Apr 11, 2024
Date Accepted: Nov 11, 2024

The final, peer-reviewed published version of this preprint can be found here:

Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment

Wang C, Li S, Lin N, Zhang X, Han Y, Wang X, Liu D, Tan X, Pu D, Li K, Qian G, Yin R

J Med Internet Res 2025;27:e59435

DOI: 10.2196/59435

PMID: 39742453

PMCID: 11736217

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Application of Large Language Models in Medical Training Evaluation: Can ChatGPT Be a Standardized Patient? An Exploratory Study

  • Chenxu Wang; 
  • Shuhan Li; 
  • Nuoxi Lin; 
  • Xinyu Zhang; 
  • Ying Han; 
  • Xiandi Wang; 
  • Di Liu; 
  • Xiaomei Tan; 
  • Dan Pu; 
  • Kang Li; 
  • Guangwu Qian; 
  • Rong Yin

ABSTRACT

Background:

With growing interest in applying large language models (LLMs) in medicine, their feasibility as standardized patients (SPs) in medical assessment has rarely been evaluated. We therefore explored the potential of ChatGPT, a representative LLM, to transform medical education by serving as a cost-effective alternative to human SPs, specifically for history-taking tasks.

Objective:

This study aims to explore ChatGPT's viability and performance as an SP, using prompt engineering to refine its accuracy and utility in medical assessments.

Methods:

A two-phase experiment was designed to assess ChatGPT's viability as an SP in medical education. The first phase tested feasibility by simulating history-taking conversations about inflammatory bowel disease (IBD), categorizing inquiries as poor, medium, or good based on relevance and accuracy. In the second phase, a more structured experiment used detailed scripts to evaluate ChatGPT's performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Prompts were adjusted to address shortcomings in ChatGPT's responses, and performance under the original and revised prompts was compared to track improvements. The methodology included statistical analysis to ensure rigorous evaluation, with data collected between November and December 2023.
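As a rough illustration of the setup described above, the sketch below assembles a chat prompt in which ChatGPT plays an SP following an IBD script, with the added constraints the authors report (plain language, concise script-consistent answers). The script wording, constraint phrasing, and use of the OpenAI chat API are illustrative assumptions, not the study's actual materials.

```python
# Hypothetical prompt scaffolding for one history-taking turn with a
# simulated standardized patient (SP). Script and constraints are
# placeholders, not the study's real materials.

SP_SCRIPT = (
    "You are a 28-year-old patient with a 6-month history of intermittent "
    "abdominal pain and bloody diarrhea (undiagnosed inflammatory bowel "
    "disease). Answer only what the student asks."
)

REVISED_CONSTRAINTS = (
    "Speak as a layperson and avoid medical jargon. "
    "Keep each answer accurate, concise, and consistent with the script. "
    "Do not volunteer information beyond the question."
)

def build_sp_messages(student_question: str, revised: bool = True) -> list:
    """Assemble the chat messages for one turn of the simulated interview."""
    system = SP_SCRIPT + (" " + REVISED_CONSTRAINTS if revised else "")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": student_question},
    ]

# Sending one turn would require an OpenAI API key, e.g.:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4",
#     messages=build_sp_messages("When did the pain start?"),
# )
# print(reply.choices[0].message.content)
```

Comparing runs with `revised=False` against `revised=True` mirrors the paper's original-versus-revised prompt comparison.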

Results:

The feasibility test confirmed ChatGPT's ability to simulate an SP effectively, differentiating between poor, medium, and good medical inquiries with varying degrees of accuracy. At a significance level of α=.05, score differences between the poor (mean 74.70, SD 5.44) and medium (mean 82.67, SD 5.30) inquiry groups (P<.001) and between the poor and good (mean 85.00, SD 3.27) inquiry groups (P<.001) were significant, whereas the difference between the medium and good inquiry groups was not (P=.158). The feasibility test comprised 90 runs. However, performance was not ideal without proper prompt restrictions. Subsequent enhancements used revised prompts instructing ChatGPT to avoid medical jargon (for realism), to provide accurate and concise responses (for clinical accuracy), and to follow specific grading instructions (for grading accuracy and adaptability). The second experimental phase comprised 300 trials. The revised prompt significantly improved ChatGPT's realism, clinical accuracy, and adaptability, markedly reducing scoring discrepancies: scoring accuracy improved 4.926-fold over the unrevised prompt, with the score difference percentage (SDP) dropping from 29.83% to 6.06% and its standard deviation from 0.55 to 0.068.

Conclusions:

ChatGPT, as a representative LLM, is a viable tool for simulating SPs in medical assessments, with the potential to enhance medical training. With detailed, targeted prompts, ChatGPT's scoring accuracy and response realism improve significantly, approaching feasibility for actual clinical use. Despite these promising outcomes, however, continuous refinement is essential to fully establish the reliability of LLMs such as ChatGPT in clinical assessment settings.


 Citation

Please cite as:

Wang C, Li S, Lin N, Zhang X, Han Y, Wang X, Liu D, Tan X, Pu D, Li K, Qian G, Yin R

Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment

J Med Internet Res 2025;27:e59435

DOI: 10.2196/59435

PMID: 39742453

PMCID: 11736217


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.