
Accepted for/Published in: JMIR Medical Education

Date Submitted: Apr 5, 2024
Date Accepted: Jun 27, 2024

The final, peer-reviewed published version of this preprint can be found here:

Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, Mahling M

A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study

JMIR Med Educ 2024;10:e59213

DOI: 10.2196/59213

PMID: 39150749

PMCID: 11364946

A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study

  • Friederike Holderried; 
  • Christian Stegemann-Philipps; 
  • Anne Herrmann-Werner; 
  • Teresa Festl-Wietek; 
  • Martin Holderried; 
  • Carsten Eickhoff; 
  • Moritz Mahling

ABSTRACT

Background:

History taking is fundamental to diagnosing medical conditions, but teaching and providing feedback on this skill can be challenging because of constraints on patient and staff resources. Virtual simulated patients and web-based chatbots have emerged as educational tools, and recent advancements in artificial intelligence (AI), such as large language models (LLMs), have enhanced their realism and their potential to provide feedback.

Objective:

This study aimed to evaluate the effectiveness of a Generative Pre-trained Transformer 4 (GPT-4) model in providing structured feedback on medical students' performance in history taking with a simulated patient.

Methods:

We conducted a prospective study in which medical students performed history taking with a GPT-powered chatbot. The chatbot was designed to simulate patient responses and to provide immediate feedback on the comprehensiveness of the students' history taking. We analyzed the students' interactions and compared the chatbot's feedback with that of a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of the feedback.
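As an illustration of this kind of setup, the sketch below shows how a simulated-patient chatbot with automated feedback could be wired up against the OpenAI chat API. The system prompt, model parameters, and rubric wording are hypothetical placeholders; this abstract does not disclose the study's actual prompts or configuration.

```python
# Minimal sketch of a GPT-powered simulated patient with automated feedback.
# All prompts and settings here are illustrative assumptions, not the
# study's materials. Requires the `openai` package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical role prompt for the simulated patient.
PATIENT_PROMPT = (
    "You are a simulated patient presenting with acute abdominal pain. "
    "Answer the medical student's questions in character, and reveal "
    "clinical details only when you are asked about them."
)

history = [{"role": "system", "content": PATIENT_PROMPT}]

def patient_reply(student_question: str) -> str:
    """Send the student's question and return the simulated patient's answer."""
    history.append({"role": "user", "content": student_question})
    response = client.chat.completions.create(model="gpt-4", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

def structured_feedback() -> str:
    """Grade the finished interview against a (hypothetical) category checklist."""
    transcript = "\n".join(m["content"] for m in history[1:])
    grading = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "For each history-taking category (e.g., onset, duration, "
                "medication, allergies), state whether the student covered "
                "it in this transcript:\n\n" + transcript
            ),
        }],
    )
    return grading.choices[0].message.content
```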

Results:

The study included 106 participants, most of them in their third year of medical school. A total of 1894 question-answer pairs (QAPs) from 106 conversations were included in the analysis. GPT-4's role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed "almost perfect" agreement (Cohen's κ=0.832). Lower agreement (κ<0.6) was detected for 8 of the 45 feedback categories, highlighting areas where the model's assessments were overly specific or diverged from human judgment.
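For reference, Cohen's κ corrects the observed agreement p_o for the agreement p_e expected by chance, κ = (p_o − p_e) / (1 − p_e), so κ=0.832 indicates agreement well beyond chance. A minimal sketch of the computation, using invented ratings rather than the study's data:

```python
# Sketch: Cohen's kappa between GPT-4 and a human rater on binary
# "covered" (1) / "not covered" (0) labels per feedback category.
# The label vectors are invented placeholders, not data from the study.
from sklearn.metrics import cohen_kappa_score

gpt4_labels  = [1, 1, 0, 1, 0, 1, 1, 0]
human_labels = [1, 1, 0, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(gpt4_labels, human_labels)
print(f"Cohen's kappa: {kappa:.3f}")  # values above 0.8 are conventionally "almost perfect"
```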

Conclusions:

The GPT model was effective in providing structured feedback on history-taking dialogues performed by medical students. Although we identified some limitations in the specificity of feedback for certain categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. This study supports the integration of AI-driven feedback mechanisms into medical training and highlights important aspects to consider when LLMs are employed in this context.


Citation

Please cite as:

Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, Mahling M

A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study

JMIR Med Educ 2024;10:e59213

DOI: 10.2196/59213

PMID: 39150749

PMCID: 11364946
