
Accepted for/Published in: JMIR Medical Education

Date Submitted: Apr 5, 2024
Date Accepted: Jun 27, 2024

The final, peer-reviewed published version of this preprint can be found here:

Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, Mahling M

A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study

JMIR Med Educ 2024;10:e59213

DOI: 10.2196/59213

PMID: 39150749

PMCID: 11364946

A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study

  • Friederike Holderried; 
  • Christian Stegemann-Philipps; 
  • Anne Herrmann-Werner; 
  • Teresa Festl-Wietek; 
  • Martin Holderried; 
  • Carsten Eickhoff; 
  • Moritz Mahling

ABSTRACT

Background:

History taking is fundamental to diagnosing medical conditions, but teaching and providing feedback on this skill can be challenging because of constraints on patient and staff resources. Virtual simulated patients and web-based chatbots have emerged as educational tools, and recent advancements in artificial intelligence (AI), such as large language models (LLMs), have enhanced their realism and their potential to provide feedback.

Objective:

This study aimed to evaluate the effectiveness of a Generative Pre-trained Transformer 4 (GPT-4) model in providing structured feedback on medical students' performance in history taking with a simulated patient.

Methods:

We conducted a prospective study in which medical students performed history taking with a GPT-powered chatbot. The chatbot was designed to simulate patient responses and to provide immediate feedback on the comprehensiveness of the students' history taking. We analyzed the students' interactions and compared the chatbot's feedback with that of a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of the feedback.
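As an illustration of this kind of setup, the sketch below shows how a simulated-patient chatbot with automated feedback could be wired up against the OpenAI chat API. The system prompt, model parameters, and rubric wording are hypothetical placeholders; this abstract does not disclose the study's actual prompts or configuration.

```python
# Minimal sketch of a GPT-powered simulated patient with automated feedback.
# All prompts and settings here are illustrative assumptions, not the
# study's materials. Requires the `openai` package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical role prompt for the simulated patient.
PATIENT_PROMPT = (
    "You are a simulated patient presenting with acute abdominal pain. "
    "Answer the medical student's questions in character, and reveal "
    "clinical details only when you are asked about them."
)

history = [{"role": "system", "content": PATIENT_PROMPT}]

def patient_reply(student_question: str) -> str:
    """Send the student's question and return the simulated patient's answer."""
    history.append({"role": "user", "content": student_question})
    response = client.chat.completions.create(model="gpt-4", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

def structured_feedback() -> str:
    """Grade the finished interview against a (hypothetical) category checklist."""
    transcript = "\n".join(m["content"] for m in history[1:])
    grading = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "For each history-taking category (e.g., onset, duration, "
                "medication, allergies), state whether the student covered "
                "it in this transcript:\n\n" + transcript
            ),
        }],
    )
    return grading.choices[0].message.content
```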

Results:

The study included 106 participants, most of them in their third year of medical school. A total of 1894 question-answer pairs (QAPs) from 106 conversations were included in the analysis. GPT-4's role-play and responses were medically plausible in more than 99% of cases. Interrater reliability between GPT-4 and the human rater showed "almost perfect" agreement (Cohen's κ=0.832). Lower agreement (κ<0.6) was detected for 8 of the 45 feedback categories, highlighting areas where the model's assessments were overly specific or diverged from human judgment.
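For reference, Cohen's κ corrects the observed agreement p_o for the agreement p_e expected by chance, κ = (p_o − p_e) / (1 − p_e), so κ=0.832 indicates agreement well beyond chance. A minimal sketch of the computation, using invented ratings rather than the study's data:

```python
# Sketch: Cohen's kappa between GPT-4 and a human rater on binary
# "covered" (1) / "not covered" (0) labels per feedback category.
# The label vectors are invented placeholders, not data from the study.
from sklearn.metrics import cohen_kappa_score

gpt4_labels  = [1, 1, 0, 1, 0, 1, 1, 0]
human_labels = [1, 1, 0, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(gpt4_labels, human_labels)
print(f"Cohen's kappa: {kappa:.3f}")  # values above 0.8 are conventionally "almost perfect"
```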

Conclusions:

The GPT model was effective in providing structured feedback on history-taking dialogues performed by medical students. Although we identified some limitations in the specificity of feedback for certain categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. This study supports the integration of AI-driven feedback mechanisms into medical training and highlights important aspects to consider when LLMs are employed in this context.


Citation

Please cite as:

Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, Mahling M

A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study

JMIR Med Educ 2024;10:e59213

DOI: 10.2196/59213

PMID: 39150749

PMCID: 11364946
