Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 17, 2024
Open Peer Review Period: Jan 18, 2024 - Mar 14, 2024
Date Accepted: Jul 5, 2024
Performance of Large Language Models in Patient Complaint Resolution
ABSTRACT
Background:
Patient complaints are a perennial challenge faced by healthcare institutions globally, requiring extensive time and effort from healthcare workers. Despite these efforts, patient dissatisfaction remains high. Recent studies on the utility of large language models (LLMs), such as the GPT models developed by OpenAI, in the healthcare sector have shown great promise, with LLMs able to provide more detailed and empathetic responses than physicians. LLMs could potentially be used to respond to patient complaints, improving both patient satisfaction and complaint response time.
Objective:
This study aimed to evaluate the performance of LLMs in addressing patient complaints received by a tertiary healthcare institution, with the goal of enhancing patient satisfaction.
Methods:
Anonymized patient complaint emails and the associated responses from the Patient Relations Department (PRD) were obtained. ChatGPT-4.0 was provided with the same complaint emails and tasked with generating responses. The complaints and the respective responses were uploaded into a web-based questionnaire. Respondents were asked to rate both responses on a 10-point Likert scale for 4 items: appropriateness, completeness, empathy, and satisfaction. Participants were also asked to choose a preferred response at the end of each scenario.
Results:
There were a total of 188 respondents, of whom 61.2% were healthcare workers. A majority of respondents, both healthcare and non-healthcare workers, preferred the ChatGPT replies (87.2% to 97.3%). GPT-4 responses were rated higher on all 4 assessed items [median score 8 (interquartile range, IQR 7-9)] compared with human responses [appropriateness 5 (IQR 3-7), empathy 4 (IQR 3-6), quality 5 (IQR 3-6), satisfaction 5 (IQR 3-6)] (P<.001). Regression analyses showed that a higher word count significantly predicted higher scores on all 4 items (GPT-4 average word count 238 words vs 76 words for human responses, P<.001). However, on subgroup analysis by authorship, this held true only for responses written by PRD staff and not for those generated by ChatGPT, which received consistently high scores irrespective of response length.
Conclusions:
This study provides significant evidence supporting the effectiveness of LLMs in patient complaint resolution. ChatGPT demonstrated superior response appropriateness, empathy, quality, and overall satisfaction compared with actual human responses to patient complaints. Future research could measure the degree of improvement that artificial intelligence (AI)-generated responses bring in terms of time savings, cost-effectiveness, patient satisfaction, and stress reduction for the healthcare system.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.