Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Mar 20, 2025
Date Accepted: May 16, 2025
Patient Complaints Classification using Artificial Intelligence-Powered Large Language Models: An Analytical Cross-Sectional Study
ABSTRACT
Background:
Patient complaints offer actionable insights for quality improvement and safety. Artificial intelligence (AI) can facilitate the analysis of complaints, but its accuracy in categorizing complaints requires further evaluation.
Objective:
To categorise patient complaints in primary care using the Healthcare Complaint Analysis Tool (HCAT) General Practice (GP) and evaluate AI-powered categorization of complaints.
Methods:
This analytical cross-sectional study analysed 1,816 anonymous patient complaints from seven public primary care clinics in Singapore. Complaints were first coded by trained human coders using the HCAT (GP) taxonomy. Three large language models (LLMs), namely GPT (Generative Pre-trained Transformer)-3.5 turbo, GPT-4o mini, and Claude 3.5 Sonnet, were then employed to validate the manual classification and identify complaint themes. LLM classifications were assessed using accuracy, sensitivity, specificity, and F-scores. Cohen's kappa was used to evaluate AI-human agreement, and McNemar's test was used to compare concordance between AI models.
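The evaluation measures named above can be illustrated with a minimal Python sketch. It is not the study's analysis code. It assumes the human coding is the reference standard, computes the metrics one-vs-rest for a single category, and uses the exact (binomial) form of McNemar's test; the abstract does not state which variant or averaging scheme the authors used. The labels and function names are hypothetical toy values, not study data.

```python
from collections import Counter
from math import comb


def binary_metrics(reference, predicted, positive):
    """One-vs-rest accuracy, sensitivity, specificity and F1 for one category.

    `reference` is the human coding (assumed here to be the gold standard).
    """
    tp = fp = fn = tn = 0
    for ref, pred in zip(reference, predicted):
        if ref == positive and pred == positive:
            tp += 1
        elif ref != positive and pred == positive:
            fp += 1
        elif ref == positive and pred != positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else float("nan")


def mcnemar_exact(reference, model_a, model_b):
    """Two-sided exact McNemar p-value on the discordant pairs of two models."""
    a_only = sum(a == r and b != r for r, a, b in zip(reference, model_a, model_b))
    b_only = sum(a != r and b == r for r, a, b in zip(reference, model_a, model_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    tail = sum(comb(n, i) for i in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy labels, for illustration only.
human = ["management", "relationship", "management", "clinical", "management", "relationship"]
llm_a = ["management", "relationship", "clinical", "clinical", "management", "management"]
llm_b = ["management", "management", "clinical", "management", "relationship", "management"]

m = binary_metrics(human, llm_a, positive="management")
kappa = cohen_kappa(human, llm_a)
p_value = mcnemar_exact(human, llm_a, llm_b)
```

With real data, the same functions would be applied per HCAT (GP) field and per model; multi-class F-scores would additionally require an averaging choice (macro or weighted), which the abstract does not specify.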
Results:
Most complaints were related to management (59.4%) and institutional processes (45.7%), were of medium severity (54.7%), occurred within the practice (34.5%), and resulted in minimal harm (75.4%). The LLMs achieved moderate to good accuracy (60.4%–95.5%) in HCAT (GP) field classifications, with GPT-4o mini generally outperforming GPT-3.5 turbo, except in severity classification. All three LLMs demonstrated moderate concordance rates (average 61.9%–68.8%) in complaint classification, with varying levels of agreement (κ = 0.114–0.623). GPT-4o mini and Claude 3.5 Sonnet significantly outperformed GPT-3.5 turbo in several fields (p < 0.05). Claude's thematic analysis identified long wait times (21.6%), staff attitudes (15.8%), and appointment booking issues (10.5%) as the top concerns, together accounting for nearly half of all complaints.
Conclusions:
While GPT-4o mini and Claude 3.5 Sonnet demonstrated promising results, further fine-tuning and model training are required to improve accuracy. Integrating AI into complaint analysis can facilitate proactive identification of systemic issues, ultimately enhancing quality improvement and patient safety.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.