Currently submitted to: JMIR Medical Education
Date Submitted: Mar 25, 2026
Open Peer Review Period: Mar 26, 2026 - May 21, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Large Language Models for the Classification of Verbal Communication in Dementia Care
ABSTRACT
Background:
Effective verbal communication is a core component of nursing care, particularly in dementia care approaches such as Humanitude. However, manual evaluation of communication quality is time-consuming, subjective, and difficult to scale in training settings. Large language models (LLMs) may enable automated and scalable analysis of verbal communication in caregiving.
Objective:
This study evaluated whether LLMs can reliably classify verbal communication in nursing care training sessions and detect differences in communication patterns across caregiver expertise levels.
Methods:
Care sessions involving simulated patients were conducted with 18 participants, including Humanitude instructors, intermediate practitioners, and novice nurses. Audio recordings were transcribed, segmented into utterances, and classified into 6 communication categories: positive/affectionate expression, request/suggestion, gratitude, explanation, question/confirmation, and none. Four human annotators independently labeled the utterances, and the same transcripts were analyzed using GPT, Claude, and Gemini. Agreement was evaluated using pairwise agreement rates and Cohen’s kappa coefficients. Model performance was further assessed against consensus labels derived from multiple annotators, and non-inferiority/equivalence was tested using two one-sided tests (TOST).
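The inter-annotator agreement statistics named above can be sketched in a few lines. This is a minimal illustration of pairwise agreement and Cohen's kappa, not code from the study; the function name and the toy label sequences are invented for the example.

```python
from collections import Counter

def pairwise_agreement(labels_a, labels_b):
    """Fraction of utterances to which both annotators assigned the same label."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = pairwise_agreement(labels_a, labels_b)
    # Expected agreement if the two annotators labeled independently,
    # each drawing from their own empirical label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[c] * cb[c] for c in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example with two of the study's six categories plus "none".
a = ["positive", "none", "positive", "gratitude"]
b = ["positive", "none", "none", "gratitude"]
print(pairwise_agreement(a, b))  # 0.75
print(cohens_kappa(a, b))
```

In practice a library implementation (e.g., scikit-learn's `cohen_kappa_score`) would typically be used; the hand-rolled version above only makes the chance-correction step explicit.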
Results:
Agreement among the four human annotators was moderate, with pairwise agreement rates ranging from 64.44% to 74.21% and Cohen's kappa values ranging from 0.554 to 0.664. Among the evaluated LLMs, Claude showed the highest agreement with human annotations, followed by Gemini and GPT. Against consensus labels, Claude achieved the highest accuracy (0.836 for ≥2-annotator consensus; 0.902 for ≥3-annotator consensus), followed by Gemini (0.779; 0.837) and GPT (0.672; 0.732). TOST analysis showed that Gemini achieved statistical equivalence with human annotation (p=0.040), while Claude demonstrated non-inferiority and exceeded the human baseline (p=0.001). Across caregiver groups, instructors showed a higher proportion of positive/affectionate expressions, whereas novice caregivers showed a higher proportion of task-oriented and uncategorized utterances. Overall, LLM-based classification reproduced the general communication patterns observed in human annotations.
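The two one-sided tests (TOST) procedure used for the equivalence claims above can be sketched as follows. This is a generic normal-approximation TOST for the difference of two proportions, offered only as an illustration of the logic; the equivalence margin, sample sizes, and function name are assumptions, not values from the study.

```python
from math import erf, sqrt

def tost_two_proportions(p1, n1, p2, n2, margin):
    """TOST equivalence test for two proportions (normal approximation).

    Tests H0: |p1 - p2| >= margin against H1: |p1 - p2| < margin.
    Returns the equivalence p-value: the larger of the two one-sided
    p-values, so a small value supports equivalence within the margin.
    """
    d = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    cdf = lambda z: 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    p_lower = 1 - cdf((d + margin) / se)  # one-sided test of d <= -margin
    p_upper = cdf((d - margin) / se)      # one-sided test of d >= +margin
    return max(p_lower, p_upper)

# Illustrative only: identical accuracies of 0.70 on 200 utterances each,
# with a hypothetical equivalence margin of 0.10.
print(tost_two_proportions(0.70, 200, 0.70, 200, margin=0.10))
```

With a wide margin the example yields p < 0.05 (equivalence supported); shrinking the margin to 0.01 with the same data yields a large p-value, showing how the conclusion depends on the pre-specified margin.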
Conclusions:
LLM-based classification demonstrated reliability comparable to human annotation for caregiving communication analysis. Claude showed the strongest overall performance, and Gemini achieved statistical equivalence with human annotation. These findings suggest that LLM-based analysis may provide a scalable and objective approach to assessing communication behaviors in Humanitude training and support communication assessment in nursing and medical education. Clinical Trial: Gunma University Hospital (HS2024-044)
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.