
Currently submitted to: JMIR Medical Education

Date Submitted: Mar 25, 2026
Open Peer Review Period: Mar 26, 2026 - May 21, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Models for the Classification of Verbal Communication in Dementia Care

  • Atsushi Nakazawa; 
  • Yuto Hogan; 
  • Yuri Nagai; 
  • Mio Ito

ABSTRACT

Background:

Effective verbal communication is a core component of nursing care, particularly in structured dementia care approaches such as Humanitude. However, manual evaluation of communication quality is time-consuming, subjective, and difficult to scale in training settings. Large language models (LLMs) may enable automated and scalable analysis of verbal communication in caregiving.

Objective:

This study evaluated whether LLMs can reliably classify verbal communication in nursing care training sessions and detect differences in communication patterns across caregiver expertise levels.

Methods:

Care sessions involving simulated patients were conducted with 18 participants, including Humanitude instructors, intermediate practitioners, and novice nurses. Audio recordings were transcribed, segmented into utterances, and classified into 6 communication categories: positive/affectionate expression, request/suggestion, gratitude, explanation, question/confirmation, and none. Four human annotators independently labeled the utterances, and the same transcripts were analyzed using GPT, Claude, and Gemini. Agreement was evaluated using pairwise agreement rates and Cohen's kappa coefficients. Model performance was further assessed against consensus labels derived from multiple annotators, and non-inferiority/equivalence was tested using the two one-sided tests (TOST) procedure.
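The agreement statistic named above can be sketched as follows. This is a minimal, illustrative implementation of Cohen's kappa for two annotators, not the study's actual analysis pipeline; the category labels are drawn from the paper's scheme, but the example data values are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented example using the study's category names.
annotator_1 = ["gratitude", "explanation", "question/confirmation",
               "explanation", "none"]
annotator_2 = ["gratitude", "explanation", "explanation",
               "explanation", "none"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # → 0.706
```

Kappa discounts the agreement expected by chance, which is why the reported kappa values (0.554–0.664) sit below the raw pairwise agreement rates (64.44%–74.21%).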

Results:

Inter-annotator agreement among the human annotators was moderate, with pairwise agreement rates ranging from 64.44% to 74.21% and Cohen’s kappa values ranging from 0.554 to 0.664. Among the evaluated LLMs, Claude showed the highest agreement with human annotations, followed by Gemini and GPT. Against consensus labels, Claude achieved the highest accuracy (0.836 for ≥2-annotator consensus; 0.902 for ≥3-annotator consensus), followed by Gemini (0.779; 0.837) and GPT (0.672; 0.732). TOST analysis showed that Gemini achieved statistical equivalence with human annotation (p=0.040), while Claude demonstrated non-inferiority and exceeded the human baseline (p=0.001). Across caregiver groups, instructors showed a higher proportion of positive/affectionate expressions, whereas novice caregivers showed a higher proportion of task-oriented and uncategorized utterances. Overall, LLM-based classification reproduced the general communication patterns observed in human annotations.
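The equivalence testing mentioned above can be illustrated with a generic TOST sketch. The abstract does not specify the study's exact test statistic or equivalence margin, so this is a simple z-based version under assumed inputs; the function name, margin, and example values are all illustrative.

```python
from statistics import NormalDist

def tost_equivalence(diff, se, margin, alpha=0.05):
    """Two one-sided z-tests for equivalence.

    diff:   observed difference (e.g., model accuracy minus human baseline)
    se:     standard error of the difference
    margin: equivalence margin; H0 is |true difference| >= margin
    Returns (p_value, equivalent), where p_value is the larger of the
    two one-sided p-values.
    """
    nd = NormalDist()
    # Test 1: true difference > -margin (rules out being worse by >= margin).
    p_lower = 1 - nd.cdf((diff + margin) / se)
    # Test 2: true difference < +margin (rules out being better by >= margin).
    p_upper = nd.cdf((diff - margin) / se)
    p = max(p_lower, p_upper)
    return p, p < alpha

# Illustrative values only: zero observed difference, SE 0.05, margin 0.1.
p, equivalent = tost_equivalence(diff=0.0, se=0.05, margin=0.1)
print(round(p, 3), equivalent)  # → 0.023 True
```

Equivalence is declared only if both one-sided tests reject, which is why the overall p-value is the maximum of the two; non-inferiority, as reported for Claude, requires rejecting only the lower-side test.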

Conclusions:

LLM-based classification demonstrated reliability comparable to human annotation for caregiving communication analysis. Claude showed the strongest overall performance, and Gemini achieved statistical equivalence with human annotation. These findings suggest that LLM-based analysis may provide a scalable and objective approach to assessing communication behaviors in Humanitude training and support communication assessment in nursing and medical education.

Clinical Trial: Gunma University Hospital (HS2024-044)


Citation

Please cite as:

Nakazawa A, Hogan Y, Nagai Y, Ito M

Large Language Models for the Classification of Verbal Communication in Dementia Care

JMIR Preprints. 25/03/2026:95668

DOI: 10.2196/preprints.95668

URL: https://preprints.jmir.org/preprint/95668


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.