Currently submitted to: JMIR Medical Informatics
Date Submitted: Mar 3, 2026
Open Peer Review Period: Mar 12, 2026 - May 7, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Evaluating Dutch-Language Ambient Listening in Simulated Clinical Encounters: Comparing Three Providers in a Multi-Speaker, Multi-Dialect Study
ABSTRACT
Background:
Clinicians spend substantial time on Electronic Health Record (EHR) documentation, often at the expense of patient interaction. Ambient listening technology uses artificial intelligence to passively record and summarize clinical encounters. While initial studies are promising, there is limited evidence on system performance in complex, non-English settings.
Objective:
To compare the documentation performance of three commercially available ambient listening systems in simulated Dutch-language outpatient consultations by assessing note completeness, correctness, and conciseness under predefined linguistic and interactional challenges.
Methods:
Standardized audio recordings of ten scripted physician–patient interactions in two specialties were used. Scenarios included multi-speaker dynamics (patient companion), conversational disruptions (nurse interruption), evasive patient communication, and a regional dialect (Gronings). Three distinct AI documentation systems (Provider A, Provider B, and Provider C) processed the audio files. Eight human raters evaluated the resulting AI-generated notes against reference summaries for Completeness, Conciseness, and Correctness using a 5-point ordinal scale. Inter-rater agreement was assessed using Gwet’s AC2. System-level technical characteristics were assessed alongside clinical performance to aid interpretation of between-vendor differences.
Results:
Across 351 ratings on a 1-5 scale, overall inter-rater agreement was high (Gwet's AC2 = 0.827). Mean scores were tightly clustered across providers (Provider C: 4.26; Provider B: 4.00; Provider A: 3.82) and were higher in Otolaryngology (mean 4.36) than in Surgical Oncology (mean 3.68). Across scoring domains, correctness received the highest mean score (4.21), while completeness received the lowest (3.81). Mean scores varied across script scenarios: dialect-specific scenarios showed the lowest mean score (3.77) and the greatest variability across providers. Median summary generation times ranged from 13.5 seconds (Provider C) to 22.0 seconds (Provider B).
Conclusions:
Ambient listening systems demonstrate good performance in Dutch clinical settings, even under conditions simulating common conversational challenges. While accuracy is generally high, performance is sensitive to linguistic variation. Future deployment studies must prioritize linguistic equity, real-world validation of efficiency gains, and evaluation of both clinician and patient perception to understand how these systems influence consultation dynamics and care delivery across diverse patient populations.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.