Currently submitted to: Journal of Medical Internet Research
Date Submitted: Apr 16, 2026
Open Peer Review Period: Apr 17, 2026 - Jun 12, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling
ABSTRACT
Background:
Continuous glucose monitoring (CGM) is central to modern diabetes care, but explaining CGM patterns clearly, consistently, and empathetically remains time-intensive in practice. Large language model (LLM)–based systems may support patient-facing interpretation of CGM data, but evidence remains limited for retrieval-grounded tools evaluated against clinician-authored responses in counseling scenarios. The system was intended for structured CGM interpretation and communication support rather than autonomous therapeutic decision making.
Objective:
To evaluate whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations by generating responses to questions arising during CGM-informed diabetes counseling, with quality comparable to clinician-authored responses.
Methods:
We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system was designed to provide plain-language explanations of CGM patterns and responses to diabetes management questions while avoiding directive or individualized medical advice, such as recommending medication initiation, dose adjustment, or regimen changes. Twelve CGM-informed cases, each comprising a de-identified CGM trace, a synthetic patient vignette, and accompanying CGM visual materials, were constructed from publicly available clinical datasets. Between October 2025 and February 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions (12 per case). In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions: clinical accuracy, guideline adherence, actionability, personalization, communication clarity, and empathy. Safety flags and perceived source labels were also recorded. The primary analysis used linear mixed-effects models with random intercepts for case and rater.
Results:
A total of 288 unique responses (144 CA and 144 clinician responses) were evaluated, generating 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points on a 5-point scale (95% CI 0.692-0.872; P<.001). This pattern held across all 6 categories of patient questions. The largest estimated differences were for empathy (mean difference 1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar between CA and clinician responses, with major concerns rare in both groups (3/432, 0.7% each). Although CA responses were longer, additional analyses adjusting for word count did not indicate that response length explained the overall quality difference.
Conclusions:
Retrieval-grounded LLM-based systems may have value as adjunct tools for routine CGM review, patient education, and preconsultation preparation, with potential to reduce clinician time spent on standardized interpretive tasks. However, these findings should be interpreted in light of the vignette-based design, restricted datasets, and a small clinician panel, and they do not establish suitability for autonomous therapeutic decision-making, medication adjustment, or unsupervised real-world use. Prospective validation in interactive clinical workflows is needed before implementation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be published under a CC BY license, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.