Accepted for/Published in: JMIR Medical Education
Date Submitted: Sep 1, 2025
Date Accepted: May 8, 2026
Using Actual Clinical Transcripts to Evaluate Paediatric Behaviour Guidance of Students by Faculty and Large Language Models: A Pilot Comparative Study
ABSTRACT
Background:
Personalised feedback improves students’ clinical paediatric behaviour guidance performance but is prohibitively time-consuming to provide. Large Language Models (LLMs) can automate the evaluation of clinical sessions but are limited by text-only input and consistency issues.
Objective:
This study compared text-only transcripts with video recordings for evaluating the clinical behaviour guidance performance of dental students. Additionally, the consistency and accuracy of LLMs in evaluating the transcripts were compared against a human assessor.
Methods:
The study was conducted using video-recorded clinical encounters of final-year dental students managing patients aged 4 to 12 years at the Faculty of Dentistry, National University of Singapore. The videos were scored using a previously validated paediatric behaviour guidance scale. Forty clinical encounters were transcribed verbatim and scored by a study member using a modified version of the scale (non-verbal components removed). The time taken to rate the transcripts was recorded. Video scores were compared with transcript scores. Both free-to-use and paid versions of an LLM (ChatGPT) were used to score the transcripts; their consistency was evaluated and their scores were compared with those of the human assessor.
Results:
The average time taken to rate the transcripts (12 minutes) was significantly (p<0.001) less than the video length (73 minutes). Comparing transcript scores with video scores resulted in a consistency intraclass correlation coefficient (ICC) of 0.830 [95%CI: (0.679-0.910), p<0.001], demonstrating good reliability. Comparing transcript scores with LLM (free-to-use) and LLM (paid) scores yielded absolute agreement ICCs of 0.729 [95%CI: (0.475-0.859), p<0.001] and 0.670 [95%CI: (0.377-0.825), p<0.001] respectively, demonstrating moderate agreement. The LLMs were inconsistent, producing variable scores with the same prompt. The free-to-use and paid versions produced the same score for all three runs in only 12.5% and 10% of the clinical encounters respectively.
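For readers unfamiliar with the two ICC forms reported above, the following is a minimal sketch of how a consistency ICC and an absolute agreement ICC differ, together with the "same score for all runs" proportion. It assumes single-measure, two-way ICCs; the abstract does not state which ICC model or software the authors used, and the function names `icc_single` and `fraction_identical_runs` and the example scores are hypothetical, not the study data.

```python
def icc_single(scores):
    """Return (consistency, absolute_agreement) single-measure two-way ICCs.

    scores: list of rows, one row per encounter, one column per rater.
    Assumption: a two-way ANOVA decomposition; the study's exact ICC
    model is not stated in the abstract.
    """
    n = len(scores)       # number of encounters (subjects)
    k = len(scores[0])    # number of raters or scoring methods
    grand = sum(x for row in scores for x in row) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for encounters, raters, and residual error.
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * sum((c - grand) ** 2 for c in col_means)
    ss_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Consistency ignores systematic offsets between raters.
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    # Absolute agreement penalises systematic offsets via ms_cols.
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return consistency, agreement


def fraction_identical_runs(runs_per_encounter):
    """Fraction of encounters whose repeated LLM runs all gave one score."""
    same = sum(1 for runs in runs_per_encounter if len(set(runs)) == 1)
    return same / len(runs_per_encounter)


# Hypothetical example: rater 2 always scores 2 points higher than rater 1.
example = [[1, 3], [2, 4], [3, 5], [4, 6]]
consistency, agreement = icc_single(example)
```

In this hypothetical example the two raters rank the encounters identically, so the consistency ICC is 1, while the constant 2-point offset pulls the absolute agreement ICC well below 1. This is why the choice between the two forms matters when one scorer is systematically more lenient than another. If all forty transcribed encounters were scored three times by each LLM version, the reported 12.5% and 10% would correspond to 5 and 4 encounters respectively.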
Conclusions:
Using transcripts to evaluate students’ clinical behaviour guidance was time-saving for faculty, demonstrated good agreement with video-based evaluation, and could be used to improve clinical teaching. While LLMs could automate the task, their consistency and accuracy need to improve.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.