Previously submitted to: JMIR AI (no longer under consideration since Mar 27, 2025)

Date Submitted: Feb 26, 2024

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Using large language models to evaluate the offer of options in clinical encounters by focusing on an item of the Observer OPTION-5 measure of shared decision-making

  • Sai Prabhakar Pandi Selvaraj; 
  • Renata West Yen; 
  • Rachel Forcino; 
  • Glyn Elwyn

ABSTRACT

Introduction: Human assessment of clinical encounter recordings using observer-based measures of shared decision-making, such as Observer OPTION-5 (OO5), is expensive. In this study, we aimed to assess the potential of using large language models (LLMs) to automate the rating of the OO5 item focused on offering options (item 1).

Methods:

We used a dataset of 287 clinical encounter transcripts of women diagnosed with early-stage breast cancer talking with their surgeon to discuss treatments. Each transcript had been previously scored by two researchers using OO5 (0 to 4 scale). We set up two rule-based baselines, one random and one using trigger words, and classified option talk instances using GPT-3.5 Turbo, GPT-4, and PaLM 2. To develop and compare the performance of these models, we randomly selected 16 transcripts for additional human annotation focusing on option talk instances (binary). To assess performance, we calculated Spearman correlations (rS) between the researcher-generated item 1 scores for the remaining 271 transcripts and the option talk instances predicted by the LLMs.
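As a rough illustration of the classification step described above, the sketch below labels a single transcript excerpt as option talk (binary) using the OpenAI Python SDK. The prompt wording, the few-shot example, and the label names are illustrative assumptions, not the prompts actually used in the study.

```python
# Minimal sketch: few-shot binary labelling of a transcript excerpt as
# "option talk" (the clinician offering treatment options) with GPT-3.5 Turbo.
# The prompt text and the few-shot example below are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLE = (
    "Excerpt: 'There are two main routes we could take: a lumpectomy with "
    "radiation, or a mastectomy.'\n"
    "Label: option_talk"
)

def label_option_talk(excerpt: str) -> str:
    """Return 'option_talk' or 'no_option_talk' for one transcript excerpt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You label clinical encounter excerpts. Answer with "
                        "exactly one word: option_talk or no_option_talk."},
            {"role": "user", "content": FEW_SHOT_EXAMPLE},
            {"role": "user", "content": f"Excerpt: '{excerpt}'\nLabel:"},
        ],
    )
    return response.choices[0].message.content.strip()
```

Counting the excerpts labelled option_talk per transcript would then yield the per-transcript instance counts compared against the researcher-assigned item 1 scores.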

Results:

We observed high levels of correlation between the LLM-predicted and researcher-generated scores. GPT-3.5 Turbo with a few-shot example had an rS=0.60 (P<.001) with the mean of the two scorers. Other LLMs had slightly lower correlation levels.

Discussion:

The LLMs, particularly GPT-3.5 Turbo with few-shot examples, outperformed the baseline models in identifying option talk instances, with GPT-3.5 Turbo achieving the highest precision and recall.
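For concreteness, the reported Spearman correlation can be computed as in the sketch below, which applies scipy.stats.spearmanr to two aligned per-transcript vectors; the values shown are placeholders, not study data.

```python
# Minimal sketch of the evaluation: Spearman correlation between
# researcher-assigned OO5 item 1 scores and LLM-predicted option talk counts.
# The numbers below are illustrative only.
from scipy.stats import spearmanr

researcher_item1 = [2, 3, 1, 4, 0, 2, 3]  # mean of the two human scorers, per transcript
llm_predicted = [3, 4, 1, 5, 0, 2, 4]     # option talk instances predicted per transcript

r_s, p_value = spearmanr(researcher_item1, llm_predicted)
print(f"rS = {r_s:.2f}, P = {p_value:.3f}")
```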

Conclusions:

Further improvements in score correlations may be possible as LLMs improve and as their behavior becomes better understood.


 Citation

Please cite as:

Pandi Selvaraj SP, Yen RW, Forcino R, Elwyn G

Using large language models to evaluate the offer of options in clinical encounters by focusing on an item of the Observer OPTION-5 measure of shared decision-making

JMIR Preprints. 26/02/2024:57790

DOI: 10.2196/preprints.57790

URL: https://preprints.jmir.org/preprint/57790


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.