Previously submitted to: JMIR AI (no longer under consideration since Mar 27, 2025)

Date Submitted: Feb 26, 2024

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Using large language models to evaluate the offer of options in clinical encounters by focusing on an item of the Observer OPTION-5 measure of shared decision-making

  • Sai Prabhakar Pandi Selvaraj; 
  • Renata West Yen; 
  • Rachel Forcino; 
  • Glyn Elwyn

ABSTRACT

Introduction: Human assessment of clinical encounter recordings using observer-based measures of shared decision-making, such as Observer OPTION-5 (OO5), is expensive. In this study, we aimed to assess the potential of using large language models (LLMs) to automate the rating of the OO5 item focused on offering options (item 1).

Methods:

We used a dataset of 287 clinical encounter transcripts of women diagnosed with early-stage breast cancer talking with their surgeon to discuss treatments. Each transcript had been previously scored by two researchers using OO5 (0 to 4 scale). We set up two rule-based baselines, one random and one using trigger words, and classified option talk instances using GPT-3.5 Turbo, GPT-4, and PaLM 2. To develop and compare the performance of these models, we randomly selected 16 transcripts for additional human annotation focusing on option talk instances (binary). To assess performance, we calculated Spearman correlations (rS) between the researcher-generated item 1 scores for the remaining 271 transcripts and the option talk instances predicted by the LLMs.
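As a rough illustration of the classification step described above, the sketch below labels a single transcript excerpt as option talk (binary) using the OpenAI Python SDK. The prompt wording, the few-shot example, and the label names are illustrative assumptions, not the prompts actually used in the study.

```python
# Minimal sketch: few-shot binary labelling of a transcript excerpt as
# "option talk" (the clinician offering treatment options) with GPT-3.5 Turbo.
# The prompt text and the few-shot example below are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT_EXAMPLE = (
    "Excerpt: 'There are two main routes we could take: a lumpectomy with "
    "radiation, or a mastectomy.'\n"
    "Label: option_talk"
)

def label_option_talk(excerpt: str) -> str:
    """Return 'option_talk' or 'no_option_talk' for one transcript excerpt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You label clinical encounter excerpts. Answer with "
                        "exactly one word: option_talk or no_option_talk."},
            {"role": "user", "content": FEW_SHOT_EXAMPLE},
            {"role": "user", "content": f"Excerpt: '{excerpt}'\nLabel:"},
        ],
    )
    return response.choices[0].message.content.strip()
```

Counting the excerpts labelled option_talk per transcript would then yield the per-transcript instance counts compared against the researcher-assigned item 1 scores.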

Results:

We observed high levels of correlation between the LLM-predicted and researcher-generated scores. GPT-3.5 Turbo with a few-shot example had an rS=0.60 (P<.001) with the mean of the two scorers. Other LLMs had slightly lower correlation levels.

Discussion:

The LLMs, particularly GPT-3.5 Turbo with few-shot examples, outperformed the baseline models in identifying option talk instances, with GPT-3.5 Turbo achieving the highest precision and recall.
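For concreteness, the reported Spearman correlation can be computed as in the sketch below, which applies scipy.stats.spearmanr to two aligned per-transcript vectors; the values shown are placeholders, not study data.

```python
# Minimal sketch of the evaluation: Spearman correlation between
# researcher-assigned OO5 item 1 scores and LLM-predicted option talk counts.
# The numbers below are illustrative only.
from scipy.stats import spearmanr

researcher_item1 = [2, 3, 1, 4, 0, 2, 3]  # mean of the two human scorers, per transcript
llm_predicted = [3, 4, 1, 5, 0, 2, 4]     # option talk instances predicted per transcript

r_s, p_value = spearmanr(researcher_item1, llm_predicted)
print(f"rS = {r_s:.2f}, P = {p_value:.3f}")
```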

Conclusions:

Further improvements in score correlations may be possible as LLMs improve and as their behavior becomes better understood.


 Citation

Please cite as:

Pandi Selvaraj SP, Yen RW, Forcino R, Elwyn G

Using large language models to evaluate the offer of options in clinical encounters by focusing on an item of the Observer OPTION-5 measure of shared decision-making

JMIR Preprints. 26/02/2024:57790

DOI: 10.2196/preprints.57790

URL: https://preprints.jmir.org/preprint/57790


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.