Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 4, 2024
Open Peer Review Period: Jun 4, 2024 - Jul 30, 2024
Date Accepted: Dec 20, 2024
Summarizing online patient conversations using generative language models
ABSTRACT
Background:
The World Wide Web, and online communities in particular, play an important role in supporting patients living with their condition. Methodologies for patient listening on social media have been proposed, and social media is acknowledged by regulatory bodies (e.g., the FDA) as an important source of patient experience data for learning about patients' unmet needs, priorities, and desired outcomes. Yet, most automatic methods for extracting patient experience data from online sources are quantitative and do not provide deeper qualitative insights. Thus, there is a lack of methods that support the generation of insights at large scale by automatically summarizing patient experiences shared in online fora.
Objective:
The objective of this study was to evaluate to what extent state-of-the-art large language models can be used to appropriately summarize posts shared by patients in online fora and health communities. The focus was on patients' experience of important disease burdens as well as of existing treatments. The study was conducted on posts from patients in 5 online fora related to breast cancer. In particular, the goals were to compare the performance of different language models and prompting strategies, and to investigate the feasibility of large language models as a way to generate qualitative insights about the patient experience.
Methods:
We applied three different language models (FlanT5, GPT-3, and GPT-3.5) to the task of summarizing posts from patients in online communities. The generated summaries were evaluated against 124 manually created reference summaries, using standard metrics for the evaluation of text generation methods. In particular, we tested the effect of fine-tuning and of 7 different prompting strategies: Zero-shot, One-shot, Three-shot, Zero-shot Directional Stimulus Prompting (DSP), One-shot DSP, Three-shot DSP, and Chain of Thought. As evaluation metrics, we used ROUGE and BERTScore to compare the automatically generated summaries to the manually created reference summaries.
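To illustrate the kind of lexical-overlap comparison the ROUGE family performs, the sketch below computes a minimal ROUGE-1 F1 score (unigram overlap between a candidate summary and a reference). This is a simplified, self-contained illustration, not the implementation used in the study; the function name and example sentences are hypothetical, and the official ROUGE implementations additionally apply stemming and support n-gram and longest-common-subsequence variants.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Minimal ROUGE-1 F1: unigram overlap between candidate and reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each candidate token counts at most as often as it
    # appears in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical example pair: 5 of 7 unigrams overlap, so F1 = 5/7.
ref = "the patient reported severe fatigue after chemotherapy"
cand = "the patient experienced severe fatigue following chemotherapy"
print(round(rouge1_f1(ref, cand), 3))  # → 0.714
```

BERTScore, by contrast, compares contextual embeddings of the two texts rather than surface tokens, which rewards paraphrases such as "experienced … following" above that pure token overlap would miss.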
Results:
In the zero-shot setting, GPT-3.5 performed better than the other models with respect to both the ROUGE metrics and BERTScore. While zero-shot prompting alone already performed well, overall GPT-3.5 with Three-shot DSP achieved the best results on the above-mentioned metrics. A manual investigation of the summaries produced by the best-performing method showed that they are accurate and plausible compared to the manual summaries.
Conclusions:
Our results suggest that state-of-the-art pre-trained large language models can be successfully used to summarize comments shared by patients in online communities to shed light on how patients experience their condition. Future work should investigate the problem of hallucinations and develop approaches to mitigate them. Future work could further investigate to what extent the proposed methods could be used to support the development and mapping of patient journeys.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.