JMIR Preprints #66220: Two-layer retrieval augmented generation framework for low-resource medical question-answering using Reddit data: Proof of concept

Two-layer retrieval augmented generation framework for low-resource medical question-answering using Reddit data: Proof of concept

Sudeshna Das;
Yao Ge;
Yuting Guo;
Swati Rajwal;
JaMor Hairston;
Jeanne Powell;
Drew Walker;
Snigdha Peddireddy;
Sahithi Lakamana;
Selen Bozkurt;
Matthew Reyna;
Reza Sameni;
Yunyu Xiao;
Sangmi Kim;
Rasheeta Chandler;
Natalie Hernandez;
Danielle Mowery;
Rachel Wightman;
Jennifer Love;
Anthony Spadaro;
Jeanmarie Perrone;
Abeed Sarker

ABSTRACT

Background:

The increasing use of social media to share lived and living experiences of substance use presents a unique opportunity to obtain information on side-effects, usage patterns, and opinions on novel psychoactive substances (NPS). However, due to the large volume of data, obtaining useful insights through natural language processing (NLP) technologies such as large language models (LLMs) is challenging.

Objective:

To develop a retrieval-augmented generation (RAG) architecture for medical question answering that addresses clinicians’ queries on emerging health-related issues using user-generated medical information from social media.

Methods:

We proposed a two-layer RAG framework for query-focused answer generation and evaluated a proof of concept of the framework on query-focused summary generation from social media forums, focusing on emerging drug-related information. We compared the performance of a quantized LLM, deployable in low-resource settings, with that of GPT-4.
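
The abstract does not spell out what each of the two layers does, so the following is only a minimal sketch under stated assumptions: layer 1 is assumed to produce one query-focused summary per retrieved post, and layer 2 is assumed to merge those summaries into a single answer. The keyword-overlap retriever, the prompt wording, and the names `retrieve`, `two_layer_answer`, and `generate` are all hypothetical and stand in for whatever retriever and LLM (quantized or hosted) is actually used.

```python
from typing import Callable, List


def keyword_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, posts: List[str], top_k: int = 3) -> List[str]:
    """Return up to top_k posts that share at least one word with the query.

    Assumption: a real system would use a proper retriever (for example,
    embedding search); keyword overlap keeps this sketch dependency-free.
    """
    scored = [(keyword_overlap(query, post), post) for post in posts]
    relevant = [(score, post) for score, post in scored if score > 0]
    # Stable sort: ties keep their original order.
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in relevant[:top_k]]


def two_layer_answer(
    query: str,
    posts: List[str],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Hypothetical two-layer flow: per-post summaries, then one merged answer.

    `generate` is any function mapping a prompt string to model output.
    """
    retrieved = retrieve(query, posts, top_k)

    # Layer 1 (assumed): one query-focused summary per retrieved post.
    partial_summaries = [
        generate(
            f"Question: {query}\nPost: {post}\n"
            "Summarize only what is relevant to the question."
        )
        for post in retrieved
    ]

    # Layer 2 (assumed): merge the per-post summaries into a single answer.
    notes = "\n".join(f"- {summary}" for summary in partial_summaries)
    return generate(
        f"Question: {query}\nNotes:\n{notes}\n"
        "Write one answer using only these notes."
    )
```

Any callable can be passed as `generate`, so the same flow can be run once with a small local model and once with a larger hosted model to compare their outputs on identical retrieved posts.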

Results:

Over 20 queries with 52 samples, our framework achieved comparable median scores for relevance, length, hallucination, coverage, and coherence when using GPT-4 and Nous-Hermes-2-7B-DPO.

Conclusions:

Retrieval-augmented generation using LLMs is useful for medical question answering in resource-constrained settings.

Citation

Please cite as:

Das S, Ge Y, Guo Y, Rajwal S, Hairston J, Powell J, Walker D, Peddireddy S, Lakamana S, Bozkurt S, Reyna M, Sameni R, Xiao Y, Kim S, Chandler R, Hernandez N, Mowery D, Wightman R, Love J, Spadaro A, Perrone J, Sarker A

Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study

J Med Internet Res 2025;27:e66220

DOI: 10.2196/66220

PMID: 39761554

PMCID: 11747534

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 6, 2024

Date Accepted: Dec 5, 2024
