
Accepted for/Published in: JMIR Formative Research

Date Submitted: Sep 27, 2025
Date Accepted: Jan 28, 2026

The final, peer-reviewed published version of this preprint can be found here:

Retrieval-Augmented Generation for Medical Question Answering on a Heart Failure Dataset: Performance Analysis

Zhang S, Phan E, Velmovitsky P, Pham Q, Sanner S

JMIR Form Res 2026;10:e84932

DOI: 10.2196/84932

PMID: 41747226

PMCID: 12945362

Retrieval-Augmented Generation for Medical Question Answering on a Heart Failure Dataset: Performance Analysis

  • Shiran Zhang; 
  • Evelyn Phan; 
  • Pedro Velmovitsky; 
  • Quynh Pham; 
  • Scott Sanner

ABSTRACT

Background:

The integration of Retrieval-Augmented Generation (RAG) systems into the domain of medical question-answering (QA) presents a significant opportunity to enhance the effectiveness and accuracy of clinical support systems.

Objective:

This study aimed to explore the design choices within the RAG framework and the use of large language model (LLM) classifiers to optimize medical QA systems, enhancing response quality for patient and caregiver queries of varying risk levels.

Methods:

In total, we curated a dataset of 109 patient and caregiver questions related to heart failure from the website The Heart Hub, along with relevant documents and a target answer for each question. Questions were categorized into answerable (direct, fact-based queries), helpful deferral (general guidance or lifestyle advisory queries), and non-answerable (out-of-scope or high-risk/medical intervention queries) types. Applying a system architecture that leverages RAG with a structured query taxonomy and robust classification mechanisms, this paper provides an empirical assessment of medical QA on a heart failure dataset and introduces a question-answering system pipeline design, providing a foundation for extended application across various medical fields. Specifically, we evaluated design choices in the initial retrieval stage of RAG and their impact on performance. We assessed final answer quality from the generation stage using popular passage scoring methods for QA, such as ROUGE, BERTScore, and Intersection over Union (IoU) score.
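As a rough illustration of the three-way query taxonomy described above, the routing step ahead of retrieval could be sketched as follows. This is a minimal sketch, not the paper's implementation: the paper uses an LLM-based classifier, whereas the keyword lists, class names, and response strings here are hypothetical stand-ins.

```python
from enum import Enum

class QueryType(Enum):
    ANSWERABLE = "answerable"              # direct, fact-based queries
    HELPFUL_DEFERRAL = "helpful_deferral"  # general guidance / lifestyle advice
    NON_ANSWERABLE = "non_answerable"      # out-of-scope or high-risk queries

# Hypothetical keyword heuristics standing in for the paper's LLM classifier.
HIGH_RISK_TERMS = {"dosage", "overdose", "chest pain", "emergency"}
LIFESTYLE_TERMS = {"diet", "exercise", "sleep", "lifestyle"}

def classify_query(question: str) -> QueryType:
    """Route a patient/caregiver question into one of three taxonomy classes."""
    q = question.lower()
    if any(term in q for term in HIGH_RISK_TERMS):
        return QueryType.NON_ANSWERABLE
    if any(term in q for term in LIFESTYLE_TERMS):
        return QueryType.HELPFUL_DEFERRAL
    return QueryType.ANSWERABLE

def route(question: str) -> str:
    """Dispatch by class: answerable queries go to the RAG pipeline,
    deferrals get general guidance, high-risk queries are referred out."""
    kind = classify_query(question)
    if kind is QueryType.NON_ANSWERABLE:
        return "Please consult your care team; this question needs a clinician."
    if kind is QueryType.HELPFUL_DEFERRAL:
        return "General guidance (not medical advice): ..."
    return "RAG answer: ..."  # retrieve top documents, then generate
```

Routing before retrieval keeps high-risk questions from ever reaching the generation stage, which is consistent with the deferral behavior the taxonomy is designed to support.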

Results:

The pipeline first employs an LLM-based classifier, which achieved 65% accuracy for answerable and helpful deferral queries and 100% accuracy for identifying non-answerable queries. In information retrieval (IR), the Biomedical Contrastive Pre-trained Transformers (MedCPT) cross-encoder performed best as a dense retrieval method, delivering an average of 93% recall@7 by ranking relevance scores to obtain the top documents. For further retrieving snippets from those documents, its average performance was 72.5% for sentence-level snippets and 83% for paragraph-level snippets. A second LLM-based classifier, used to refine the generated responses, reduced ROUGE-1 recall by 13% and BERTScore precision by 11%. However, IoU scores, that is, the overlap between “gold answers” and system answers, increased by 24%, demonstrating enhanced alignment with ground truth responses and the system’s improved ability to generate concise and accurate medical responses.
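The retrieval and overlap metrics reported above have simple definitions that can be sketched directly. The sketch below assumes whitespace tokenization for the IoU computation and uses invented document IDs and answer strings for illustration; the paper's exact tokenization and scoring setup may differ.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 7) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list
    (recall@7 in the results corresponds to k=7)."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def token_iou(gold: str, system: str) -> float:
    """Intersection over Union of token sets between a gold answer and a
    system answer: |gold ∩ system| / |gold ∪ system|."""
    g = set(gold.lower().split())
    s = set(system.lower().split())
    union = g | s
    return len(g & s) / len(union) if union else 0.0
```

For example, if only one of two relevant documents appears in the top-7 list, recall@7 is 0.5; an identical gold and system answer yields an IoU of 1.0.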

Conclusions:

The implementation of a structured RAG framework paired with LLM classifiers for medical QA introduces a promising avenue for enhancing clinical decision support systems. By systematically analyzing the impact of query taxonomy, retrieval configurations, and response strategies, this approach clarifies the relative importance of each component within the medical RAG system using a heart failure dataset. Our findings provide actionable guidance on optimal design choices for maximizing retrieval and response accuracy, thus informing the development of robust, scalable medical QA systems.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.