Accepted for/Published in: JMIR Medical Education
Date Submitted: Dec 18, 2024
Date Accepted: Sep 30, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Med-RISE: Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question-Answering: Comparative Study
ABSTRACT
Background:
Large Language Models (LLMs) offer the potential to improve virtual patient-physician communication and reduce healthcare professionals' workload. However, limitations in accuracy, outdated knowledge, and safety issues restrict their effective use in real clinical settings. Addressing these challenges is crucial for making LLMs a reliable healthcare tool.
Objective:
This study aims to evaluate the efficacy of Med-RISE, an information retrieval and augmentation tool, in comparison with baseline Large Language Models, focusing on enhancing accuracy and safety in medical question answering across diverse clinical domains.
Methods:
This comparative study introduces Med-RISE, an enhancement of the Retrieval-Augmented Generation (RAG) framework designed to improve question-answering performance across a wide range of medical domains and disciplines. Med-RISE comprises four steps: query rewriting, information retrieval (combining local and real-time sources), summarization, and execution (a fact and safety filter applied before output). The study integrated Med-RISE with four LLMs (GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B) and assessed their performance on four multiple-choice medical question datasets: MedQA (USMLE), PubMedQA (original and revised versions), MedMCQA, and EYE500. Primary outcome measures were answer accuracy and hallucination rate, with hallucinations categorized as factuality errors (inaccurate information) or faithfulness errors (inconsistency with instructions). The study was conducted between March and August 2024.
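The four steps above can be sketched as a simple pipeline. The code below is an illustrative minimal sketch only, not the authors' implementation: the corpus, function names, and filtering logic are all hypothetical, and a real system would call an actual LLM and query both a local knowledge base and real-time sources.

```python
# Illustrative four-step retrieval-augmentation pipeline in the spirit of
# Med-RISE. All names and data here are hypothetical placeholders.

LOCAL_CORPUS = {
    "metformin": "Metformin is a first-line oral agent for type 2 diabetes.",
    "glaucoma": "Glaucoma is characterized by progressive optic neuropathy.",
}

def rewrite_query(question: str) -> str:
    """Step 1 (query rewriting): normalize the question for retrieval."""
    return question.lower().rstrip("?").strip()

def retrieve(query: str) -> list:
    """Step 2 (information retrieval): keyword lookup over a local corpus.
    A full system would also query real-time sources."""
    return [text for key, text in LOCAL_CORPUS.items() if key in query]

def summarize(passages: list, max_chars: int = 200) -> str:
    """Step 3 (summarization): condense evidence before prompting the LLM."""
    return " ".join(passages)[:max_chars]

def execute(draft_answer: str, evidence: str) -> str:
    """Step 4 (execution): fact/safety filter -- withhold answers that
    lack supporting evidence."""
    if not evidence:
        return "Insufficient evidence to answer safely."
    return draft_answer

def answer(question: str, llm) -> str:
    """Run the full pipeline; `llm` is any callable prompt -> answer."""
    query = rewrite_query(question)
    evidence = summarize(retrieve(query))
    draft = llm(f"Evidence: {evidence}\nQuestion: {question}")
    return execute(draft, evidence)
```

The key design point mirrored here is that the execution step sits between the model and the user, so unsupported generations can be blocked before output rather than corrected afterward.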
Results:
Integrating Med-RISE with each LLM substantially increased accuracy, with an average improvement of 13.0% across four datasets: MedQA (USMLE), PubMedQA (revised version), MedMCQA, and EYE500. Accuracy improvements were 16.3% for GPT-3.5, 12.9% for GPT-4, 13.0% for Vicuna-13B, and 9.9% for ChatGLM-6B. Additionally, Med-RISE reduced hallucinations by 15.0% on average, with factuality hallucinations decreasing by 13.5% and faithfulness hallucinations by 5.8%. The average hallucination rate reductions were 17.6% for GPT-3.5, 12.8% for GPT-4, 18.0% for Vicuna-13B, and 11.8% for ChatGLM-6B.
Conclusions:
The Med-RISE framework significantly improves the accuracy and reduces the hallucinations of LLMs in medical question answering across benchmark datasets. By combining local and real-time information retrieval with fact and safety filtering, Med-RISE enhances the reliability and interpretability of LLMs in the medical domain, offering a promising tool for clinical practice and decision support.