Accepted for/Published in: JMIR Medical Education
Date Submitted: Dec 18, 2024
Date Accepted: Sep 30, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Med-RISE: Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question-Answering: Comparative Study
ABSTRACT
Background:
Large Language Models (LLMs) offer the potential to improve virtual patient-physician communication and reduce healthcare professionals' workload. However, limitations in accuracy, outdated knowledge, and safety issues restrict their effective use in real clinical settings. Addressing these challenges is crucial for making LLMs a reliable healthcare tool.
Objective:
This study aims to evaluate the efficacy of Med-RISE, an information retrieval and augmentation tool, in comparison with baseline Large Language Models, focusing on enhancing accuracy and safety in medical question answering across diverse clinical domains.
Methods:
This comparative study introduces Med-RISE, an enhancement of the Retrieval-Augmented Generation (RAG) framework designed to improve question-answering performance across a wide range of medical domains and disciplines. Med-RISE comprises four steps: query rewriting, information retrieval (combining local and real-time sources), summarization, and execution (a fact and safety filter applied before output). The study integrated Med-RISE with four LLMs (GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B) and assessed their performance on four multiple-choice medical question datasets: MedQA (USMLE), PubMedQA (original and revised versions), MedMCQA, and EYE500. Primary outcome measures were answer accuracy and hallucination rate, with hallucinations categorized as factuality errors (inaccurate information) or faithfulness errors (inconsistency with instructions). The study was conducted between March and August 2024.
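The four steps above can be sketched as a simple pipeline. The code below is an illustrative minimal sketch only, not the authors' implementation: the corpus, function names, and filtering logic are all hypothetical, and a real system would call an actual LLM and query both a local knowledge base and real-time sources.

```python
# Illustrative four-step retrieval-augmentation pipeline in the spirit of
# Med-RISE. All names and data here are hypothetical placeholders.

LOCAL_CORPUS = {
    "metformin": "Metformin is a first-line oral agent for type 2 diabetes.",
    "glaucoma": "Glaucoma is characterized by progressive optic neuropathy.",
}

def rewrite_query(question: str) -> str:
    """Step 1 (query rewriting): normalize the question for retrieval."""
    return question.lower().rstrip("?").strip()

def retrieve(query: str) -> list:
    """Step 2 (information retrieval): keyword lookup over a local corpus.
    A full system would also query real-time sources."""
    return [text for key, text in LOCAL_CORPUS.items() if key in query]

def summarize(passages: list, max_chars: int = 200) -> str:
    """Step 3 (summarization): condense evidence before prompting the LLM."""
    return " ".join(passages)[:max_chars]

def execute(draft_answer: str, evidence: str) -> str:
    """Step 4 (execution): fact/safety filter -- withhold answers that
    lack supporting evidence."""
    if not evidence:
        return "Insufficient evidence to answer safely."
    return draft_answer

def answer(question: str, llm) -> str:
    """Run the full pipeline; `llm` is any callable prompt -> answer."""
    query = rewrite_query(question)
    evidence = summarize(retrieve(query))
    draft = llm(f"Evidence: {evidence}\nQuestion: {question}")
    return execute(draft, evidence)
```

The key design point mirrored here is that the execution step sits between the model and the user, so unsupported generations can be blocked before output rather than corrected afterward.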
Results:
Integrating Med-RISE with each LLM substantially increased accuracy, with an average improvement of 13.0% across four datasets: MedQA (USMLE), PubMedQA (revised version), MedMCQA, and EYE500. Accuracy improvements were 16.3% for GPT-3.5, 12.9% for GPT-4, 13.0% for Vicuna-13B, and 9.9% for ChatGLM-6B. Additionally, Med-RISE reduced hallucinations by 15.0% on average, with factuality hallucinations decreasing by 13.5% and faithfulness hallucinations by 5.8%. The average hallucination rate reductions were 17.6% for GPT-3.5, 12.8% for GPT-4, 18.0% for Vicuna-13B, and 11.8% for ChatGLM-6B.
Conclusions:
The Med-RISE framework significantly improves the accuracy and reduces the hallucinations of LLMs in medical question answering across benchmark datasets. By combining local and real-time information retrieval with fact and safety filtering, Med-RISE enhances the reliability and interpretability of LLMs in the medical domain, offering a promising tool for clinical practice and decision support.