Accepted for/Published in: JMIR Medical Education

Date Submitted: Dec 18, 2024
Date Accepted: Sep 30, 2025

The final, peer-reviewed published version of this preprint can be found here:

Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question Answering: Comparative Study

Wang D, Ye J, Li J, Liang J, Zhang Q, Hu Q, Pan C, Wang D, Liu Z, Shi W, Guo M, Li F, Zheng Y

Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question Answering: Comparative Study

JMIR Med Educ 2025;11:e70190

DOI: 10.2196/70190

PMID: 41329953

PMCID: 12709156

Warning: This is an author submission that has not been peer reviewed or edited. Preprints, unless they show as "accepted," should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Med-RISE: Enhancing Large Language Models for Improved Accuracy and Safety in Medical Question Answering: Comparative Study

  • Dingqiao Wang; 
  • Jinguo Ye; 
  • Jingni Li; 
  • Jiangbo Liang; 
  • Qikai Zhang; 
  • Qiuling Hu; 
  • Caineng Pan; 
  • Dongliang Wang; 
  • Zhong Liu; 
  • Wen Shi; 
  • Mengxiang Guo; 
  • Fei Li; 
  • Yingfeng Zheng

ABSTRACT

Background:

Large Language Models (LLMs) offer the potential to improve virtual patient-physician communication and reduce healthcare professionals' workload. However, limited accuracy, outdated knowledge, and safety concerns restrict their effective use in real clinical settings. Addressing these challenges is crucial for making LLMs a reliable healthcare tool.

Objective:

This study aims to evaluate the efficacy of Med-RISE, an information retrieval and augmentation tool, in comparison with baseline LLMs, focusing on enhancing accuracy and safety in medical question answering across diverse clinical domains.

Methods:

This comparative study introduces Med-RISE, an enhanced version of the Retrieval-Augmented Generation (RAG) framework designed to improve question-answering performance across wide-ranging medical domains and diverse disciplines. Med-RISE consists of four key steps: Query rewriting, Information retrieval (combining local and real-time retrieval), Summarization, and Execution (a fact and safety filter applied before output). The study integrated Med-RISE with four LLMs (GPT-3.5, GPT-4, Vicuna-13B, and ChatGLM-6B) and assessed their performance on four multiple-choice medical question datasets: MedQA (USMLE), PubMedQA (original and revised versions), MedMCQA, and EYE500. Primary outcome measures were answer accuracy and hallucination rates, with hallucinations categorized as factuality errors (inaccurate information) or faithfulness errors (inconsistency with instructions). The study was performed between March and August 2024.
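The four-step pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, the toy knowledge base, and the keyword-matching logic here are hypothetical stand-ins (a real system would use an LLM for rewriting and summarization, and a vector or web retriever for retrieval); the sketch only shows how the four stages compose.

```python
# Minimal sketch of a Med-RISE-style four-step QA pipeline:
# Query rewriting -> Information retrieval -> Summarization -> Execution.
# All names and logic are illustrative assumptions, not the paper's code.

# Toy local "knowledge base": keyword -> evidence snippet.
KNOWLEDGE_BASE = {
    "metformin": "Metformin is a first-line therapy for type 2 diabetes.",
    "glaucoma": "Glaucoma involves progressive optic-nerve damage.",
}

def rewrite_query(question: str) -> str:
    """Step 1: normalize the question (a real system uses an LLM rewriter)."""
    return question.lower().strip().rstrip("?")

def retrieve(query: str) -> list[str]:
    """Step 2: keyword lookup standing in for local + real-time retrieval."""
    return [text for key, text in KNOWLEDGE_BASE.items() if key in query]

def summarize(passages: list[str]) -> str:
    """Step 3: condense the retrieved evidence (here, simple concatenation)."""
    return " ".join(passages)

def execute(draft_answer: str, evidence: str) -> str:
    """Step 4: fact/safety filter -- block answers with no supporting evidence."""
    if not evidence:
        return "Insufficient evidence; deferring to a clinician."
    return f"{draft_answer} (supported by: {evidence})"

def answer_question(question: str, draft_answer: str) -> str:
    """Run all four stages in order on one question."""
    query = rewrite_query(question)
    evidence = summarize(retrieve(query))
    return execute(draft_answer, evidence)
```

The design point the sketch captures is that the Execution stage sits between the model's draft answer and the user, so unsupported answers can be suppressed rather than emitted, which is how a filter of this kind reduces factuality hallucinations.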

Results:

Integrating Med-RISE with each LLM led to a substantial increase in accuracy, with an average improvement of 13.0% across four datasets: MedQA (USMLE), PubMedQA (revised version), MedMCQA, and EYE500. The accuracy improvements were 16.3% for GPT-3.5, 12.9% for GPT-4, 13.0% for Vicuna-13B, and 9.9% for ChatGLM-6B. Additionally, Med-RISE reduced hallucinations by an average of 15.0%, with factuality hallucinations decreasing by 13.5% and faithfulness hallucinations by 5.8%. The average hallucination rate reductions were 17.6% for GPT-3.5, 12.8% for GPT-4, 18.0% for Vicuna-13B, and 11.8% for ChatGLM-6B.

Conclusions:

The Med-RISE framework significantly improves the accuracy and reduces the hallucinations of LLMs in medical question answering across benchmark datasets. By providing local and real-time information retrieval along with fact and safety filtering, Med-RISE enhances the reliability and interpretability of LLMs in the medical domain, offering a promising tool for clinical practice and decision support.



Per the author's request, the PDF is not available.