Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 14, 2024
Open Peer Review Period: Dec 15, 2024 - Feb 9, 2025
Date Accepted: May 12, 2025

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study

Yang H, Li M, Zhou H, Xiao Y, Fang Q, Zhou S, Zhang R

J Med Internet Res 2025;27:e70080

DOI: 10.2196/70080

PMID: 40658884

PMCID: 12337233

One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering

  • Han Yang; 
  • Mingchen Li; 
  • Huixue Zhou; 
  • Yongkang Xiao; 
  • Qian Fang; 
  • Shuang Zhou; 
  • Rui Zhang

ABSTRACT

Background:

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. In this benchmark, GPT-4 achieved the best accuracy of 71% on MedMCQA, Vicuna-13B achieved 89.5% on PubMedQA, and MedAlpaca-13B achieved the best overall accuracy of 70%, showing that no single model performs best across all tasks and highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, which combine multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge.

Objective:

This study aims to develop and evaluate efficient ensemble learning approaches that improve performance across three medical QA datasets through two proposed ensemble strategies.

Methods:

Our study employs three medical QA datasets: PubMedQA (1,000 manually labeled + 11,269 test yes/no/maybe questions), MedQA-USMLE (12,724 English board-style questions; 1,272 test, five options), and MedMCQA (182,822 training / 4,183 test questions, four-option multiple-choice). We introduced the LLM-Synergy framework, consisting of two ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, refining decision-making by adaptively weighting each LLM, and (2) a Cluster-based Dynamic Model Selection ensemble, dynamically selecting optimal LLMs for each query based on question-context embeddings and clustering.
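The two ensemble strategies described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the multiplicative weight-update rule, the learning rate, and all function and parameter names (`boost_weights`, `cluster_select`, `lr`, `rounds`) are assumptions for the sake of the sketch.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Combine per-model answers into one answer using model weights.
    predictions: dict model_name -> answer (e.g., "A", "B", "yes")
    weights: dict model_name -> float weight
    """
    scores = defaultdict(float)
    for model, answer in predictions.items():
        scores[answer] += weights.get(model, 0.0)
    return max(scores, key=scores.get)

def boost_weights(models, val_questions, val_answers, predict, rounds=3, lr=0.5):
    """Adaptively reweight models on a validation set: models that
    answer correctly gain weight, others lose it (a boosting-flavored
    heuristic; the exact update rule here is an assumption)."""
    weights = {m: 1.0 for m in models}
    for _ in range(rounds):
        for question, gold in zip(val_questions, val_answers):
            for m in models:
                if predict(m, question) == gold:
                    weights[m] *= (1 + lr)
                else:
                    weights[m] *= (1 - lr)
        total = sum(weights.values())
        weights = {m: w / total for m, w in weights.items()}
    return weights

def cluster_select(question_emb, cluster_centroids, best_model_per_cluster):
    """Dynamic model selection: route the query to the LLM that
    performed best on the validation cluster whose centroid is
    nearest to this question's embedding."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(cluster_centroids,
                  key=lambda c: sq_dist(question_emb, cluster_centroids[c]))
    return best_model_per_cluster[nearest]
```

For example, given per-model answers `{"gpt4": "A", "vicuna": "B", "medalpaca": "B"}` and weights `{"gpt4": 0.5, "vicuna": 0.3, "medalpaca": 0.3}`, the weighted vote returns "B" (0.6 vs 0.5). The clustering step would typically use question-context embeddings from a sentence encoder; plain lists of floats stand in for them here.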

Results:

Both ensemble methods outperformed individual LLMs across all three datasets. Compared with the best individual LLM, the Boosting-based Weighted Majority Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% on MedQA-USMLE (matching the best individual model). The Cluster-based Dynamic Model Selection yielded even higher accuracies of 38.01% (+5.98%) on MedMCQA, 96.36% (+1.09%) on PubMedQA, and 38.13% (+0.87%) on MedQA-USMLE.

Conclusions:

The LLM-Synergy framework, utilizing two ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. Through effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.