Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 14, 2024
Open Peer Review Period: Dec 15, 2024 - Feb 9, 2025
Date Accepted: May 12, 2025

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study

Yang H, Li M, Zhou H, Xiao Y, Fang Q, Zhou S, Zhang R

J Med Internet Res 2025;27:e70080

DOI: 10.2196/70080

PMID: 40658884

PMCID: 12337233

One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering

  • Han Yang; 
  • Mingchen Li; 
  • Huixue Zhou; 
  • Yongkang Xiao; 
  • Qian Fang; 
  • Shuang Zhou; 
  • Rui Zhang

ABSTRACT

Background:

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets. We benchmarked individual zero-shot LLMs (GPT-4, Llama2-13B, Vicuna-13B, MedLlama-13B, and MedAlpaca-13B) to assess their baseline performance. In this benchmark, GPT-4 achieved the best accuracy of 71% on MedMCQA, Vicuna-13B achieved 89.5% on PubMedQA, and MedAlpaca-13B achieved the best overall accuracy of 70%, showing that no single model performs best across all tasks and highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, which combine multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge.

Objective:

This study aims to develop and evaluate efficient ensemble learning approaches that improve performance across three medical QA datasets through two proposed ensemble strategies.

Methods:

Our study employs three medical QA datasets: PubMedQA (1,000 manually labeled + 11,269 test yes/no/maybe questions), MedQA-USMLE (12,724 English board-style questions; 1,272 test, five options), and MedMCQA (182,822 training / 4,183 test questions, four-option multiple-choice). We introduced the LLM-Synergy framework, consisting of two ensemble methods: (1) a Boosting-based Weighted Majority Vote ensemble, refining decision-making by adaptively weighting each LLM, and (2) a Cluster-based Dynamic Model Selection ensemble, dynamically selecting optimal LLMs for each query based on question-context embeddings and clustering.
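The two ensemble strategies described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the multiplicative weight-update rule, the learning rate, and all function and parameter names (`boost_weights`, `cluster_select`, `lr`, `rounds`) are assumptions for the sake of the sketch.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Combine per-model answers into one answer using model weights.
    predictions: dict model_name -> answer (e.g., "A", "B", "yes")
    weights: dict model_name -> float weight
    """
    scores = defaultdict(float)
    for model, answer in predictions.items():
        scores[answer] += weights.get(model, 0.0)
    return max(scores, key=scores.get)

def boost_weights(models, val_questions, val_answers, predict, rounds=3, lr=0.5):
    """Adaptively reweight models on a validation set: models that
    answer correctly gain weight, others lose it (a boosting-flavored
    heuristic; the exact update rule here is an assumption)."""
    weights = {m: 1.0 for m in models}
    for _ in range(rounds):
        for question, gold in zip(val_questions, val_answers):
            for m in models:
                if predict(m, question) == gold:
                    weights[m] *= (1 + lr)
                else:
                    weights[m] *= (1 - lr)
        total = sum(weights.values())
        weights = {m: w / total for m, w in weights.items()}
    return weights

def cluster_select(question_emb, cluster_centroids, best_model_per_cluster):
    """Dynamic model selection: route the query to the LLM that
    performed best on the validation cluster whose centroid is
    nearest to this question's embedding."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(cluster_centroids,
                  key=lambda c: sq_dist(question_emb, cluster_centroids[c]))
    return best_model_per_cluster[nearest]
```

For example, given per-model answers `{"gpt4": "A", "vicuna": "B", "medalpaca": "B"}` and weights `{"gpt4": 0.5, "vicuna": 0.3, "medalpaca": 0.3}`, the weighted vote returns "B" (0.6 vs 0.5). The clustering step would typically use question-context embeddings from a sentence encoder; plain lists of floats stand in for them here.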

Results:

Both ensemble methods outperformed individual LLMs across all three datasets. Compared with the best individual LLM, the Boosting-based Weighted Majority Vote achieved accuracies of 35.84% on MedMCQA (+3.81%), 96.21% on PubMedQA (+0.64%), and 37.26% on MedQA-USMLE (matching the best individual model). The Cluster-based Dynamic Model Selection yielded even higher accuracies of 38.01% (+5.98%) on MedMCQA, 96.36% (+1.09%) on PubMedQA, and 38.13% (+0.87%) on MedQA-USMLE.

Conclusions:

The LLM-Synergy framework, utilizing two ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. Through effectively combining the strengths of diverse LLMs, this framework provides a flexible and efficient strategy adaptable to current and future challenges in biomedical informatics.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.