Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 14, 2024
Open Peer Review Period: Dec 15, 2024 - Feb 9, 2025
Date Accepted: May 12, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
One LLM is not Enough: Harnessing the Power of Ensemble Learning for Medical Question Answering
ABSTRACT
Background:
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, including medical question-answering (QA). However, individual LLMs often exhibit varying performance across different medical QA datasets, highlighting the need for strategies that can harness their collective strengths. Ensemble learning methods, which combine multiple models to improve overall accuracy and reliability, offer a promising approach to address this challenge. In this study, we introduce the LLM-Synergy framework, employing two ensemble methods—Boosting-based Weighted Majority Vote and Cluster-based Dynamic Model Selection—to enhance performance across diverse medical QA tasks.
Objective:
To enhance the accuracy and reliability of medical QA tasks by developing efficient ensemble learning approaches built on LLM technologies. We focus on improving performance across diverse medical QA datasets through innovative ensemble strategies.
Methods:
Our study employs three medical QA datasets: PubMedQA, MedQA-USMLE, and MedMCQA, each presenting unique challenges in biomedical QA. The proposed LLM-Synergy framework, which relies exclusively on zero-shot LLMs, incorporates two primary ensemble methods. The first is a Boosting-based Weighted Majority Vote ensemble, in which a boosting algorithm assigns variable weights to different LLMs to expedite and refine decision-making. The second is Cluster-based Dynamic Model Selection, which uses a clustering technique to dynamically select the most suitable LLM votes for each query based on the characteristics of the question context.
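The two ensemble strategies described above can be illustrated with a minimal sketch. This is not the authors' implementation, which is not specified in this abstract; all function names, weight values, and data structures below are hypothetical, and the clustering/boosting steps that would learn the weights and cluster assignments are assumed to have already run.

```python
# Illustrative sketch of the two ensemble ideas; not the LLM-Synergy code.
from collections import Counter

def weighted_majority_vote(answers, weights):
    """Combine per-model answers using (e.g., boosting-learned) weights.

    answers: list of answer labels, one per LLM.
    weights: list of floats, one per LLM.
    Returns the answer with the highest total weight.
    """
    totals = Counter()
    for ans, w in zip(answers, weights):
        totals[ans] += w
    return totals.most_common(1)[0][0]

def dynamic_model_selection(question_vec, centroids, best_model_per_cluster,
                            model_answers):
    """Pick the answer from the model that performs best on the
    question cluster nearest to this question's embedding.

    question_vec: embedding of the incoming question.
    centroids: list of cluster-centroid vectors (from prior clustering).
    best_model_per_cluster: index of the best LLM for each cluster.
    model_answers: this question's answers, one per LLM.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = min(range(len(centroids)),
                  key=lambda i: sq_dist(question_vec, centroids[i]))
    return model_answers[best_model_per_cluster[nearest]]

# Example: three LLMs answer one multiple-choice question.
print(weighted_majority_vote(["B", "C", "B"], [0.5, 0.9, 0.7]))  # "B"
```

In the weighted vote, a single strong model (weight 0.9) can be outvoted by two weaker models that agree; in dynamic selection, only the model expected to be best for this type of question contributes its answer.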
Results:
Both the Weighted Majority Vote and Dynamic Model Selection methods outperform individual LLMs across the three medical QA datasets. Specifically, the Weighted Majority Vote achieves accuracies of 35.84%, 96.21%, and 37.26% on MedMCQA, PubMedQA, and MedQA-USMLE, respectively, while Dynamic Model Selection yields slightly higher accuracies of 38.01% on MedMCQA, 96.36% on PubMedQA, and 38.13% on MedQA-USMLE.
Conclusions:
The LLM-Synergy framework, with its two ensemble methods, represents a significant advancement in leveraging LLMs for medical QA tasks. The framework provides an innovative and efficient approach to utilizing LLM technologies, enabling customization for both current and future challenges in biomedical and health informatics research.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.