Comparative Evaluation of Commercial and Open-Source Large Language Models in a RAG-Based Chatbot for Antimicrobial Resistance Literature Analysis
ABSTRACT
Background:
Antimicrobial resistance (AMR) poses a critical global health threat, undermining the efficacy of antibiotics and complicating clinical decision-making. Although the scientific literature on AMR is extensive, retrieving and synthesizing relevant evidence remains time-consuming for clinicians and researchers. Recent advances in large language models (LLMs) offer opportunities to enhance access to domain-specific knowledge. However, the diversity of available models, from open-source to commercial, necessitates a systematic comparison of their performance, cost, and scalability in real-world biomedical applications.
Objective:
This study describes the development of a Retrieval-Augmented Generation (RAG) chatbot for AMR literature analysis and compares multiple commercial and open-source LLMs in terms of accuracy, faithfulness, response time, and cost-efficiency.
Methods:
A corpus of 164 peer-reviewed AMR-related articles was compiled from Google Scholar and embedded into a ChromaDB vector database using OpenAI’s text-embedding-ada-002. The RAG chatbot was configured to run with five LLM backbones: GPT-4, GPT-4o, GPT-4o-mini, Claude 3.7 Sonnet, and LLaMA 4 Maverick. For each model, a temperature ablation study was performed to identify the setting that yielded the best performance. Evaluation covered correctness (pass rate and score), faithfulness, relevance, computational cost, and latency, measured against a synthetic ground-truth dataset generated with GPT-4.
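To make the pipeline concrete, the following is a minimal sketch of the ingest-and-query loop described above, assuming the current chromadb and openai Python clients. The collection name, chunk handling, prompt wording, and the choice to show only an OpenAI-served backbone are illustrative assumptions, not details reported by the authors.

import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI

# Embed the AMR corpus into a persistent ChromaDB collection using
# OpenAI's text-embedding-ada-002, as described in the Methods.
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002",
)
db = chromadb.PersistentClient(path="./amr_chroma")
corpus = db.get_or_create_collection(name="amr_articles", embedding_function=ef)
corpus.add(
    ids=["article-001"],  # one ID per article (or per chunk, if chunked)
    documents=["Full text or chunk of an AMR article..."],
)

# Retrieve the most relevant passages for a question and ground the
# LLM's answer in them (retrieval-augmented generation).
question = "Which resistance mechanisms are reported for carbapenems?"
hits = corpus.query(query_texts=[question], n_results=5)
context = "\n\n".join(hits["documents"][0])

llm = OpenAI()  # reads OPENAI_API_KEY from the environment
response = llm.chat.completions.create(
    model="gpt-4o",   # one of the evaluated backbones; others need their own clients
    temperature=0.2,  # the value swept in the temperature ablation study
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)

In this setup, swapping the LLM backbone while holding the retriever fixed is what allows a like-for-like comparison of correctness, cost, and latency across models.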
Results:
All models generated scientifically justified responses when integrated into the RAG framework. GPT-4 achieved the highest correctness score (94.7%) but incurred the highest cost, while GPT-4o delivered nearly identical accuracy at roughly one-ninth the cost and had the fastest response time (3.88 s). LLaMA 4 Maverick and GPT-4o-mini offered lower accuracy but substantially reduced operational costs. Claude 3.7 Sonnet showed competitive accuracy but the least favorable cost-performance ratio. Qualitative analysis revealed differences in response style, detail, and structure among the models.
Conclusions:
A RAG-based chatbot can effectively support AMR research by delivering accurate, context-grounded, and scalable access to scientific literature. The comparative evaluation highlights trade-offs between performance, cost, and speed, guiding the selection of LLM architectures for clinical and research settings. Future work will focus on integrating language-specific embeddings and specialized domain agents to further enhance accuracy, adaptability, and clinical utility.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.