Comparative Evaluation of Commercial and Open-Source Large Language Models in a RAG-Based Chatbot for Antimicrobial Resistance Literature Analysis
ABSTRACT
Background:
Antimicrobial resistance (AMR) poses a critical global health threat, undermining the efficacy of antibiotics and complicating clinical decision-making. Although the scientific literature on AMR is extensive, retrieving and synthesizing relevant evidence remains time-consuming for clinicians and researchers. Recent advances in large language models (LLMs) offer opportunities to enhance access to domain-specific knowledge. However, the diversity of available models, from open-source to commercial, necessitates a systematic comparison of their performance, cost, and scalability in real-world biomedical applications.
Objective:
This study describes the development of a Retrieval-Augmented Generation (RAG) chatbot for AMR literature analysis and compares multiple commercial and open-source LLMs in terms of accuracy, faithfulness, response time, and cost-efficiency.
Methods:
A corpus of 164 peer-reviewed AMR-related articles was compiled from Google Scholar and embedded into a ChromaDB vector database using OpenAI’s text-embedding-ada-002. The RAG chatbot was configured to run with five LLM backbones: GPT-4, GPT-4o, GPT-4o-mini, Claude 3.7 Sonnet, and LLaMA 4 Maverick. For each model, a temperature ablation study was performed to identify the setting that yielded the best performance. Evaluation covered correctness (pass rate and score), faithfulness, relevance, computational cost, and latency, measured against a synthetic ground-truth dataset generated with GPT-4.
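To make the pipeline concrete, the following is a minimal sketch of the ingest-and-query loop described above, assuming the current chromadb and openai Python clients. The collection name, chunk handling, prompt wording, and the choice to show only an OpenAI-served backbone are illustrative assumptions, not details reported by the authors.

import chromadb
from chromadb.utils import embedding_functions
from openai import OpenAI

# Embed the AMR corpus into a persistent ChromaDB collection using
# OpenAI's text-embedding-ada-002, as described in the Methods.
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002",
)
db = chromadb.PersistentClient(path="./amr_chroma")
corpus = db.get_or_create_collection(name="amr_articles", embedding_function=ef)
corpus.add(
    ids=["article-001"],  # one ID per article (or per chunk, if chunked)
    documents=["Full text or chunk of an AMR article..."],
)

# Retrieve the most relevant passages for a question and ground the
# LLM's answer in them (retrieval-augmented generation).
question = "Which resistance mechanisms are reported for carbapenems?"
hits = corpus.query(query_texts=[question], n_results=5)
context = "\n\n".join(hits["documents"][0])

llm = OpenAI()  # reads OPENAI_API_KEY from the environment
response = llm.chat.completions.create(
    model="gpt-4o",   # one of the evaluated backbones; others need their own clients
    temperature=0.2,  # the value swept in the temperature ablation study
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)

In this setup, swapping the LLM backbone while holding the retriever fixed is what allows a like-for-like comparison of correctness, cost, and latency across models.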
Results:
All models generated scientifically justified responses when integrated into the RAG framework. GPT-4 achieved the highest correctness score (94.7%) but incurred the highest cost, while GPT-4o delivered nearly identical accuracy at roughly one-ninth the cost and had the fastest response time (3.88 s). LLaMA 4 Maverick and GPT-4o-mini offered lower accuracy but substantially reduced operational costs. Claude 3.7 Sonnet showed competitive accuracy but the least favorable cost-performance ratio. Qualitative analysis revealed differences in response style, detail, and structure among the models.
Conclusions:
A RAG-based chatbot can effectively support AMR research by delivering accurate, context-grounded, and scalable access to scientific literature. The comparative evaluation highlights trade-offs between performance, cost, and speed, guiding the selection of LLM architectures for clinical and research settings. Future work will focus on integrating language-specific embeddings and specialized domain agents to further enhance accuracy, adaptability, and clinical utility.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.