Accepted for/Published in: JMIR AI

Date Submitted: Aug 29, 2025
Date Accepted: Oct 24, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluation of a Retrieval-Augmented Generation Chatbot for Antimicrobial Resistance Research: Comparative Analysis of Large Language Models

Escudero-Arnanz O, Valero-Méndez ME, Sánchez-Ramos N, Soguero-Ruíz C

JMIR AI 2026;5:e83206

DOI: 10.2196/83206

PMID: 41875309

Evaluation of a Retrieval-Augmented Generation Chatbot for Antimicrobial Resistance Research: Comparative Analysis of Large Language Models

  • Oscar Escudero-Arnanz
  • Manuel Eduardo Valero-Méndez
  • Noelia Sánchez-Ramos
  • Cristina Soguero-Ruíz

ABSTRACT

Background:

Antimicrobial resistance (AMR) poses a critical global health threat, undermining the efficacy of antibiotics and complicating clinical decision-making. Although scientific literature on AMR is extensive, retrieving and synthesizing relevant evidence remains time-consuming for clinicians and researchers. Recent advances in large language models (LLMs) offer opportunities to enhance access to domain-specific knowledge. However, the diversity of available models, ranging from open-source to commercial, necessitates a systematic comparison of their performance, cost, and scalability in real-world biomedical applications.

Objective:

This study describes the development of a Retrieval-Augmented Generation (RAG) chatbot for AMR literature analysis and compares multiple commercial and open-source LLMs in terms of accuracy, faithfulness, response time, and cost-efficiency.

Methods:

A corpus of 164 peer-reviewed AMR-related articles was compiled from Google Scholar and embedded into a ChromaDB vector database using OpenAI’s text-embedding-ada-002. The RAG chatbot was configured to run with five LLM backbones: GPT-4, GPT-4o, GPT-4o-mini, Claude 3.7 Sonnet, and LLaMA 4 Maverick. For each model, a temperature ablation study was performed to identify the setting that yielded the best performance. Evaluation metrics included correctness (pass rate and score), faithfulness, relevance, computational cost, and latency, measured against a synthetic ground truth dataset generated with GPT-4.
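The retrieve-then-generate flow described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for text-embedding-ada-002, the in-memory list stands in for the ChromaDB collection, and the names `embed`, `cosine`, `retrieve`, `build_prompt`, and the sample `CHUNKS` are all hypothetical placeholders rather than content from the study's corpus.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (the retrieval step)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the grounded prompt that would be sent to the chosen LLM backbone."""
    joined = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

# Illustrative placeholder chunks, not taken from the 164-article corpus.
CHUNKS = [
    "Carbapenem resistance in Klebsiella pneumoniae is often mediated by carbapenemase enzymes.",
    "Hand hygiene programs reduce hospital-acquired infections.",
    "Vector databases store embeddings for similarity search.",
]

if __name__ == "__main__":
    question = "What mediates carbapenem resistance?"
    print(build_prompt(question, retrieve(question, CHUNKS, k=1)))
```

In a real deployment, the prompt returned by `build_prompt` would be passed to one of the five LLM backbones, and the sampling temperature would be varied across runs to reproduce the ablation.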

Results:

All models generated scientifically justified responses when integrated into the RAG framework. GPT-4 achieved the highest correctness score (94.7%) but incurred the highest cost, while GPT-4o delivered nearly identical accuracy at a ninefold lower cost and had the fastest response time (3.88 s). LLaMA 4 Maverick and GPT-4o-mini offered lower accuracy but substantially reduced operational costs. Claude 3.7 Sonnet showed competitive accuracy but the least favorable cost-performance ratio. Qualitative analysis revealed differences in response style, detail, and structure among models.

Conclusions:

A RAG-based chatbot can effectively support AMR research by delivering accurate, context-grounded, and scalable access to scientific literature. The comparative evaluation highlights trade-offs between performance, cost, and speed, guiding the selection of LLM architectures for clinical and research settings. Future work will focus on integrating language-specific embeddings and specialized domain agents to further enhance accuracy, adaptability, and clinical utility.
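One possible way to make the performance, cost, and speed trade-off concrete is to keep only the models that no other model beats on all three criteria at once (a Pareto front). The sketch below is an assumption about how such a selection could be done, not a procedure reported in the study; the `ModelResult` type, the `dominates` and `pareto_front` helpers, and every number in `RESULTS` are hypothetical placeholders and do not correspond to the evaluated models.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelResult:
    name: str
    correctness: float  # fraction in [0, 1]; higher is better
    cost_usd: float     # cost of the evaluation run; lower is better
    latency_s: float    # mean response time in seconds; lower is better

def dominates(a: ModelResult, b: ModelResult) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (
        a.correctness >= b.correctness
        and a.cost_usd <= b.cost_usd
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.correctness > b.correctness
        or a.cost_usd < b.cost_usd
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better

def pareto_front(results: list[ModelResult]) -> list[ModelResult]:
    """Keep every result that is not dominated by any other result."""
    return [r for r in results if not any(dominates(o, r) for o in results)]

# Hypothetical placeholder values, for illustration only.
RESULTS = [
    ModelResult("model_a", correctness=0.95, cost_usd=9.0, latency_s=6.0),
    ModelResult("model_b", correctness=0.94, cost_usd=1.0, latency_s=4.0),
    ModelResult("model_c", correctness=0.90, cost_usd=1.5, latency_s=5.0),
    ModelResult("model_d", correctness=0.85, cost_usd=0.2, latency_s=4.5),
]

if __name__ == "__main__":
    for r in pareto_front(RESULTS):
        print(r.name)
```

With these placeholder values, `model_c` is dropped because `model_b` is more accurate, cheaper, and faster; the remaining candidates each represent a different trade-off, and the final choice depends on the budget and latency requirements of the setting.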



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.