Currently submitted to: Journal of Medical Internet Research
Date Submitted: Jan 1, 2026
Open Peer Review Period: Jan 1, 2026 - Feb 26, 2026
Warning: This is an unreviewed preprint. Readers are warned that the document has not been peer-reviewed by expert/patient reviewers or an academic editor, may contain misleading claims, and is likely to undergo changes before final publication, if accepted, or may have been rejected/withdrawn (a note "no longer under consideration" will appear above).
Peer review me: Readers with interest and expertise are encouraged to sign up as peer reviewers while the paper is within an open peer-review period (in this case, a "Peer Review Me" button to sign up as a reviewer is displayed above). All preprints currently open for review are listed here. Outside of the formal open peer-review period, we encourage you to tweet about the preprint.
Citation: Please cite this preprint only for review purposes or for grant applications and CVs (if you are the author).
Final version: If our system detects a final peer-reviewed "version of record" (VoR) published in any journal, a link to that VoR will appear below. Readers are then encouraged to cite the VoR instead of this preprint.
Settings: If you are the author, you can log in and change the preprint display settings, but the preprint URL/DOI is intended to be stable and citable, so it should not be removed once posted.
Submit: To post your own preprint, simply submit to any JMIR journal, and choose the appropriate settings to expose your submitted version as preprint.
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Comparative Evaluation of Fine-Tuning, Retrieval-Augmented Generation, and Hybrid Large Language Models for Postoperative Clinical Decision Support
ABSTRACT
Background:
Large language models (LLMs) have shown growing potential for clinical decision support. However, effectively integrating domain-specific medical knowledge into LLMs while maintaining accuracy, safety, and interpretability remains a key challenge for postoperative discharge instructions and patient education. Fine-tuning (FT), retrieval-augmented generation (RAG), and hybrid FT+RAG approaches represent three prominent strategies for knowledge integration, yet their comparative performance in postoperative clinical contexts has not been systematically evaluated.
Objective:
We aimed to compare the clinical performance, reliability, and safety characteristics of baseline, fine-tuned, retrieval-augmented, and hybrid FT+RAG LLM configurations for postoperative clinical decision support.
Methods:
We conducted a controlled comparative evaluation of four LLM configurations using Google Gemini 2.5 Flash. A total of 600 postoperative question–answer pairs were used for model adaptation and validation, while 150 queries were reserved for final evaluation. Queries included routine postoperative care questions, emergency escalation scenarios, and deliberately out-of-scope questions. Model outputs were independently assessed by three blinded clinical experts for accuracy, completeness, and relevance. Automated metrics were used to evaluate readability, faithfulness, and hallucination propensity.
Results:
All knowledge-enhanced models significantly outperformed the baseline model in clinical accuracy (baseline 68.0% vs FT 92.7%, RAG 91.3%, FT+RAG 97.3%; p<.001). The hybrid FT+RAG model achieved the highest overall performance, including 100% precision, 96.7% recall, and the lowest hallucination rate. FT and RAG alone yielded comparable gains across accuracy, completeness, relevance, faithfulness, and hallucination reduction, with no statistically significant differences between them. While enhanced models produced shorter and more concise responses, they demonstrated reduced readability compared with the baseline model.
Conclusions:
Incorporating domain knowledge substantially improves the clinical performance of LLMs for postoperative decision support. Hybrid FT+RAG approaches provide the strongest overall accuracy and safety profile, although trade-offs in readability, interpretability, and rater variability remain. These findings support the use of knowledge-augmented LLMs in postoperative care while underscoring the need for careful governance, transparency, and human oversight prior to clinical deployment.
Clinical Trial: Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.