Currently accepted at: Journal of Medical Internet Research

Date Submitted: Jan 1, 2026
Open Peer Review Period: Jan 1, 2026 - Feb 26, 2026
Date Accepted: Jun 10, 2026
Date Submitted to PubMed: Jun 19, 2026

This paper has been accepted and is currently in production.

It will appear shortly at DOI 10.2196/90692.

The version shown here is the final accepted version (not yet copyedited).

An "ahead-of-print" version has been submitted to PubMed (PMID: 42321146).

Fine-Tuning, Retrieval-Augmented Generation, and Hybrid Large Language Models for Postoperative Decision Support: A Comparative Analysis

  • Srinivasagam Prabha
  • Bernardo Gabriele Collaco
  • Cesar Abraham Gomez-Cabello
  • Syed Ali Haider
  • Ariana Genovese
  • Zhihui Fang
  • Nadia Wood
  • Sanjay Bagaria
  • Cui Tao
  • Antonio Jorge Forte

ABSTRACT

Background:

Large language models (LLMs) have shown growing potential for clinical decision support. However, effectively integrating domain-specific medical knowledge into LLMs while maintaining accuracy, safety, and interpretability remains a key challenge for postoperative discharge instructions and patient education. Fine-tuning (FT), retrieval-augmented generation (RAG), and hybrid FT+RAG approaches represent three prominent strategies for knowledge integration, yet their comparative performance in postoperative clinical contexts has not been systematically evaluated.

Objective:

We aimed to compare the clinical performance, reliability, and safety characteristics of baseline, fine-tuned, retrieval-augmented, and hybrid FT+RAG LLM configurations for postoperative clinical decision support.

Methods:

We conducted a controlled comparative evaluation of four LLM configurations using Google Gemini 2.5 Flash. A total of 600 postoperative question–answer pairs were used for model adaptation and validation, while 150 queries were reserved for final evaluation. Queries included routine postoperative care questions, emergency escalation scenarios, and deliberately out-of-scope questions. Model outputs were independently assessed by three blinded clinical experts for accuracy, completeness, and relevance. Automated metrics were used to evaluate readability, faithfulness, and hallucination propensity.

Results:

All knowledge-enhanced models significantly outperformed the baseline model in clinical accuracy (baseline 68.0% vs FT 92.7%, RAG 91.3%, FT+RAG 97.3%; p<.001). The hybrid FT+RAG model achieved the highest overall performance, including 100% precision, 96.7% recall, and the lowest hallucination rate. FT and RAG alone yielded comparable gains across accuracy, completeness, relevance, faithfulness, and hallucination reduction, with no statistically significant differences between them. While enhanced models produced shorter and more concise responses, they demonstrated reduced readability compared with the baseline model.
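
As a reading aid, the reported accuracy percentages can be converted back into approximate counts of correct responses. The sketch below assumes that each percentage was computed over the 150 held-out evaluation queries; the abstract does not state the denominator explicitly, so the counts are an inference rather than reported data. The names `N_EVAL`, `reported_accuracy`, and `implied_correct` are illustrative only.

```python
# Assumption: every accuracy figure is a fraction of the 150 evaluation queries.
N_EVAL = 150

# Accuracy percentages as reported in the Results paragraph.
reported_accuracy = {
    "baseline": 68.0,
    "FT": 92.7,
    "RAG": 91.3,
    "FT+RAG": 97.3,
}


def implied_correct(percent, n=N_EVAL):
    """Return the nearest whole number of correct responses implied by a percentage."""
    return round(percent / 100 * n)


implied_counts = {name: implied_correct(p) for name, p in reported_accuracy.items()}

for name, k in implied_counts.items():
    # Recompute the percentage from the integer count as a consistency check.
    print(f"{name}: {k}/{N_EVAL} = {100 * k / N_EVAL:.1f}%")
```

Under that assumption, the figures correspond to 102, 139, 137, and 146 correct responses out of 150, and each count rounds back to the reported percentage.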

Conclusions:

Incorporating domain knowledge substantially improves the clinical performance of LLMs for postoperative decision support. Hybrid FT+RAG approaches provide the strongest overall accuracy and safety profile, although trade-offs in readability, interpretability, and rater variability remain. These findings support the use of knowledge-augmented LLMs in postoperative care while underscoring the need for careful governance, transparency, and human oversight prior to clinical deployment.

Clinical Trial: Not applicable


Citation

Please cite as:

Prabha S, Collaco BG, Gomez-Cabello CA, Haider SA, Genovese A, Fang Z, Wood N, Bagaria S, Tao C, Forte AJ

Fine-Tuning, Retrieval-Augmented Generation, and Hybrid Large Language Models for Postoperative Decision Support: A Comparative Analysis

Journal of Medical Internet Research. 10/06/2026:90692 (forthcoming/in press)

DOI: 10.2196/90692

URL: https://preprints.jmir.org/preprint/90692

PMID: 42321146


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.